download notebook
view notebook w/o solutions
Multidimensional data in pandas
Files needed = ('nipa.xlsx', 'CPS_March_2016.csv')
We have covered some pandas basics and learned how to plot. Now let's sort out how to deal with more complex data. We will often find ourselves with data in which the unit of observation is complex. Pandas helps us deal with this by allowing for many index variables. So far, we have only used single indexing, but that is about to change.
Some examples that could use a multiIndex
- State and county
- Team and player
- Industry and firm
- Country (or person, firm,...) and time
That last one is important, and one that shows up a lot in economics. We call this panel data. Panel data is sometimes called longitudinal data. It follows the same firm/person/country over time.
MultiIndexes are important. Here are a few applications.
- When the multiIndex is set correctly, we can use methods such as
.loc[]
in more powerful ways to retrieve subsets of data. - The multiIndex is important when we want to "reshape" DataFrames to create Dataframes with observations as rows and variables as columns.
- The multiIndex helps us keep our data neat and organized.
import pandas as pd # load pandas and shorten it to pd
import matplotlib.pyplot as plt # for making figures
m = {'name':['madison', 'madison', 'madison', 'seattle', 'seattle', 'seattle', 'phoenix', 'phoenix', 'phoenix'],
'year':[2020, 2021, 2022,2020, 2021, 2022,2020, 2021, 2022 ],
'temp-aug':[80,84,86,68,66,70,108,109,110],
'pop':[267,268,269,735,734,733,4850,4900,4948]
}
city = pd.DataFrame(m)
city
name | year | temp-aug | pop | |
---|---|---|---|---|
0 | madison | 2020 | 80 | 267 |
1 | madison | 2021 | 84 | 268 |
2 | madison | 2022 | 86 | 269 |
3 | seattle | 2020 | 68 | 735 |
4 | seattle | 2021 | 66 | 734 |
5 | seattle | 2022 | 70 | 733 |
6 | phoenix | 2020 | 108 | 4850 |
7 | phoenix | 2021 | 109 | 4900 |
8 | phoenix | 2022 | 110 | 4948 |
Multiple indexing
The key to working with more complex datasets is getting the index right. So far, we have considered a single index, but pandas allows for multiple indexes that nest each other.
Key concept: Hierarchical indexing takes multiple levels of indexes.
Let's set up the DataFrame to take name
and year
as the indexes.
city.set_index(['name', 'year'], inplace=True)
city
temp-aug | pop | ||
---|---|---|---|
name | year | ||
madison | 2020 | 80 | 267 |
2021 | 84 | 268 | |
2022 | 86 | 269 | |
seattle | 2020 | 68 | 735 |
2021 | 66 | 734 | |
2022 | 70 | 733 | |
phoenix | 2020 | 108 | 4850 |
2021 | 109 | 4900 | |
2022 | 110 | 4948 |
Wow.
Notice that the set_index()
method is the same one we used earlier with single indexes. In this case, we passed it a list of variables to make the indexes
city.set_index(['name', 'year'], inplace=True)
In the output, the highest level of the index is name (we passed it 'name' first in the list) and the second level is year. The output does not repeat the city name for each observation. The 'missing' city name just means that the city is the same as above. [A very Tufte-esque removal of unnecessary ink.]
Let's take a look under the hood. What's our index? A new kind of object: the MultiIndex
print(city.index)
MultiIndex([('madison', 2020),
('madison', 2021),
('madison', 2022),
('seattle', 2020),
('seattle', 2021),
('seattle', 2022),
('phoenix', 2020),
('phoenix', 2021),
('phoenix', 2022)],
names=['name', 'year'])
Subsetting with multiple indexes
With a multiIndex, we need two arguments to reference observations. Notice that I am using a tuple to pass the two values of the multiIndex.
# All the Madison data
city.loc[('madison', 2022),:]
temp-aug 86
pop 269
Name: (madison, 2022), dtype: int64
Partial indexing
With the indexes set, we can easily subset the data using only one of the indexes. In pandas, this is called partial indexing because we are only using part of the index to identify the data we want.
We use the xs()
method of DataFrame. Here we specify which level we are looking into. Note that I can reference the levels either by an integer or by its name.
# Get all of the 2021 observations.
city.xs(2021, level = 1)
temp-aug | pop | |
---|---|---|
name | ||
madison | 84 | 268 |
seattle | 66 | 734 |
phoenix | 109 | 4900 |
# Get all of the 2022 observations.
city.xs(2022, level = 'year')
temp-aug | pop | |
---|---|---|
name | ||
madison | 86 | 269 |
seattle | 70 | 733 |
phoenix | 110 | 4948 |
We just took a slice from our data for just one year. This is called a cross section. The function we used is called xs
for "x section" or "cross section."
Notice that .xs()
dropped the index level that we chose from. If we want to keep that level we use the drop_level
option.
# Get all of the 2022 observations. Keep the year variable.
city.xs(2022, level = 'year', drop_level=False)
temp-aug | pop | ||
---|---|---|---|
name | year | ||
madison | 2022 | 86 | 269 |
seattle | 2022 | 70 | 733 |
phoenix | 2022 | 110 | 4948 |
With xs()
, we can partially index on the 'outer index' as well. Suppose we want all the Madison observations.
# Get all of the Madison observations.
city.xs('madison', level = 'name', drop_level=False)
temp-aug | pop | ||
---|---|---|---|
name | year | ||
madison | 2020 | 80 | 267 |
2021 | 84 | 268 | |
2022 | 86 | 269 |
Limits to .xs()
.xs()
can only search for one value in a level. This code, which looks reasonable, will not work.
city.xs(['madison', 'seattle'], level='name')
will return an error. I'm not sure why this is not allowed.
Resetting the multiIndex
As with a single index, we can get rid of the multIndex and replace it with a generic list of integers. This adds the index levels back into the DataFrame as columns.
# I'm intentionally using inplace=False because I do not want to reset the index permanently.
# This code will just display a copy of the dataFrame.
city.reset_index(inplace=False)
name | year | temp-aug | pop | |
---|---|---|---|---|
0 | madison | 2020 | 80 | 267 |
1 | madison | 2021 | 84 | 268 |
2 | madison | 2022 | 86 | 269 |
3 | seattle | 2020 | 68 | 735 |
4 | seattle | 2021 | 66 | 734 |
5 | seattle | 2022 | 70 | 733 |
6 | phoenix | 2020 | 108 | 4850 |
7 | phoenix | 2021 | 109 | 4900 |
8 | phoenix | 2022 | 110 | 4948 |
Saving multiIndex DataFrames
Saving a multiIndexed DataFrame works like before. Pandas fills in all the repeated labels, so the output is ready to go. Run the following code and then open the csv files.
# Multiple indexes on rows
city.to_csv('city.csv', index=True)
Reading multiIndex DataFrames
We can set up the multiIndex as we read in a file, too.
If the multiIndex is on the rows, pass index_col
a list of column names.
city_readin = pd.read_csv('city.csv', index_col=['name', 'year'])
city_readin
temp-aug | pop | ||
---|---|---|---|
name | year | ||
madison | 2020 | 80 | 267 |
2021 | 84 | 268 | |
2022 | 86 | 269 | |
seattle | 2020 | 68 | 735 |
2021 | 66 | 734 | |
2022 | 70 | 733 | |
phoenix | 2020 | 108 | 4850 |
2021 | 109 | 4900 | |
2022 | 110 | 4948 |
Practice:
Use the city_readin DataFrame from above to answer these questions.
- What was the August temperature in Phoenix in 2022? Use
.loc[]
to return the temperature.
city_readin.loc[('phoenix', 2022), 'temp-aug']
110
We can use .loc[]
to partially index only the outermost row.
Unlike .xs()
we can pass it lists.
- Try
city_readin.loc['madison']
what does it return?
city_readin.loc['madison']
temp-aug | pop | |
---|---|---|
year | ||
2020 | 80 | 267 |
2021 | 84 | 268 |
2022 | 86 | 269 |
- Return all of the observations for Madison and Seattle. Use
.loc[]
.
city_readin.loc[['madison', 'seattle']]
temp-aug | pop | ||
---|---|---|---|
name | year | ||
madison | 2020 | 80 | 267 |
2021 | 84 | 268 | |
2022 | 86 | 269 | |
seattle | 2020 | 68 | 735 |
2021 | 66 | 734 | |
2022 | 70 | 733 |
- Return a DataFrame containing only the 2021 observations.
city_readin.xs(2021, level='year')
temp-aug | pop | |
---|---|---|
name | ||
madison | 84 | 268 |
seattle | 66 | 734 |
phoenix | 109 | 4900 |
Summary statistics by level
MultiIndexes provide a quick way to summarize data. We will see many different ways to do this — getting statistics by groups — and not all will involve a multiIndex.
# Compute average August temperature for Madison. Using xs.
'Madison\'s average August temperature: {:3.2f}'.format(city_readin.xs('madison', level='name')['temp-aug'].mean())
"Madison's average August temperature: 83.33"
Notice the syntax with xs.
city_readin.xs('madison', level='name')['temp-aug']
The city_readin.xs('madison', level='name')
is returning a DataFrame with all the columns. [Try it!]
We then use the usual square-bracket syntax to pick off just the column 'temp-aug' and then hit it with .mean()
.
# Compute average August temperature for Madison. Using loc, since name is the outermost index.
'Madison\'s average August temperature: {:3.2f}'.format(city_readin.loc['madison','temp-aug'].mean())
"Madison's average August temperature: 83.33"
# Compute average August temperature (across all cities) for 2021.
'2021 average August temperature: {:3.2f}'.format(city_readin.xs(2021, level='year')['temp-aug'].mean())
'2021 average August temperature: 86.33'
A multiIndex in columns
There is nothing that says you can't have multiple indexes in the axis=1
dimension. It can be a bit more confusing, especially when we are reading in a file.
Open up "nipa.xlsx" in Excel and take a look.
If the multiIndex is on the columns, pass header
a list of line numbers (integers). We also need to set the (row) index at the same time.
# Do not set the index. What is the name of the first column?
nipa = pd.read_excel('nipa.xlsx', header=[0,1])
nipa
Unit | Nominal | Real | |||
---|---|---|---|---|---|
Var | GDP | INV | GDP | INV | |
0 | 1990 | 5963.14450 | 993.44900 | 9371.46800 | 1223.03525 |
1 | 2000 | 10250.95200 | 2038.40800 | 13138.03525 | 2346.73125 |
2 | 2010 | 15048.97000 | 2165.47275 | 15648.99100 | 2216.47775 |
3 | 2020 | 21060.47425 | 3642.92525 | 18509.14275 | 3306.47325 |
nipa.columns
MultiIndex([( 'Unit', 'Var'),
('Nominal', 'GDP'),
('Nominal', 'INV'),
( 'Real', 'GDP'),
( 'Real', 'INV')],
)
nipa = pd.read_excel('nipa.xlsx', header=[0,1], index_col=0)
nipa
Unit | Nominal | Real | ||
---|---|---|---|---|
Var | GDP | INV | GDP | INV |
1990 | 5963.14450 | 993.44900 | 9371.46800 | 1223.03525 |
2000 | 10250.95200 | 2038.40800 | 13138.03525 | 2346.73125 |
2010 | 15048.97000 | 2165.47275 | 15648.99100 | 2216.47775 |
2020 | 21060.47425 | 3642.92525 | 18509.14275 | 3306.47325 |
nipa.axes
[Int64Index([1990, 2000, 2010, 2020], dtype='int64'),
MultiIndex([('Nominal', 'GDP'),
('Nominal', 'INV'),
( 'Real', 'GDP'),
( 'Real', 'INV')],
names=['Unit', 'Var'])]
We ask for columns with a multiIndex the same way we ask for rows.
Let's get nominal GDP.
nipa[('Nominal', 'GDP')]
1990 5963.14450
2000 10250.95200
2010 15048.97000
2020 21060.47425
Name: (Nominal, GDP), dtype: float64
Let's get only the nominal variables.
nipa['Nominal']
Var | GDP | INV |
---|---|---|
1990 | 5963.14450 | 993.44900 |
2000 | 10250.95200 | 2038.40800 |
2010 | 15048.97000 | 2165.47275 |
2020 | 21060.47425 | 3642.92525 |
Partial indexing with .xs()
works, too. But we have to be careful.
Uncomment the code below and run it. What is the code trying to do? What is wrong? How can we fix it?
# nipa.xs('Nominal', level=0)
For many methods of DataFrame
the default axis is 0. You can always check the documentation.
Practice
Let's take data from the Current Population Survey, which surveys about 60,000 households each month. We will compute some average wages. This is the survey used to produce the official unemployment rate measures for the United States and many more labor-market indicators.
We will need to clean up a bit, then work with a multiIndex. Think of this as a mini-project.
The unit of observation is a person. The variables are:
hrwage
: hourly wageeduc
: education levelfemale
: 1 if female, 0 if not-
fulltimely
: 1 if worked full time, 0 if not -
Load the March CPS data, 'CPS_March_2016.csv'. Note: the missing values are '.'
cps = pd.read_csv('CPS_March_2016.csv', na_values = '.')
cps.head(20)
hrwage | educ | female | fulltimely | |
---|---|---|---|---|
0 | 20.961538 | Some college | 0 | 1.0 |
1 | 20.192308 | HS diploma/GED | 1 | 1.0 |
2 | 6.410256 | Some college | 0 | 0.0 |
3 | NaN | Less than HS | 0 | NaN |
4 | NaN | Some college | 0 | NaN |
5 | NaN | HS diploma/GED | 1 | NaN |
6 | 14.285714 | HS diploma/GED | 1 | 1.0 |
7 | 0.000000 | Some college | 0 | 0.0 |
8 | NaN | HS diploma/GED | 1 | NaN |
9 | NaN | College degree | 0 | NaN |
10 | 18.269230 | Some college | 1 | 1.0 |
11 | NaN | HS diploma/GED | 1 | NaN |
12 | 59.523810 | Graduate degree | 0 | 1.0 |
13 | 18.367348 | College degree | 1 | 1.0 |
14 | 8.653846 | HS diploma/GED | 0 | 1.0 |
15 | 59.136210 | Graduate degree | 0 | 1.0 |
16 | NaN | College degree | 1 | NaN |
17 | NaN | Less than HS | 1 | NaN |
18 | NaN | Less than HS | 0 | NaN |
19 | NaN | NaN | 1 | NaN |
- Keep only those with
fulltimely == 1
- Keep only those with
5 <= hrwage <= 200
# Keep individuals who worked full time last year.
cps = cps[ cps['fulltimely'] == 1 ]
# Keep individuals with wages between $5 and $200.
cps = cps[cps['hrwage'] <= 200]
cps = cps[cps['hrwage'] >= 5]
# or
# cps = cps[ (cps['hrwage']<=200) & (cps['hrwage']>=5) & (cps['fulltimely'] == 1) ]
cps.sample(10)
hrwage | educ | female | fulltimely | |
---|---|---|---|---|
89952 | 9.134615 | HS diploma/GED | 1 | 1.0 |
21297 | 24.038462 | College degree | 0 | 1.0 |
82617 | 15.277778 | HS diploma/GED | 1 | 1.0 |
163386 | 157.342650 | Graduate degree | 0 | 1.0 |
8982 | 21.634615 | College degree | 1 | 1.0 |
50099 | 35.096153 | College degree | 1 | 1.0 |
131172 | 7.912088 | Some college | 0 | 1.0 |
94329 | 16.483517 | HS diploma/GED | 0 | 1.0 |
104462 | 18.750000 | HS diploma/GED | 0 | 1.0 |
10993 | 10.989011 | HS diploma/GED | 1 | 1.0 |
- Set the index to 'female' and 'educ', in that order.
- Sort the index.
cps.set_index(['female', 'educ'], inplace=True)
cps.sort_index(axis=0, inplace = True)
cps.sample(10)
hrwage | fulltimely | ||
---|---|---|---|
female | educ | ||
1 | HS diploma/GED | 6.119951 | 1.0 |
0 | College degree | 18.269230 | 1.0 |
1 | Less than HS | 17.482517 | 1.0 |
0 | Some college | 16.071428 | 1.0 |
Some college | 14.423077 | 1.0 | |
Less than HS | 5.388889 | 1.0 | |
Some college | 34.155000 | 1.0 | |
1 | Graduate degree | 28.846153 | 1.0 |
0 | HS diploma/GED | 14.102564 | 1.0 |
HS diploma/GED | 6.410256 | 1.0 |
- Report the average wage for males and females.
# Using xs()
avg_wage_f = cps.xs(1, level='female')['hrwage'].mean()
avg_wage_m = cps.xs(0, level='female')['hrwage'].mean()
print('Average wage of females is ${0:.2f} and males is ${1:.2f}.'.format(avg_wage_f, avg_wage_m) )
Average wage of females is $22.75 and males is $28.31.
# We can partial index using .loc[] ONLY on the outermost index. We can always use .xs().
avg_wage_f = cps.loc[1, 'hrwage'].mean() # Looking for 1 in the outermost index
avg_wage_m = cps.loc[0, 'hrwage'].mean() # Looking for 0 in the outermost index
print('Average wage of females is ${0:.2f} and males is ${1:.2f}.'.format(avg_wage_f, avg_wage_m) )
Average wage of females is $22.75 and males is $28.31.
- Report the average wage for
HS diploma/GED
and forCollege degree
, regardless of gender.
# We cannot use the .loc[] approach here because 'educ' is not the outermost index.
avg_wage_hs = cps.xs('HS diploma/GED', level='educ')['hrwage'].mean()
avg_wage_cd = cps.xs('College degree', level='educ')['hrwage'].mean()
print('Average wage of high school is ${0:.2f} and college is ${1:.2f}.'.format(avg_wage_hs, avg_wage_cd) )
Average wage of high school is $19.11 and college is $31.96.