Groupby
files needed = ('Most-Recent-Cohorts-Institution.zip')
We often want to know how groups differ. Do workers with econ degrees make more than workers with history degrees? Do men live longer than women? Does it matter how much education you have?
Pandas provides the groupby() method to ease computing statistics by group (docs). This kind of method shows up in many data-oriented computing languages and packages. The idea is summed up as
split-apply-combine
Here is the canonical illustration. The big idea is to
- Split the data up into groups. The groups are defined by key variables.
- Apply some method or function to each group: mean, std, max, etc. This returns a smaller bit of data, often just one number.
- Combine the results of the 'apply' from each group into a new data structure.
Split-apply-combine is an incredibly powerful feature of pandas. We will cover the basics here.
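As a quick preview, here is split-apply-combine on a tiny made-up DataFrame (the data are invented for illustration):

```python
import pandas as pd

# A tiny made-up dataset: degrees and earnings
toy = pd.DataFrame({'degree': ['econ', 'hist', 'econ', 'hist'],
                    'earn':   [60, 50, 70, 40]})

# Split by degree, apply the mean to each group, combine into a new DataFrame
by_degree = toy.groupby('degree').mean()
print(by_degree)
#         earn
# degree
# econ    65.0
# hist    45.0
```

Each group (econ, hist) collapses to a single number per column; the results are combined into a DataFrame indexed by the grouping variable.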
import pandas as pd
import matplotlib.pyplot as plt
College Scorecard
Let's take this opportunity to learn about a new dataset: The College Scorecard. The data are compiled by the Dept. of Education to help students evaluate higher education institutions. The data are very well documented and include such juicy variables as prices, after-program debt levels, earnings, completion rates, and information about student outcomes by family income and other demographic variables.
We will be working off of the 'institution-level' data file.
[There is an API, too, but we would all have to sign up for a key.]
The data set is large. There are thousands of variables. I am creating a dict that holds the variable name in the data set and the name I will use in my DataFrame. This will come in handy...
How did I figure out the variable names and definitions? I read the documentation.
variables = {'CONTROL':'ownership', 'INSTNM':'name',
'STABBR':'state', 'PREDDEG':'type',
'SATVRMID':'sat_read_med', 'SATMTMID':'sat_math_med', 'SATWRMID':'sat_write_med',
'PCIP52':'sh_bus', 'PCIP11':'sh_cs', 'PCIP14':'sh_eng', 'PCIP26':'sh_bio',
'PCIP45':'sh_ss', 'PCIP51':'sh_health', 'PCIP09':'sh_com',
'MD_EARN_WNE_P10':'earn_10', 'GRAD_DEBT_MDN_SUPP':'debt_at_grad'}
I'm doing a couple of things here that we may not have discussed before.
- I'm reading the zip file without first decompressing it. Since there is only one file in the zipped package, pandas reads the file into a DataFrame. Pandas tries to guess the compression format. You can use the compression argument if it isn't able to infer it.
- I'm using the .keys() method of dict to provide the names of the columns that I want loaded. This saves me the trouble of loading all the columns and then keeping only the ones I want. [This is the first place I'll use the variables dict.]
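If pandas cannot infer the compression, you can name it explicitly. A small self-contained sketch (the file path here is invented for illustration; we write a gzipped CSV and read it back):

```python
import os
import tempfile

import pandas as pd

# Hypothetical path, just for this demo
path = os.path.join(tempfile.gettempdir(), 'tiny.csv.gz')

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv(path, index=False, compression='gzip')

# compression='infer' (the default) would also work here,
# since the .gz extension gives the format away
back = pd.read_csv(path, compression='gzip')
```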
colscd = pd.read_csv('Most-Recent-Cohorts-Institution.zip',
na_values=['PrivacySuppressed', 'NULL'],
usecols=variables.keys())
We are loading a new data set. Let's do our checks.
Here, I'm using the .info() method of DataFrame. It gives me the shape of the DataFrame and the dtypes.
colscd.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6543 entries, 0 to 6542
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INSTNM 6543 non-null object
1 STABBR 6543 non-null object
2 PREDDEG 6543 non-null int64
3 CONTROL 6543 non-null int64
4 SATVRMID 982 non-null float64
5 SATMTMID 982 non-null float64
6 SATWRMID 702 non-null float64
7 PCIP09 5765 non-null float64
8 PCIP11 5765 non-null float64
9 PCIP14 5765 non-null float64
10 PCIP26 5765 non-null float64
11 PCIP45 5765 non-null float64
12 PCIP51 5765 non-null float64
13 PCIP52 5765 non-null float64
14 MD_EARN_WNE_P10 5165 non-null float64
15 GRAD_DEBT_MDN_SUPP 4894 non-null float64
dtypes: float64(12), int64(2), object(2)
memory usage: 818.0+ KB
display(colscd.head(2))
display(colscd.tail(2))
INSTNM | STABBR | PREDDEG | CONTROL | SATVRMID | SATMTMID | SATWRMID | PCIP09 | PCIP11 | PCIP14 | PCIP26 | PCIP45 | PCIP51 | PCIP52 | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | AL | 3 | 1 | 475.0 | 460.0 | 414.0 | 0.0000 | 0.0658 | 0.1210 | 0.1690 | 0.0249 | 0.0000 | 0.1637 | 39059.0 | 31000.0 |
1 | University of Alabama at Birmingham | AL | 3 | 1 | 637.0 | 644.0 | NaN | 0.0333 | 0.0229 | 0.0559 | 0.1125 | 0.0239 | 0.2088 | 0.2159 | 50907.0 | 22300.0 |
INSTNM | STABBR | PREDDEG | CONTROL | SATVRMID | SATMTMID | SATWRMID | PCIP09 | PCIP11 | PCIP14 | PCIP26 | PCIP45 | PCIP51 | PCIP52 | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6541 | Pennsylvania State University-Penn State Brand... | PA | 0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 61185.0 | 25000.0 |
6542 | Burlington County Institute of Technology - Ad... | NJ | 0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Data preparation
Let's clean this up a bit.
- Let's rename the variables to something easier to understand. This is the second place I'm using the variables dict.
colscd = colscd.rename(columns = variables)
colscd.head(3)
name | state | type | ownership | sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | AL | 3 | 1 | 475.0 | 460.0 | 414.0 | 0.0000 | 0.0658 | 0.1210 | 0.1690 | 0.0249 | 0.0000 | 0.1637 | 39059.0 | 31000.0 |
1 | University of Alabama at Birmingham | AL | 3 | 1 | 637.0 | 644.0 | NaN | 0.0333 | 0.0229 | 0.0559 | 0.1125 | 0.0239 | 0.2088 | 0.2159 | 50907.0 | 22300.0 |
2 | Amridge University | AL | 3 | 2 | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 37730.0 | 32189.0 |
- Let's recode the categorical variables. The ownership and type variables are coded as integers. I would rather they were easy to understand. We could use statements like colscd.loc[colscd['type']==1, 'type'] = 'cert' for each change we want to make, but that would get tedious. Instead, we will use the pandas method .replace(). We pass the .replace() method a dictionary in which the keys are the objects we want to replace and the values are what we want to replace them with.
Let's take a look.
type_codes = {0:'na', 1:'cert', 2:'asc', 3:'bach', 4:'grad_only'}
colscd['type'] = colscd['type'].replace(type_codes)
own_codes = {1:'Public', 2:'Private nonprofit', 3:'Private profit'}
colscd['ownership'] = colscd['ownership'].replace(own_codes)
colscd.head()
name | state | type | ownership | sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | AL | bach | Public | 475.0 | 460.0 | 414.0 | 0.0000 | 0.0658 | 0.1210 | 0.1690 | 0.0249 | 0.0000 | 0.1637 | 39059.0 | 31000.0 |
1 | University of Alabama at Birmingham | AL | bach | Public | 637.0 | 644.0 | NaN | 0.0333 | 0.0229 | 0.0559 | 0.1125 | 0.0239 | 0.2088 | 0.2159 | 50907.0 | 22300.0 |
2 | Amridge University | AL | bach | Private nonprofit | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 37730.0 | 32189.0 |
3 | University of Alabama in Huntsville | AL | bach | Public | 640.0 | 655.0 | NaN | 0.0140 | 0.0692 | 0.3028 | 0.0504 | 0.0127 | 0.1341 | 0.1930 | 56901.0 | 20705.0 |
4 | Alabama State University | AL | bach | Public | 488.0 | 457.0 | NaN | 0.0923 | 0.0530 | 0.0138 | 0.0786 | 0.0196 | 0.0904 | 0.1513 | 34018.0 | 31000.0 |
Set the index to the university name. How does the UW look?
colscd.set_index('name', inplace=True)
colscd.loc['University of Wisconsin-Madison']
state WI
type bach
ownership Public
sat_read_med 680.0
sat_math_med 730.0
sat_write_med 640.0
sh_com 0.0616
sh_cs 0.0586
sh_eng 0.0909
sh_bio 0.1025
sh_ss 0.0982
sh_health 0.0732
sh_bus 0.1478
earn_10 70586.0
debt_at_grad 20484.0
Name: University of Wisconsin-Madison, dtype: object
We are a public university that predominantly gives out bachelor's degrees. The largest major is business, followed by biology and engineering. Median earnings 10 years after graduation are about $70k, and median debt at graduation is about $20k.
Split-apply-combine
The data are in good shape. Let's explore .groupby().
We look at each step separately first, to see what it does, then we will chain them all together to split-apply-combine in one line of code.
1. Split: groupby( )
We pass .groupby() a key which tells the method which variable to, well, group by. We will group by state. This is the split step.
What is colscd_grouped?
colscd_grouped = colscd.groupby('state')
print(type(colscd_grouped))
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
A DataFrameGroupBy object. This is basically a DataFrame + the grouping information.
What does it look like? A DataFrameGroupBy is an iterable object. It returns subsets of the original DataFrame by group. In our case, the groups are defined by state.
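Since the grouped object is iterable, we can loop over it directly; each pass yields a (key, sub-DataFrame) pair. A sketch on a small made-up stand-in for colscd:

```python
import pandas as pd

# Invented data: a few institutions and their states
toy = pd.DataFrame({'state':   ['WI', 'WI', 'MN'],
                    'earn_10': [50000, 45000, 48000]})

# Iterating over a DataFrameGroupBy yields (key, sub-DataFrame) pairs,
# one per group; groups come out sorted by key by default
for state, group in toy.groupby('state'):
    print(state, len(group))
# MN 1
# WI 2
```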
The .get_group() method returns a single group as a DataFrame.
colscd_grouped.get_group('WI').sort_index()
# colscd_grouped.get_group('AK').sort_index()
state | type | ownership | sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||||||
Academy of Beauty Professionals | WI | cert | Private profit | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 23064.0 | 6333.0 |
Academy of Beauty Professionals | WI | cert | Private profit | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 23064.0 | 6333.0 |
Academy of Beauty Professionals | WI | cert | Private profit | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 23064.0 | 6333.0 |
Advanced College of Cosmetology | WI | cert | Private profit | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | NaN |
Alverno College | WI | bach | Private nonprofit | NaN | NaN | 498.0 | 0.0284 | 0.0000 | 0.0 | 0.0426 | 0.0106 | 0.4397 | 0.0816 | 48547.0 | 27000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Waukesha County Technical College | WI | cert | Public | NaN | NaN | NaN | 0.0000 | 0.0467 | 0.0 | 0.0000 | 0.0000 | 0.4476 | 0.1053 | 43520.0 | 12000.0 |
Western Technical College | WI | asc | Public | NaN | NaN | NaN | 0.0165 | 0.0330 | 0.0 | 0.0000 | 0.0000 | 0.4258 | 0.1940 | 42215.0 | 11500.0 |
Wisconsin Lutheran College | WI | bach | Private nonprofit | 540.0 | 550.0 | NaN | 0.0356 | 0.0222 | 0.0 | 0.0978 | 0.0311 | 0.1156 | 0.3022 | 51291.0 | 26000.0 |
Wisconsin School of Professional Psychology | WI | grad_only | Private nonprofit | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Wright Graduate University for the Realization of Human Potential | WI | grad_only | Private nonprofit | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
93 rows × 15 columns
...and there is every degree-granting institution in Wisconsin from Academy of Beauty Professionals to the Wright Graduate University for the Realization of Human Potential.
2. + 3. Apply and combine
A major use of groupby is to perform some kind of aggregation. This is the apply and combine step. Let's take the grouped data and compute some means.
# apply the mean operator to the grouped data
all_means = colscd_grouped.mean(numeric_only=True)
# what do we get back?
print(type(all_means))
<class 'pandas.core.frame.DataFrame'>
# Ah, a DataFrame. We know what to do with that.
all_means.head(10)
sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
state | ||||||||||||
AK | 620.000000 | 615.000000 | 482.500000 | 0.002011 | 0.023656 | 0.012044 | 0.010633 | 0.012256 | 0.217022 | 0.164744 | 43604.500000 | 18097.800000 |
AL | 551.444444 | 545.111111 | 498.000000 | 0.013615 | 0.023825 | 0.014719 | 0.026375 | 0.010513 | 0.226493 | 0.122677 | 38225.338028 | 19492.215385 |
AR | 546.285714 | 538.714286 | 464.200000 | 0.007577 | 0.016767 | 0.004451 | 0.015688 | 0.009674 | 0.203054 | 0.091672 | 32972.202703 | 14005.735294 |
AS | NaN | NaN | NaN | 0.000000 | 0.022700 | 0.051100 | 0.000000 | 0.005700 | 0.108000 | 0.130700 | 28524.000000 | NaN |
AZ | 560.000000 | 572.500000 | 505.000000 | 0.004801 | 0.020099 | 0.006194 | 0.006886 | 0.008667 | 0.251198 | 0.060630 | 40854.298851 | 12736.276596 |
CA | 639.181818 | 638.757576 | 570.540000 | 0.013127 | 0.024717 | 0.006771 | 0.014038 | 0.029237 | 0.265019 | 0.084106 | 44461.812371 | 14126.306843 |
CO | 582.000000 | 577.000000 | 570.000000 | 0.020616 | 0.036166 | 0.019154 | 0.015472 | 0.018813 | 0.200251 | 0.089144 | 41578.523077 | 15259.739726 |
CT | 620.000000 | 616.812500 | 532.083333 | 0.018118 | 0.015522 | 0.026929 | 0.023099 | 0.038351 | 0.229379 | 0.090135 | 50771.413793 | 15401.830769 |
DC | 642.800000 | 634.600000 | 605.000000 | 0.023706 | 0.052100 | 0.016976 | 0.022359 | 0.087912 | 0.186541 | 0.246371 | 51325.470588 | 19868.222222 |
DE | 628.000000 | 620.000000 | 505.000000 | 0.009580 | 0.033613 | 0.008400 | 0.010433 | 0.019440 | 0.218613 | 0.125167 | 40319.153846 | 15387.500000 |
When we used .mean() on the grouped data, it applied the mean method to each group, which creates one number per group (for each column). It then combined the means into a DataFrame, one number per group per column. Nice.
Notice that the string columns (type and ownership) have been dropped. .mean() doesn't know what to do with string data.
1. + 2. + 3. Split-apply-combine
Computing the grouped data first helped us understand what was happening, but we typically do the whole split-apply-combine in one step. One simple line of code.
all_means = colscd.groupby('state').mean(numeric_only=True)
all_means.head(10)
sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
state | ||||||||||||
AK | 620.000000 | 615.000000 | 482.500000 | 0.002011 | 0.023656 | 0.012044 | 0.010633 | 0.012256 | 0.217022 | 0.164744 | 43604.500000 | 18097.800000 |
AL | 551.444444 | 545.111111 | 498.000000 | 0.013615 | 0.023825 | 0.014719 | 0.026375 | 0.010513 | 0.226493 | 0.122677 | 38225.338028 | 19492.215385 |
AR | 546.285714 | 538.714286 | 464.200000 | 0.007577 | 0.016767 | 0.004451 | 0.015688 | 0.009674 | 0.203054 | 0.091672 | 32972.202703 | 14005.735294 |
AS | NaN | NaN | NaN | 0.000000 | 0.022700 | 0.051100 | 0.000000 | 0.005700 | 0.108000 | 0.130700 | 28524.000000 | NaN |
AZ | 560.000000 | 572.500000 | 505.000000 | 0.004801 | 0.020099 | 0.006194 | 0.006886 | 0.008667 | 0.251198 | 0.060630 | 40854.298851 | 12736.276596 |
CA | 639.181818 | 638.757576 | 570.540000 | 0.013127 | 0.024717 | 0.006771 | 0.014038 | 0.029237 | 0.265019 | 0.084106 | 44461.812371 | 14126.306843 |
CO | 582.000000 | 577.000000 | 570.000000 | 0.020616 | 0.036166 | 0.019154 | 0.015472 | 0.018813 | 0.200251 | 0.089144 | 41578.523077 | 15259.739726 |
CT | 620.000000 | 616.812500 | 532.083333 | 0.018118 | 0.015522 | 0.026929 | 0.023099 | 0.038351 | 0.229379 | 0.090135 | 50771.413793 | 15401.830769 |
DC | 642.800000 | 634.600000 | 605.000000 | 0.023706 | 0.052100 | 0.016976 | 0.022359 | 0.087912 | 0.186541 | 0.246371 | 51325.470588 | 19868.222222 |
DE | 628.000000 | 620.000000 | 505.000000 | 0.009580 | 0.033613 | 0.008400 | 0.010433 | 0.019440 | 0.218613 | 0.125167 | 40319.153846 | 15387.500000 |
Aggregation methods
Some common aggregation methods include .mean(), .sum(), .std(), .describe(), .min(), and .max(), but there are many more. Any function that returns a scalar will work.
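For instance, any user-defined function that maps a column to a single number can serve as an aggregator (the data below are invented for illustration):

```python
import pandas as pd

# Invented data
toy = pd.DataFrame({'type': ['bach', 'bach', 'cert'],
                    'earn': [50.0, 70.0, 30.0]})

# A user-defined scalar function: the range of the data
def spread(x):
    return x.max() - x.min()

by_type = toy.groupby('type')['earn'].agg(spread)
print(by_type)
# type
# bach    20.0
# cert     0.0
```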
groupby( ) on a subset of columns
We may not care about all the columns in our dataset for a particular groupby. We can subset our DataFrame as usual and then compute a groupby. This saves computing statistics we do not want. For a large data set, this can save time.
Let's focus on the median SAT scores. We will group by the 'ownership' variable.
# Grab the cols we want from the df before using the groupby. Remember to keep the grouping variable, too.
sat_medians_1 = colscd[['sat_read_med', 'sat_math_med', 'sat_write_med', 'ownership']].groupby('ownership').median()
sat_medians_1
sat_read_med | sat_math_med | sat_write_med | |
---|---|---|---|
ownership | |||
Private nonprofit | 580.0 | 570.0 | 509.0 |
Private profit | 630.0 | 622.0 | 475.0 |
Public | 565.0 | 551.0 | 495.0 |
Practice
- Create a dataset with only the public institutions from the colscd DataFrame. Name it pub.
The public institutions are the ones with ownership equal to 'Public'.
pub = colscd[colscd['ownership']=='Public']
The .quantile() method (docs) computes quantiles from the data. (e.g., .quantile(0.5) computes the median, the 0.5 quantile.)
- Let's look at a measure of the earnings variation for different institution types:
a. Compute the 75th percentile of 'earn_10' for each type. Name the new DataFrame 'q75'.
b. Compute the 25th percentile of 'earn_10' for each type. Name the new DataFrame 'q25'.
You should have two new DataFrames, each containing one of the quantile statistics.
q75 = pub[['earn_10', 'type']].groupby('type').quantile(0.75)
q25 = pub[['earn_10', 'type']].groupby('type').quantile(0.25)
2d. For each type, compute the difference between the 75th percentile and the 25th percentile.
This is sometimes called the inter-quartile range. It is a measure of the variability of a variable. It is less sensitive to outliers than the standard deviation.
iqr = q75 - q25
iqr
earn_10 | |
---|---|
type | |
asc | 6556.00 |
bach | 14397.75 |
cert | 8625.50 |
grad_only | 3720.50 |
na | 24715.00 |
Understanding the variation in income from different school types could be an interesting project. Let's practice some more.
- How do SAT reading scores and graduate earnings correlate?
a. Compute the median SAT reading score and median 10-year earnings by state.
sat_rw_med = pub[['sat_read_med', 'earn_10', 'state']].groupby('state').median()
sat_rw_med.head()
sat_read_med | earn_10 | |
---|---|---|
state | ||
AK | 620.0 | 46085.0 |
AL | 541.5 | 37250.0 |
AR | 550.0 | 33405.0 |
AS | NaN | 28524.0 |
AZ | 620.0 | 43959.0 |
3b. Create a scatter plot with the median SAT reading score on the x axis and median 10-year income on the y axis.
If you read the seaborn notebook from week 5, you can try a .regplot().
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(sat_rw_med['sat_read_med'], sat_rw_med['earn_10'], color='black', marker='*')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xlabel('median SAT reading score')
ax.set_ylabel('median 10-year earnings')
plt.show()
import seaborn as sns
fig, ax = plt.subplots(figsize=(10,6))
sns.regplot(x=sat_rw_med['sat_read_med'], y=sat_rw_med['earn_10'], color='black', marker='*')
ax.set_xlabel('median SAT reading score')
ax.set_ylabel('median 10-year earnings')
sns.despine()
plt.show()
Several statistics at once
Once we have grouped our data, we have been applying methods that compute a single statistic: mean(), count(), ...
We now introduce the .agg() method, which lets us compute several statistics at once; you can even pass it a user-defined function.
# This is the same as groupby('type').count()
colscd[['earn_10', 'type']].groupby('type').agg('count')
#colscd[['earn_10', 'type']].groupby('type').count()
earn_10 | |
---|---|
type | |
asc | 985 |
bach | 1815 |
cert | 1906 |
grad_only | 21 |
na | 438 |
# But agg() lets us compute many stats at once
colscd[['earn_10', 'sat_read_med', 'type']].groupby('type').agg(['count', 'mean', 'median', 'std', 'max'])
earn_10 | sat_read_med | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
count | mean | median | std | max | count | mean | median | std | max | |
type | ||||||||||
asc | 985 | 39093.609137 | 37765.0 | 9991.976462 | 107401.0 | 16 | 533.375000 | 534.5 | 64.785672 | 720.0 |
bach | 1815 | 52838.200551 | 50273.0 | 15950.906184 | 129442.0 | 963 | 581.755971 | 575.0 | 65.794944 | 760.0 |
cert | 1906 | 30573.251312 | 28400.5 | 10887.433084 | 95079.0 | 3 | 508.666667 | 493.0 | 29.771351 | 543.0 |
grad_only | 21 | 63038.333333 | 58068.0 | 20185.906540 | 100141.0 | 0 | NaN | NaN | NaN | NaN |
na | 438 | 42066.890411 | 42432.0 | 13807.471374 | 99012.0 | 0 | NaN | NaN | NaN | NaN |
Now we have a MultiIndex on the columns.
We can use .agg() to pass a user-defined function, too. We will work on that in the practice problems.
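To pull one statistic out of MultiIndexed columns, index with a (variable, statistic) tuple. A sketch on invented data:

```python
import pandas as pd

# Invented data
toy = pd.DataFrame({'type': ['bach', 'cert', 'bach'],
                    'earn': [50.0, 30.0, 70.0]})

stats = toy.groupby('type').agg(['mean', 'max'])
# The columns are now a MultiIndex: ('earn', 'mean') and ('earn', 'max')

# Select one (variable, statistic) pair by indexing with a tuple
means = stats[('earn', 'mean')]
print(means)
# type
# bach    60.0
# cert    30.0
```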
groupby( ) with many keys
Can we group by several keys? You know we can. Let's compute the medians of the DataFrame this time.
all_medians = colscd.groupby(['ownership','type']).median(numeric_only=True)
all_medians
sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ownership | type | ||||||||||||
Private nonprofit | asc | 534.0 | 503.0 | 455.0 | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.13220 | 0.00000 | 38461.0 | 16000.0 |
bach | 580.0 | 570.0 | 510.0 | 0.01080 | 0.00930 | 0.00000 | 0.0419 | 0.01190 | 0.03520 | 0.15140 | 51312.0 | 25000.0 | |
cert | 493.0 | 465.0 | 505.0 | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.51185 | 0.00000 | 29012.0 | 12000.0 | |
grad_only | NaN | NaN | NaN | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 58068.0 | 20000.0 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 44642.0 | 23879.0 | |
Private profit | asc | 540.0 | 505.0 | NaN | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.57960 | 0.00000 | 34922.0 | 18668.0 |
bach | 641.5 | 627.0 | 475.0 | 0.00000 | 0.01350 | 0.00000 | 0.0000 | 0.00000 | 0.00370 | 0.06980 | 42990.0 | 26123.0 | |
cert | NaN | NaN | NaN | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 25408.0 | 9500.0 | |
grad_only | NaN | NaN | NaN | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 72945.0 | 32946.0 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 37168.0 | 18668.0 | |
Public | asc | 527.5 | 497.5 | 430.5 | 0.00000 | 0.02740 | 0.00000 | 0.0000 | 0.00000 | 0.18480 | 0.09750 | 38011.0 | 10500.0 |
bach | 565.0 | 555.0 | 495.0 | 0.03575 | 0.03270 | 0.00835 | 0.0515 | 0.04255 | 0.09500 | 0.16585 | 50715.5 | 20705.0 | |
cert | NaN | NaN | 391.0 | 0.00000 | 0.01445 | 0.00000 | 0.0000 | 0.00000 | 0.30915 | 0.02995 | 35473.0 | 9500.0 | |
grad_only | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 68595.5 | 26814.0 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 42689.0 | 17500.0 |
Now we have a MultiIndexed DataFrame with the summary statistics, this time the medians.
The three ownership types (private nonprofit, private profit, and public) all have institutions that predominantly offer bachelor's degrees. How do the median SAT scores compare across public and private institutions?
Let's grab that set of statistics, for some quick practice with a MultiIndex.
bach_sat_med = all_medians[['sat_read_med', 'sat_math_med', 'sat_write_med']].xs('bach', level='type') # xs() indexes data from a MultiIndex
bach_sat_med
sat_read_med | sat_math_med | sat_write_med | |
---|---|---|---|
ownership | |||
Private nonprofit | 580.0 | 570.0 | 510.0 |
Private profit | 641.5 | 627.0 | 475.0 |
Public | 565.0 | 555.0 | 495.0 |
Practice
- Below is a function that returns the average of the 5 largest elements of a Series (a column of a DataFrame).
The input, x, will be a column of a DataFrame. The output is a single number.
def avg5(x):
top5 = x.sort_values().iloc[-5:] # sort x, then grab the five largest.
return top5.mean() # return the mean
- Test the function on column 'a' of the DataFrame defined below. The answer should be 8.
test = pd.DataFrame({'a':[1, 4, 6, 9, 10, 3, 7, 8], 'b':[2, 3, 4, 5, 6, 7, 8, 10] })
The command would look like:
avg5(test['a'])
test = pd.DataFrame({'a':[1, 4, 6, 9, 10, 3, 7, 8], 'b':[2, 3, 4, 5, 6, 7, 8, 10] })
test
a | b | |
---|---|---|
0 | 1 | 2 |
1 | 4 | 3 |
2 | 6 | 4 |
3 | 9 | 5 |
4 | 10 | 6 |
5 | 3 | 7 |
6 | 7 | 8 |
7 | 8 | 10 |
avg5(test['a'])
8.0
Now return to colscd
- Drop any observation that has 'debt_at_grad' == NaN. Use the .dropna() method (docs).
colscd.dropna(subset=['debt_at_grad'], inplace=True)
- Compute the mean, median, and avg5 of 'debt_at_grad' by 'ownership'. Compute them all at once using .agg(). Since avg5 is user-defined, we pass the name of the function without quotes, e.g., .agg(['mean', avg5]).
colscd.dropna(subset=['debt_at_grad'] , inplace=True) # Drop the missing values
colscd.groupby('ownership')['debt_at_grad'].agg(['mean', 'median', avg5]) # Compute the summary stats
mean | median | avg5 | |
---|---|---|---|
ownership | |||
Private nonprofit | 22270.035165 | 24000.0 | 39597.2 |
Private profit | 12646.573190 | 9500.0 | 42794.6 |
Public | 14655.012642 | 12500.0 | 32100.0 |
Extra practice
If you want to practice some more, try writing two functions: one that returns the 25th percentile and one that returns the 75th percentile.
Then redo question 2 from the first practice, but using only one groupby and the .agg() method.
- Let's look at a measure of the earnings spread for different institution types:
- Compute the 75th percentile of 'earn_10' for each 'type'.
- Compute the 25th percentile of 'earn_10' for each 'type'.
You should have only one new DataFrame containing all of the quantile statistics.
def qt75(x):
return x.quantile(0.75)
def qt25(x):
return x.quantile(0.25)
qts = colscd[['earn_10', 'type']].groupby('type').agg([qt25, qt75])
qts
earn_10 | ||
---|---|---|
qt25 | qt75 | |
type | ||
asc | 35115.25 | 42925.0 |
bach | 43634.00 | 60030.5 |
cert | 23476.00 | 35856.0 |
grad_only | 58068.00 | 68306.0 |
na | 36470.00 | 44919.0 |
2d. For each type, compute the difference between the 75th percentile and the 25th percentile.
This is sometimes called the inter-quartile range. It is a measure of the variability of a variable. It is less sensitive to outliers than the standard deviation.
qts[('earn_10', 'iqr')] = qts[('earn_10', 'qt75')] - qts[('earn_10', 'qt25')]
qts
earn_10 | |||
---|---|---|---|
qt25 | qt75 | iqr | |
type | |||
asc | 35115.25 | 42925.0 | 7809.75 |
bach | 43634.00 | 60030.5 | 16396.50 |
cert | 23476.00 | 35856.0 | 12380.00 |
grad_only | 58068.00 | 68306.0 | 10238.00 |
na | 36470.00 | 44919.0 | 8449.00 |
[The simplest way to do this is to use lambda functions rather than write our own two functions. We have not covered these in class, but they would not be difficult to learn on your own. Try searching for 'pandas agg pass argument.']
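A hedged sketch of the lambda approach on invented data, using named aggregation so the result columns get readable names:

```python
import pandas as pd

# Invented data for illustration
toy = pd.DataFrame({'type':    ['bach', 'bach', 'cert', 'cert'],
                    'earn_10': [40000.0, 60000.0, 20000.0, 30000.0]})

# Named aggregation: each keyword becomes a column in the result,
# and the lambdas carry the quantile argument we want
qts = toy.groupby('type')['earn_10'].agg(
    qt25=lambda x: x.quantile(0.25),
    qt75=lambda x: x.quantile(0.75))

qts['iqr'] = qts['qt75'] - qts['qt25']
print(qts)
```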