
Groupby

files needed = ('Most-Recent-Cohorts-Institution.zip')

We often want to know how groups differ. Do workers with econ degrees make more than workers with history degrees? Do men live longer than women? Does it matter how much education you have?

Pandas provides the groupby( ) method to ease computing statistics by group (docs). This kind of method shows up in many data-oriented computing languages and packages. The idea is summed up as

split-apply-combine

Here is the canonical illustration. The big idea is to

  1. Split the data up into groups. The groups are defined by key variables.
  2. Apply some method or function to each group: mean, std, max, etc. This returns a smaller bit of data, often just one number.
  3. Combine the results of the 'apply' from each group into a new data structure.

Split-apply-combine is an incredibly powerful feature of pandas. We will cover the basics here.

import pandas as pd
import matplotlib.pyplot as plt
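Before we load real data, here is the idea in miniature on a made-up DataFrame (the numbers below are invented for illustration):

# A toy split-apply-combine: split on 'major', apply .mean() to each group,
# then combine the group means into a new DataFrame.
toy = pd.DataFrame({'major': ['econ', 'econ', 'hist', 'hist'],
                    'wage': [50, 60, 40, 45]})
toy.groupby('major').mean()     # econ -> 55.0, hist -> 42.5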

College Scorecard

Let's take this opportunity to learn about a new dataset: The College Scorecard. The data are compiled by the Dept. of Education to help students evaluate higher education institutions. The data are very well documented and include such juicy variables as prices, post-program debt levels, earnings, completion rates, and information about student outcomes by family income and other demographic variables.

We will be working off of the 'institution-level' data file.

[There is an API, too, but we would all have to sign up for a key.]

The data set is large. There are thousands of variables. I am creating a dict that maps each variable's name in the data set to the name I will use in my DataFrame. This will come in handy...

How did I figure out the variable names and definitions? I read the documentation.

variables = {'CONTROL':'ownership', 'INSTNM':'name', 
             'STABBR':'state', 'PREDDEG':'type', 
             'SATVRMID':'sat_read_med', 'SATMTMID':'sat_math_med', 'SATWRMID':'sat_write_med', 
             'PCIP52':'sh_bus', 'PCIP11':'sh_cs', 'PCIP14':'sh_eng', 'PCIP26':'sh_bio', 
             'PCIP45':'sh_ss', 'PCIP51':'sh_health', 'PCIP09':'sh_com',
             'MD_EARN_WNE_P10':'earn_10', 'GRAD_DEBT_MDN_SUPP':'debt_at_grad'}

I'm doing a couple of things here that we may not have discussed before.

  1. I'm reading the zip file without first decompressing it. Since there is only one file in the zipped package, pandas reads the file into a DataFrame. Pandas tries to guess the compression format; you can use the compression argument if it isn't able to infer it (a sketch of the explicit version follows the next code cell).

  2. I'm using the .keys() method of dict to provide the names of the columns that I want loaded. This saves me the trouble of loading all the columns and then keeping only the ones I want. [This is the first place I'll use the variables dict.]

colscd = pd.read_csv('Most-Recent-Cohorts-Institution.zip', 
                     na_values=['PrivacySuppressed', 'NULL'], 
                     usecols=variables.keys())
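If pandas could not infer the compression format, we could spell it out. A minimal sketch of the same call with an explicit compression argument:

# Same read, but naming the compression format explicitly.
colscd = pd.read_csv('Most-Recent-Cohorts-Institution.zip',
                     compression='zip',
                     na_values=['PrivacySuppressed', 'NULL'],
                     usecols=variables.keys())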

We are loading a new data set. Let's do our checks.

Here, I'm using the .info() method of DataFrame. It gives me the shape of the DataFrame and the dtypes.

colscd.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6543 entries, 0 to 6542
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   INSTNM              6543 non-null   object 
 1   STABBR              6543 non-null   object 
 2   PREDDEG             6543 non-null   int64  
 3   CONTROL             6543 non-null   int64  
 4   SATVRMID            982 non-null    float64
 5   SATMTMID            982 non-null    float64
 6   SATWRMID            702 non-null    float64
 7   PCIP09              5765 non-null   float64
 8   PCIP11              5765 non-null   float64
 9   PCIP14              5765 non-null   float64
 10  PCIP26              5765 non-null   float64
 11  PCIP45              5765 non-null   float64
 12  PCIP51              5765 non-null   float64
 13  PCIP52              5765 non-null   float64
 14  MD_EARN_WNE_P10     5165 non-null   float64
 15  GRAD_DEBT_MDN_SUPP  4894 non-null   float64
dtypes: float64(12), int64(2), object(2)
memory usage: 818.0+ KB
display(colscd.head(2))
display(colscd.tail(2))
INSTNM STABBR PREDDEG CONTROL SATVRMID SATMTMID SATWRMID PCIP09 PCIP11 PCIP14 PCIP26 PCIP45 PCIP51 PCIP52 MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama A & M University AL 3 1 475.0 460.0 414.0 0.0000 0.0658 0.1210 0.1690 0.0249 0.0000 0.1637 39059.0 31000.0
1 University of Alabama at Birmingham AL 3 1 637.0 644.0 NaN 0.0333 0.0229 0.0559 0.1125 0.0239 0.2088 0.2159 50907.0 22300.0
INSTNM STABBR PREDDEG CONTROL SATVRMID SATMTMID SATWRMID PCIP09 PCIP11 PCIP14 PCIP26 PCIP45 PCIP51 PCIP52 MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
6541 Pennsylvania State University-Penn State Brand... PA 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 61185.0 25000.0
6542 Burlington County Institute of Technology - Ad... NJ 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Data preparation

Let's clean this up a bit.

  1. Let's rename the variables to something easier to understand. This is the second place I'm using the variables dict.
colscd = colscd.rename(columns = variables)
colscd.head(3)
name state type ownership sat_read_med sat_math_med sat_write_med sh_com sh_cs sh_eng sh_bio sh_ss sh_health sh_bus earn_10 debt_at_grad
0 Alabama A & M University AL 3 1 475.0 460.0 414.0 0.0000 0.0658 0.1210 0.1690 0.0249 0.0000 0.1637 39059.0 31000.0
1 University of Alabama at Birmingham AL 3 1 637.0 644.0 NaN 0.0333 0.0229 0.0559 0.1125 0.0239 0.2088 0.2159 50907.0 22300.0
2 Amridge University AL 3 2 NaN NaN NaN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.5000 37730.0 32189.0
  2. Let's recode the categorical variables. The ownership and type variables are coded as integers. I would rather they were easy to understand. We could use statements like
colscd.loc[colscd['type']==1, 'type'] = 'cert'

for each change we want to make, but that would get tedious. Instead, we will use the pandas method .replace(). We pass the .replace() method a dictionary in which the keys are the objects we want to replace and the values are what we want to replace them with.

Let's take a look.

type_codes = {0:'na', 1:'cert', 2:'asc', 3:'bach', 4:'grad_only'}
colscd['type'] = colscd['type'].replace(type_codes)

own_codes = {1:'Public', 2:'Private nonprofit', 3:'Private profit'}
colscd['ownership'] = colscd['ownership'].replace(own_codes)
colscd.head()
name state type ownership sat_read_med sat_math_med sat_write_med sh_com sh_cs sh_eng sh_bio sh_ss sh_health sh_bus earn_10 debt_at_grad
0 Alabama A & M University AL bach Public 475.0 460.0 414.0 0.0000 0.0658 0.1210 0.1690 0.0249 0.0000 0.1637 39059.0 31000.0
1 University of Alabama at Birmingham AL bach Public 637.0 644.0 NaN 0.0333 0.0229 0.0559 0.1125 0.0239 0.2088 0.2159 50907.0 22300.0
2 Amridge University AL bach Private nonprofit NaN NaN NaN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.5000 37730.0 32189.0
3 University of Alabama in Huntsville AL bach Public 640.0 655.0 NaN 0.0140 0.0692 0.3028 0.0504 0.0127 0.1341 0.1930 56901.0 20705.0
4 Alabama State University AL bach Public 488.0 457.0 NaN 0.0923 0.0530 0.0138 0.0786 0.0196 0.0904 0.1513 34018.0 31000.0

Set the index to the university name. How does the UW look?

colscd.set_index('name', inplace=True)
colscd.loc['University of Wisconsin-Madison']
state                 WI
type                bach
ownership         Public
sat_read_med       680.0
sat_math_med       730.0
sat_write_med      640.0
sh_com            0.0616
sh_cs             0.0586
sh_eng            0.0909
sh_bio            0.1025
sh_ss             0.0982
sh_health         0.0732
sh_bus            0.1478
earn_10          70586.0
debt_at_grad     20484.0
Name: University of Wisconsin-Madison, dtype: object

We are a public university that predominantly gives out bachelor's degrees. The largest share of majors is business, followed by biology and the social sciences. Median earnings 10 years after graduation are about $70k, and median debt at graduation is about $20k.

Split-apply-combine

The data are in good shape. Let's explore .groupby().

We look at each step separately first, to see what it does; then we will chain them all together to split-apply-combine in one line of code.

1. Split: groupby( )

We pass .groupby() a key which tells the method which variable to, well, group by. We will group by state. This is the split step.

What is colscd_grouped?

colscd_grouped = colscd.groupby('state')
print(type(colscd_grouped))
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>

A DataFrameGroupBy object. This is basically a DataFrame + the grouping information.

What does it look like? A DataFrameGroupBy is an iterable object. It returns subsets of the original DataFrame by group. In our case, the groups are defined by state.
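For instance, we can loop over the grouped object; each pass through the loop hands back a (key, DataFrame) pair. A quick sketch:

# Each iteration returns the group's key and that group's rows.
for state, group in colscd_grouped:
    print(state, group.shape)
    break     # stop after the first group; drop the break to see them all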

The .get_group() method returns a single group as a DataFrame.

colscd_grouped.get_group('WI').sort_index()

# colscd_grouped.get_group('AK').sort_index()
state type ownership sat_read_med sat_math_med sat_write_med sh_com sh_cs sh_eng sh_bio sh_ss sh_health sh_bus earn_10 debt_at_grad
name
Academy of Beauty Professionals WI cert Private profit NaN NaN NaN 0.0000 0.0000 0.0 0.0000 0.0000 0.0000 0.0000 23064.0 6333.0
Academy of Beauty Professionals WI cert Private profit NaN NaN NaN 0.0000 0.0000 0.0 0.0000 0.0000 0.0000 0.0000 23064.0 6333.0
Academy of Beauty Professionals WI cert Private profit NaN NaN NaN 0.0000 0.0000 0.0 0.0000 0.0000 0.0000 0.0000 23064.0 6333.0
Advanced College of Cosmetology WI cert Private profit NaN NaN NaN 0.0000 0.0000 0.0 0.0000 0.0000 0.0000 0.0000 NaN NaN
Alverno College WI bach Private nonprofit NaN NaN 498.0 0.0284 0.0000 0.0 0.0426 0.0106 0.4397 0.0816 48547.0 27000.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Waukesha County Technical College WI cert Public NaN NaN NaN 0.0000 0.0467 0.0 0.0000 0.0000 0.4476 0.1053 43520.0 12000.0
Western Technical College WI asc Public NaN NaN NaN 0.0165 0.0330 0.0 0.0000 0.0000 0.4258 0.1940 42215.0 11500.0
Wisconsin Lutheran College WI bach Private nonprofit 540.0 550.0 NaN 0.0356 0.0222 0.0 0.0978 0.0311 0.1156 0.3022 51291.0 26000.0
Wisconsin School of Professional Psychology WI grad_only Private nonprofit NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Wright Graduate University for the Realization of Human Potential WI grad_only Private nonprofit NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

93 rows × 15 columns

...and there is every degree-granting institution in Wisconsin from Academy of Beauty Professionals to the Wright Graduate University for the Realization of Human Potential.

2. + 3. Apply and combine

A major use of groupby is to perform some kind of aggregation. This is the apply and combine step. Let's take the grouped data and compute some means.

# apply the mean operator to the grouped data
all_means = colscd_grouped.mean(numeric_only=True)  

# what do we get back?
print(type(all_means))
<class 'pandas.core.frame.DataFrame'>
# Ah, a DataFrame. We know what to do with that. 
all_means.head(10)
sat_read_med sat_math_med sat_write_med sh_com sh_cs sh_eng sh_bio sh_ss sh_health sh_bus earn_10 debt_at_grad
state
AK 620.000000 615.000000 482.500000 0.002011 0.023656 0.012044 0.010633 0.012256 0.217022 0.164744 43604.500000 18097.800000
AL 551.444444 545.111111 498.000000 0.013615 0.023825 0.014719 0.026375 0.010513 0.226493 0.122677 38225.338028 19492.215385
AR 546.285714 538.714286 464.200000 0.007577 0.016767 0.004451 0.015688 0.009674 0.203054 0.091672 32972.202703 14005.735294
AS NaN NaN NaN 0.000000 0.022700 0.051100 0.000000 0.005700 0.108000 0.130700 28524.000000 NaN
AZ 560.000000 572.500000 505.000000 0.004801 0.020099 0.006194 0.006886 0.008667 0.251198 0.060630 40854.298851 12736.276596
CA 639.181818 638.757576 570.540000 0.013127 0.024717 0.006771 0.014038 0.029237 0.265019 0.084106 44461.812371 14126.306843
CO 582.000000 577.000000 570.000000 0.020616 0.036166 0.019154 0.015472 0.018813 0.200251 0.089144 41578.523077 15259.739726
CT 620.000000 616.812500 532.083333 0.018118 0.015522 0.026929 0.023099 0.038351 0.229379 0.090135 50771.413793 15401.830769
DC 642.800000 634.600000 605.000000 0.023706 0.052100 0.016976 0.022359 0.087912 0.186541 0.246371 51325.470588 19868.222222
DE 628.000000 620.000000 505.000000 0.009580 0.033613 0.008400 0.010433 0.019440 0.218613 0.125167 40319.153846 15387.500000

When we used .mean() on the grouped data, it applied the mean method to each group, which creates one number per group (for each column). It then combined the means into a DataFrame, one number per group per column. Nice.

Notice that the string columns (type, ownership) have been dropped: with numeric_only=True, .mean() skips data it doesn't know how to average. The grouping variable, state, is not dropped; it has become the index of the result.

1. + 2. + 3. Split-apply-combine

Computing the grouped data first helped us understand what was happening, but we typically do the whole split-apply-combine in one step. One simple line of code.

all_means = colscd.groupby('state').mean(numeric_only=True)
all_means.head(10)
sat_read_med sat_math_med sat_write_med sh_com sh_cs sh_eng sh_bio sh_ss sh_health sh_bus earn_10 debt_at_grad
state
AK 620.000000 615.000000 482.500000 0.002011 0.023656 0.012044 0.010633 0.012256 0.217022 0.164744 43604.500000 18097.800000
AL 551.444444 545.111111 498.000000 0.013615 0.023825 0.014719 0.026375 0.010513 0.226493 0.122677 38225.338028 19492.215385
AR 546.285714 538.714286 464.200000 0.007577 0.016767 0.004451 0.015688 0.009674 0.203054 0.091672 32972.202703 14005.735294
AS NaN NaN NaN 0.000000 0.022700 0.051100 0.000000 0.005700 0.108000 0.130700 28524.000000 NaN
AZ 560.000000 572.500000 505.000000 0.004801 0.020099 0.006194 0.006886 0.008667 0.251198 0.060630 40854.298851 12736.276596
CA 639.181818 638.757576 570.540000 0.013127 0.024717 0.006771 0.014038 0.029237 0.265019 0.084106 44461.812371 14126.306843
CO 582.000000 577.000000 570.000000 0.020616 0.036166 0.019154 0.015472 0.018813 0.200251 0.089144 41578.523077 15259.739726
CT 620.000000 616.812500 532.083333 0.018118 0.015522 0.026929 0.023099 0.038351 0.229379 0.090135 50771.413793 15401.830769
DC 642.800000 634.600000 605.000000 0.023706 0.052100 0.016976 0.022359 0.087912 0.186541 0.246371 51325.470588 19868.222222
DE 628.000000 620.000000 505.000000 0.009580 0.033613 0.008400 0.010433 0.019440 0.218613 0.125167 40319.153846 15387.500000

Aggregation methods

Some common aggregation methods include .mean(), .sum(), .std(), .min(), and .max(), but there are many more (.describe() even returns several statistics at once). In general, any function that returns a scalar will work.
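To see this, here is a sketch with a user-defined function (spread is a name made up here) that maps a column to a single number:

# Any function that reduces a column to a scalar can be used as an aggregator.
def spread(x):
    return x.max() - x.min()

colscd.groupby('state')['earn_10'].agg(spread).head()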

groupby( ) on a subset of columns

We may not care about all the columns in our dataset for a particular groupby. We can subset our DataFrame as usual and then compute the groupby. This saves computing statistics we do not want; for a large data set, this can save time.

Let's focus on the median SAT scores. We will group by the 'ownership' variable.

# Grab the cols we want from the df before using the groupby. Remember to keep the grouping variable, too.
sat_medians_1 = colscd[['sat_read_med', 'sat_math_med', 'sat_write_med', 'ownership']].groupby('ownership').median()
sat_medians_1
sat_read_med sat_math_med sat_write_med
ownership
Private nonprofit 580.0 570.0 509.0
Private profit 630.0 622.0 475.0
Public 565.0 551.0 495.0

Practice

  1. Create a dataset with only the public institutions from the colscd DataFrame. Name it pub.

The public institutions are the ones with ownership equal to 'Public'.

pub = colscd[colscd['ownership']=='Public']

The .quantile( ) method (docs) computes quantiles from the data (e.g., .quantile(0.5) computes the median, the 0.5 quantile or 50th percentile).

  2. Let's look at a measure of the earnings variation for different institution types:

    a. Compute the 75th percentile of 'earn_10' for each type. Name the new DataFrame 'q75'.
    b. Compute the 25th percentile of 'earn_10' for each type. Name the new DataFrame 'q25'.

You should have two new DataFrames, each containing one of the quantile statistics.

q75 = pub[['earn_10', 'type']].groupby('type').quantile(0.75)
q25 = pub[['earn_10', 'type']].groupby('type').quantile(0.25)

2d. For each type, compute the difference between the 75th percentile and the 25th percentile.

This is sometimes called the interquartile range (IQR). It is a measure of the variability of a variable that is less sensitive to outliers than the standard deviation.

iqr = q75 - q25
iqr
earn_10
type
asc 6556.00
bach 14397.75
cert 8625.50
grad_only 3720.50
na 24715.00

Understanding the variation in income from different school types could be an interesting project. Let's practice some more.

  3. How do SAT reading scores and graduate earnings correlate?

    a. Compute the median SAT reading score and median 10-year earnings by state.

sat_rw_med = pub[['sat_read_med', 'earn_10', 'state']].groupby('state').median()
sat_rw_med.head()
sat_read_med earn_10
state
AK 620.0 46085.0
AL 541.5 37250.0
AR 550.0 33405.0
AS NaN 28524.0
AZ 620.0 43959.0

3b. Create a scatter plot with the median SAT reading score on the x axis and median 10-year income on the y axis.

If you read the seaborn notebook from week 5, you can try a .regplot().

fig, ax = plt.subplots(figsize=(10,6))

ax.scatter(sat_rw_med['sat_read_med'], sat_rw_med['earn_10'], color='black', marker='*')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.set_xlabel('median SAT reading score')
ax.set_ylabel('median 10-year earnings')

plt.show()

[figure: scatter plot of median SAT reading score (x) vs. median 10-year earnings (y)]

import seaborn as sns
fig, ax = plt.subplots(figsize=(10,6))

sns.regplot(x=sat_rw_med['sat_read_med'], y=sat_rw_med['earn_10'], color='black', marker='*', ax=ax)

ax.set_xlabel('median SAT reading score')
ax.set_ylabel('median 10-year earnings')

sns.despine()

plt.show()

[figure: the same scatter with a seaborn regression line (regplot)]

Several statistics at once

So far, once we have grouped our data, we have applied methods that compute a single statistic: mean(), count(),...

We now introduce the .agg( ) method, which lets us compute several statistics at once. You can even pass it a user-defined function.

# This is the same as groupby('type').count()
colscd[['earn_10', 'type']].groupby('type').agg('count')

#colscd[['earn_10', 'type']].groupby('type').count()
earn_10
type
asc 985
bach 1815
cert 1906
grad_only 21
na 438
# But agg() lets us compute many stats at once
colscd[['earn_10', 'sat_read_med', 'type']].groupby('type').agg(['count', 'mean', 'median', 'std', 'max'])
earn_10 sat_read_med
count mean median std max count mean median std max
type
asc 985 39093.609137 37765.0 9991.976462 107401.0 16 533.375000 534.5 64.785672 720.0
bach 1815 52838.200551 50273.0 15950.906184 129442.0 963 581.755971 575.0 65.794944 760.0
cert 1906 30573.251312 28400.5 10887.433084 95079.0 3 508.666667 493.0 29.771351 543.0
grad_only 21 63038.333333 58068.0 20185.906540 100141.0 0 NaN NaN NaN NaN
na 438 42066.890411 42432.0 13807.471374 99012.0 0 NaN NaN NaN NaN

Now we have a MultiIndex on the columns.
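We select from MultiIndexed columns with (variable, statistic) tuples. A quick sketch, first saving the table above to a name (stats is my choice here):

# Save the table of statistics, then pick columns off the MultiIndex.
stats = colscd[['earn_10', 'sat_read_med', 'type']].groupby('type').agg(['count', 'mean', 'median', 'std', 'max'])

stats[('earn_10', 'median')]     # a single Series: median earnings by type
stats['earn_10']                 # all of the earn_10 statistics at once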

We can use .agg() to pass a user-defined function, too. We will work on that in the practice problems.

groupby( ) with many keys

Can we group by several keys? You know we can. Let's compute the medians of the DataFrame this time.

all_medians = colscd.groupby(['ownership','type']).median(numeric_only=True)
all_medians
sat_read_med sat_math_med sat_write_med sh_com sh_cs sh_eng sh_bio sh_ss sh_health sh_bus earn_10 debt_at_grad
ownership type
Private nonprofit asc 534.0 503.0 455.0 0.00000 0.00000 0.00000 0.0000 0.00000 0.13220 0.00000 38461.0 16000.0
bach 580.0 570.0 510.0 0.01080 0.00930 0.00000 0.0419 0.01190 0.03520 0.15140 51312.0 25000.0
cert 493.0 465.0 505.0 0.00000 0.00000 0.00000 0.0000 0.00000 0.51185 0.00000 29012.0 12000.0
grad_only NaN NaN NaN 0.00000 0.00000 0.00000 0.0000 0.00000 0.00000 0.00000 58068.0 20000.0
na NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 44642.0 23879.0
Private profit asc 540.0 505.0 NaN 0.00000 0.00000 0.00000 0.0000 0.00000 0.57960 0.00000 34922.0 18668.0
bach 641.5 627.0 475.0 0.00000 0.01350 0.00000 0.0000 0.00000 0.00370 0.06980 42990.0 26123.0
cert NaN NaN NaN 0.00000 0.00000 0.00000 0.0000 0.00000 0.00000 0.00000 25408.0 9500.0
grad_only NaN NaN NaN 0.00000 0.00000 0.00000 0.0000 0.00000 0.00000 0.00000 72945.0 32946.0
na NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 37168.0 18668.0
Public asc 527.5 497.5 430.5 0.00000 0.02740 0.00000 0.0000 0.00000 0.18480 0.09750 38011.0 10500.0
bach 565.0 555.0 495.0 0.03575 0.03270 0.00835 0.0515 0.04255 0.09500 0.16585 50715.5 20705.0
cert NaN NaN 391.0 0.00000 0.01445 0.00000 0.0000 0.00000 0.30915 0.02995 35473.0 9500.0
grad_only NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 68595.5 26814.0
na NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 42689.0 17500.0

Now we have a MultiIndexed DataFrame of summary statistics, this time the medians.

The three ownership types (private nonprofit, private for-profit, and public) all have institutions that predominantly offer bachelor's degrees. How do the median SAT scores compare across public and private institutions?

Let's grab that set of statistics for a quick practice with the MultiIndex.

bach_sat_med = all_medians[['sat_read_med', 'sat_math_med', 'sat_write_med']].xs('bach', level='type')         # xs() indexes data from a MultiIndex
bach_sat_med
sat_read_med sat_math_med sat_write_med
ownership
Private nonprofit 580.0 570.0 510.0
Private profit 641.5 627.0 475.0
Public 565.0 555.0 495.0
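An alternative to .xs() is .loc with pd.IndexSlice, which keeps both index levels in the result. A sketch:

# Select the 'bach' rows from the (ownership, type) MultiIndex using .loc.
idx = pd.IndexSlice
all_medians.loc[idx[:, 'bach'], ['sat_read_med', 'sat_math_med', 'sat_write_med']]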

Practice

  1. Below is a function that returns the average of the 5 largest elements of a Series (a column of a DataFrame).

The input, x, will be a column of a DataFrame. The output is a single number.

def avg5(x):
    top5 = x.sort_values().iloc[-5:]     # sort x, then grab the five largest. 
    return top5.mean()                   # return the mean 
  2. Test the function on column 'a' of the DataFrame defined below. The answer should be 8.
test = pd.DataFrame({'a':[1, 4, 6, 9, 10, 3, 7, 8], 'b':[2, 3, 4, 5, 6, 7, 8, 10] })

The command would look like:

avg5(test['a'])
test
a b
0 1 2
1 4 3
2 6 4
3 9 5
4 10 6
5 3 7
6 7 8
7 8 10
avg5(test['a'])
8.0

Now return to colscd

  3. Drop any observation that has 'debt_at_grad' == NaN. Use the .dropna() method (docs).
colscd.dropna(subset=['debt_at_grad'], inplace=True)
  4. Compute the mean, median, and avg5 of 'debt_at_grad' by 'ownership'. Compute them all at once using .agg(). Since avg5 is user-defined, we pass the name of the function without quotes, e.g., .agg(['mean', avg5]).

colscd.groupby('ownership')['debt_at_grad'].agg(['mean', 'median', avg5]) # Compute the summary stats
mean median avg5
ownership
Private nonprofit 22270.035165 24000.0 39597.2
Private profit 12646.573190 9500.0 42794.6
Public 14655.012642 12500.0 32100.0

Extra practice

If you want to practice some more, try writing two functions: one that returns the 25th percentile and one that returns the 75th percentile.

Then redo question 2 from the first practice, but using only one groupby and the .agg() method.

  2. Let's look at a measure of the earnings spread for different institution types:
    a. Compute the 75th percentile of 'earn_10' for each 'type'.
    b. Compute the 25th percentile of 'earn_10' for each 'type'.

You should have only one new DataFrame containing all of the quantile statistics.

def qt75(x):
    return x.quantile(0.75)

def qt25(x):
    return x.quantile(0.25)

qts = colscd[['earn_10', 'type']].groupby('type').agg([qt25, qt75])
qts
earn_10
qt25 qt75
type
asc 35115.25 42925.0
bach 43634.00 60030.5
cert 23476.00 35856.0
grad_only 58068.00 68306.0
na 36470.00 44919.0

2d. For each type, compute the difference between the 75th percentile and the 25th percentile.

This is sometimes called the interquartile range (IQR). It is a measure of the variability of a variable that is less sensitive to outliers than the standard deviation.

qts[('earn_10', 'iqr')] = qts[('earn_10', 'qt75')] - qts[('earn_10', 'qt25')]
qts
earn_10
qt25 qt75 iqr
type
asc 35115.25 42925.0 7809.75
bach 43634.00 60030.5 16396.50
cert 23476.00 35856.0 12380.00
grad_only 58068.00 68306.0 10238.00
na 36470.00 44919.0 8449.00

[The simplest way to do this is to use lambda functions rather than writing our own two functions. We have not covered these in class, but they would not be difficult to learn on your own. Try searching for 'pandas agg pass argument'.]
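For reference, a minimal sketch of the lambda approach, using keyword ('named') aggregation so the result columns get readable names:

# The keyword names (q25, q75) become the column names in the result.
qts = colscd.groupby('type')['earn_10'].agg(q25=lambda x: x.quantile(0.25),
                                            q75=lambda x: x.quantile(0.75))
qts['iqr'] = qts['q75'] - qts['q25']
qts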