
Groupby

files needed = ('Most-Recent-Cohorts-Institution.zip')

We often want to know how groups differ. Do workers with econ degrees make more than workers with history degrees? Do men live longer than women? Does it matter how much education you have?

Pandas provides the groupby( ) method to ease computing statistics by group (docs). This kind of method shows up in many data-oriented computing languages and packages. The idea is summed up as

split-apply-combine

The big idea is to

  1. Split the data up into groups. The groups are defined by key variables.
  2. Apply some method or function to each group: mean, std, max, etc. This returns a smaller bit of data, often just one number.
  3. Combine the results of the 'apply' from each group into a new data structure.

Split-apply-combine is an incredibly powerful feature of pandas. We will cover the basics here.

import pandas as pd
import matplotlib.pyplot as plt

College Scorecard

Let's take this opportunity to learn about a new dataset: the College Scorecard. The data are compiled by the Dept. of Education to help students evaluate higher education institutions. The data are very well documented and include such juicy variables as prices, post-program debt levels, earnings, completion rates, and information about student outcomes by family income and other demographics.

We will be working off of the 'institution-level' data file.

[There is an API, too, but we would all have to sign up for a key.]

The data set is large. There are thousands of variables. I am creating a dict that maps each variable's name in the data set to the name I will use in my DataFrame. This will come in handy...

How did I figure out the variable names and definitions? I read the documentation.

variables = {'CONTROL':'ownership', 'INSTNM':'name', 
             'STABBR':'state', 'PREDDEG':'type', 
             'SATVRMID':'sat_read_med', 'SATMTMID':'sat_math_med', 'SATWRMID':'sat_write_med', 
             'PCIP52':'sh_bus', 'PCIP11':'sh_cs', 'PCIP14':'sh_eng', 'PCIP26':'sh_bio', 
             'PCIP45':'sh_ss', 'PCIP51':'sh_health', 'PCIP09':'sh_com',
             'MD_EARN_WNE_P10':'earn_10', 'GRAD_DEBT_MDN_SUPP':'debt_at_grad'}

I'm doing a couple of things here that we may not have discussed before.

  1. I'm reading the zip file without first decompressing it. Since there is only one file in the zipped package, pandas reads the file into a DataFrame. Pandas tries to guess the compression format. You can use the compression argument if it isn't able to infer it.

  2. I'm using the .keys() method of dict to provide the names of the columns that I want loaded. This saves me the trouble of loading all the columns and then keeping only the ones I want. [This is the first place I'll use the variables dict.]

colscd = pd.read_csv('Most-Recent-Cohorts-Institution.zip', 
                     na_values=['PrivacySuppressed', 'NULL'], 
                     usecols=variables.keys())

We are loading a new data set. Let's do our checks.

Here, I'm using the .info() method of DataFrame. It gives me the shape of the DataFrame and the dtypes.

colscd.info()
display(colscd.head(2))
display(colscd.tail(2))

Data preparation

Let's clean this up a bit.

  1. Let's rename the variables to something easier to understand. This is the second place I'm using the variables dict.
colscd = colscd.rename(columns = variables)
colscd.head(3)
  2. Let's recode the categorical variables. The ownership and type variables are coded as integers. I would rather they were easy to understand. We could use statements like
colscd.loc[colscd['type']==1, 'type'] = 'cert'

for each change we want to make, but that would get tedious. Instead, we will use the pandas method .replace(). We pass the .replace() method a dictionary in which the keys are the objects we want to replace and the values are what we want to replace them with.

Let's take a look.

type_codes = {0:'na', 1:'cert', 2:'asc', 3:'bach', 4:'grad_only'}
colscd['type'] = colscd['type'].replace(type_codes)

own_codes = {1:'Public', 2:'Private nonprofit', 3:'Private profit'}
colscd['ownership'] = colscd['ownership'].replace(own_codes)
colscd.head()

Set the index to the university name. How does the UW look?

colscd.set_index('name', inplace=True)
colscd.loc['University of Wisconsin-Madison']

We are a public university that predominantly gives out bachelor's degrees. The largest major is business, then biology and engineering. Median earnings 10 years after graduation are 70k, and median debt at graduation is 20k.

Split-apply-combine

The data are in good shape. Let's explore .groupby().

We will look at each step separately first, to see what it does. Then we will chain them all together and split-apply-combine in one line of code.

1. Split: groupby( )

We pass .groupby() a key which tells the method which variable to, well, group by. We will group by state. This is the split step.

What is colscd_grouped?

colscd_grouped = colscd.groupby('state')
print(type(colscd_grouped))

A DataFrameGroupBy object. This is basically a DataFrame + the grouping information.

What does it look like? A DataFrameGroupBy is an iterable object. It returns subsets of the original DataFrame by group. In our case, the groups are defined by state.
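We can see the groups by iterating over the grouped object. Here is a quick sketch that peeks at the first few groups:

# each item is a (key, DataFrame) pair: the state abbreviation and that state's rows
for state, group in list(colscd_grouped)[:3]:
    print(state, group.shape)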

The .get_group() method returns a single group.

colscd_grouped.get_group('WI').sort_index()

# colscd_grouped.get_group('AK').sort_index()

...and there is every degree-granting institution in Wisconsin, from Academy of Beauty Professionals to the Wright Graduate University for the Realization of Human Potential.

2. + 3. Apply and combine

A major use of groupby is to perform some kind of aggregation. This is the apply and combine step. Let's take the grouped data and compute some means.

# apply the mean operator to the grouped data
all_means = colscd_grouped.mean(numeric_only=True)  

# what do we get back?
print(type(all_means))
# Ah, a DataFrame. We know what to do with that.
all_means.head(10)

When we used .mean() on the grouped data, it applied the mean method to each group, which creates one number per group (for each column). It then combined the means into a DataFrame, one number per group per column. Nice.

Notice that the string columns (ownership, type) have been dropped. .mean() doesn't know what to do with string data, and numeric_only=True tells it to skip those columns.

1. + 2. + 3. Split-apply-combine

Computing the grouped data first helped us understand what was happening, but we typically do the whole split-apply-combine in one step. One simple line of code.

all_means = colscd.groupby('state').mean(numeric_only=True)
all_means.head(10)

Aggregation methods

Some common aggregation methods include: .mean(), .sum(), .std(), .describe(), .min(), .max(), but there are many more. Any function that returns a scalar will work.
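For example, swapping in .max() returns the largest value of 'earn_10' in each state. A quick sketch:

# same split and combine steps, but apply max instead of mean
colscd.groupby('state')['earn_10'].max().head()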

groupby( ) on a subset of columns

We may not care about all the columns in our dataset for a particular groupby. We can subset our DataFrame as usual and then compute a groupby. This saves computing statistics we do not want. For a large data set, this can save time.

Let's focus on the median SAT scores. We will group by the 'ownership' variable.

# Grab the cols we want from the df before using the groupby. Remember to keep the grouping variable, too.
sat_medians_1 = colscd[['sat_read_med', 'sat_math_med', 'sat_write_med', 'ownership']].groupby('ownership').median()
sat_medians_1
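An equivalent idiom you will often see is to group first and then select the columns. A quick sketch:

# group first, then pick the columns to aggregate; same result as above
sat_medians_2 = colscd.groupby('ownership')[['sat_read_med', 'sat_math_med', 'sat_write_med']].median()
sat_medians_2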

Practice

  1. Create a dataset with only the public institutions from the colscd DataFrame. Name it pub.

The public institutions are the ones with ownership equal to 'Public'.

The .quantile( ) method (docs) computes quantiles from the data (e.g., .quantile(0.5) computes the median, or the 50th percentile).
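A quick sketch on a toy Series shows the mechanics:

# quantile(0.5) is the median; quantile(0.25) is the 25th percentile
s = pd.Series([10, 20, 30, 40, 50])
print(s.quantile(0.5))    # 30.0
print(s.quantile(0.25))   # 20.0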

  2. Let's look at a measure of the earnings variation for different institution types:
    a. Compute the 75th quantile for 'earn_10' for each type. Name the new DataFrame 'q75'.
    b. Compute the 25th quantile for 'earn_10' for each type. Name the new DataFrame 'q25'.

You should have two new DataFrames, each containing one of the quantile statistics.

2c. For each type, compute the difference between the 75th percentile and the 25th percentile.

This is sometimes called the interquartile range. It is a measure of a variable's spread and is less sensitive to outliers than the standard deviation.

Understanding the variation in income from different school types could be an interesting project. Let's practice some more.

  3. How do SAT reading scores and graduate earnings correlate?

    a. Compute the median SAT reading score and median 10-year earnings by state.

3b. Create a scatter plot with the median SAT reading score on the x axis and median 10-year earnings on the y axis.

If you read the seaborn notebook from week 5, you can try a .regplot().

Several statistics at once

So far, we have been applying methods that compute a single statistic to our grouped data: mean(), count(),...

We now introduce the .agg( ) method, which lets us compute several statistics at once. You can even pass it a user-defined function.

# This is the same as groupby('type').count()
colscd[['earn_10', 'type']].groupby('type').agg('count')

#colscd[['earn_10', 'type']].groupby('type').count()
# But agg() lets us compute many stats at once
colscd[['earn_10', 'sat_read_med', 'type']].groupby('type').agg(['count', 'mean', 'median', 'std', 'max'])

Now we have a MultiIndex on the columns.
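We select a column from a MultiIndexed set of columns with a (variable, statistic) tuple. A quick sketch, saving the result from above as stats first:

# save the agg result, then pull out a single (variable, statistic) column
stats = colscd[['earn_10', 'sat_read_med', 'type']].groupby('type').agg(['count', 'mean'])
stats[('earn_10', 'mean')]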

We can use .agg() to pass a user-defined function, too. We will work on that in the practice problems.

groupby( ) with many keys

Can we group by several keys? You know we can. Let's compute the medians of the DataFrame this time.

all_medians = colscd.groupby(['ownership','type']).median(numeric_only=True)
all_medians

Now we have a MultiIndexed DataFrame with the summary statistics, this time the medians.

The three ownership types (private nonprofit, private profit, and public) all have institutions that predominantly offer bachelor's degrees. How do the median SAT scores compare across public and private institutions?

Let's grab that set of statistics for some quick practice with a MultiIndex.

bach_sat_med = all_medians[['sat_read_med', 'sat_math_med', 'sat_write_med']].xs('bach', level='type')         # xs() indexes data from a MultiIndex
bach_sat_med

Practice

  1. Below is a function that returns the average of the 5 largest elements of a Series (a column of a DataFrame).

The input, x, will be a column of a DataFrame. The output is a single number.

def avg5(x):
    top5 = x.sort_values().iloc[-5:]     # sort x, then grab the five largest. 
    return top5.mean()                   # return the mean 
  2. Test the function on column 'a' of the DataFrame defined below. The answer should be 8.
test = pd.DataFrame({'a':[1, 4, 6, 9, 10, 3, 7, 8], 'b':[2, 3, 4, 5, 6, 7, 8, 10] })

The command would look like:

avg5(test['a'])

Now return to colscd

  3. Drop any observation where 'debt_at_grad' is NaN. Use the .dropna() method (docs).
colscd.dropna(subset=['debt_at_grad'], inplace=True)
  4. Compute the mean, median, and avg5 of 'debt_at_grad' by 'ownership'. Compute them all at once using .agg(). Since avg5 is user-defined, we pass the name of the function without quotes, e.g., .agg(['mean', avg5]).

Extra practice

If you want to practice some more, try writing two functions: one returns the 25th percentile and one returns the 75th percentile.

Then redo question 2 from the first practice, but using only one groupby and the .agg() method.

  2. Let's look at a measure of the earnings spread for different institution types:
    a. Compute the 75th quantile for 'earn_10' for each 'type'.
    b. Compute the 25th quantile for 'earn_10' for each 'type'.

You should have only one new DataFrame containing all of the quantile statistics.

2c. For each type, compute the difference between the 75th percentile and the 25th percentile.

This is sometimes called the interquartile range. It is a measure of a variable's spread and is less sensitive to outliers than the standard deviation.

[The simplest way to do this is to use a lambda function rather than write our own two functions. We have not covered these in class, but they would not be difficult to learn on your own. Try searching for 'pandas agg pass argument.']
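To see the pattern (a sketch of the idea, not the full answer): .agg() accepts anonymous lambda functions, and each lambda can call .quantile() with a different argument.

# each lambda fixes a different argument to .quantile();
# pandas names the resulting columns <lambda_0>, <lambda_1>, ...
colscd.groupby('type')['earn_10'].agg([lambda s: s.quantile(0.75),
                                       lambda s: s.quantile(0.25)])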