Groupby
files needed = ('Most-Recent-Cohorts-Institution.zip')
We often want to know how groups differ. Do workers with econ degrees make more than workers with history degrees? Do men live longer than women? Does it matter how much education you have?
Pandas provides the groupby() method to ease computing statistics by group (docs). This kind of method shows up in many data-oriented computing languages and packages. The idea is summed up as
split-apply-combine
Here is the canonical illustration. The big idea is to
- Split the data up into groups. The groups are defined by key variables.
- Apply some method or function to each group: mean, std, max, etc. This returns a smaller bit of data, often just one number.
- Combine the results of the 'apply' from each group into a new data structure.
Split-apply-combine is an incredibly powerful feature of pandas. We will cover the basics here.
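As a quick preview, here is split-apply-combine on a tiny made-up DataFrame (the data are invented for illustration):

```python
import pandas as pd

# A tiny made-up dataset: degrees and earnings
toy = pd.DataFrame({'degree': ['econ', 'hist', 'econ', 'hist'],
                    'earn':   [60, 50, 70, 40]})

# Split by degree, apply the mean to each group, combine into a new DataFrame
by_degree = toy.groupby('degree').mean()
print(by_degree)
#         earn
# degree
# econ    65.0
# hist    45.0
```

Each group (econ, hist) collapses to a single number per column; the results are combined into a DataFrame indexed by the grouping variable.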
import pandas as pd
import matplotlib.pyplot as plt
College Scorecard
Let's take this opportunity to learn about a new dataset: The College Scorecard. The data are compiled by the Dept. of Education to help students evaluate higher education institutions. The data are very well documented and include such juicy variables as prices, after-program debt levels, earnings, completion rates, and information about student outcomes by family income and other demographic variables.
We will be working off of the 'institution-level' data file.
[There is an API, too, but we would all have to sign up for a key.]
The data set is large. There are thousands of variables. I am creating a dict that holds the variable name in the data set and the name I will use in my DataFrame. This will come in handy...
How did I figure out the variable names and definitions? I read the documentation.
variables = {'CONTROL':'ownership', 'INSTNM':'name',
'STABBR':'state', 'PREDDEG':'type',
'SATVRMID':'sat_read_med', 'SATMTMID':'sat_math_med', 'SATWRMID':'sat_write_med',
'PCIP52':'sh_bus', 'PCIP11':'sh_cs', 'PCIP14':'sh_eng', 'PCIP26':'sh_bio',
'PCIP45':'sh_ss', 'PCIP51':'sh_health', 'PCIP09':'sh_com',
'MD_EARN_WNE_P10':'earn_10', 'GRAD_DEBT_MDN_SUPP':'debt_at_grad'}
I'm doing a couple of things here that we may not have discussed before.
- I'm reading the zip file without first decompressing it. Since there is only one file in the zipped package, pandas reads the file into a DataFrame. Pandas tries to guess the compression format. You can use the compression argument if it isn't able to infer it.
- I'm using the .keys() method of dict to provide the names of the columns that I want loaded. This saves me the trouble of loading all the columns and then keeping only the ones I want. [This is the first place I'll use the variables dict.]
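If pandas cannot infer the compression, you can name it explicitly. A small self-contained sketch (the file path here is invented for illustration; we write a gzipped CSV and read it back):

```python
import os
import tempfile

import pandas as pd

# Hypothetical path, just for this demo
path = os.path.join(tempfile.gettempdir(), 'tiny.csv.gz')

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv(path, index=False, compression='gzip')

# compression='infer' (the default) would also work here,
# since the .gz extension gives the format away
back = pd.read_csv(path, compression='gzip')
```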
colscd = pd.read_csv('Most-Recent-Cohorts-Institution.zip',
na_values=['PrivacySuppressed', 'NULL'],
usecols=variables.keys())
We are loading a new data set. Let's do our checks.
Here, I'm using the .info() method of DataFrame. It gives me the shape of the DataFrame and the dtypes.
colscd.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6543 entries, 0 to 6542
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INSTNM 6543 non-null object
1 STABBR 6543 non-null object
2 PREDDEG 6543 non-null int64
3 CONTROL 6543 non-null int64
4 SATVRMID 982 non-null float64
5 SATMTMID 982 non-null float64
6 SATWRMID 702 non-null float64
7 PCIP09 5765 non-null float64
8 PCIP11 5765 non-null float64
9 PCIP14 5765 non-null float64
10 PCIP26 5765 non-null float64
11 PCIP45 5765 non-null float64
12 PCIP51 5765 non-null float64
13 PCIP52 5765 non-null float64
14 MD_EARN_WNE_P10 5165 non-null float64
15 GRAD_DEBT_MDN_SUPP 4894 non-null float64
dtypes: float64(12), int64(2), object(2)
memory usage: 818.0+ KB
display(colscd.head(2))
display(colscd.tail(2))
INSTNM | STABBR | PREDDEG | CONTROL | SATVRMID | SATMTMID | SATWRMID | PCIP09 | PCIP11 | PCIP14 | PCIP26 | PCIP45 | PCIP51 | PCIP52 | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | AL | 3 | 1 | 475.0 | 460.0 | 414.0 | 0.0000 | 0.0658 | 0.1210 | 0.1690 | 0.0249 | 0.0000 | 0.1637 | 39059.0 | 31000.0 |
1 | University of Alabama at Birmingham | AL | 3 | 1 | 637.0 | 644.0 | NaN | 0.0333 | 0.0229 | 0.0559 | 0.1125 | 0.0239 | 0.2088 | 0.2159 | 50907.0 | 22300.0 |
INSTNM | STABBR | PREDDEG | CONTROL | SATVRMID | SATMTMID | SATWRMID | PCIP09 | PCIP11 | PCIP14 | PCIP26 | PCIP45 | PCIP51 | PCIP52 | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6541 | Pennsylvania State University-Penn State Brand... | PA | 0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 61185.0 | 25000.0 |
6542 | Burlington County Institute of Technology - Ad... | NJ | 0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Data preparation
Let's clean this up a bit.
- Let's rename the variables to something easier to understand. This is the second place I'm using the variables dict.
colscd = colscd.rename(columns = variables)
colscd.head(3)
name | state | type | ownership | sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | AL | 3 | 1 | 475.0 | 460.0 | 414.0 | 0.0000 | 0.0658 | 0.1210 | 0.1690 | 0.0249 | 0.0000 | 0.1637 | 39059.0 | 31000.0 |
1 | University of Alabama at Birmingham | AL | 3 | 1 | 637.0 | 644.0 | NaN | 0.0333 | 0.0229 | 0.0559 | 0.1125 | 0.0239 | 0.2088 | 0.2159 | 50907.0 | 22300.0 |
2 | Amridge University | AL | 3 | 2 | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 37730.0 | 32189.0 |
- Let's recode the categorical variables. The ownership and type variables are coded as integers. I would rather they were easy to understand. We could use statements like colscd.loc[colscd['type']==1, 'type'] = 'cert' for each change we want to make, but that would get tedious. Instead, we will use the pandas method .replace(). We pass the .replace() method a dictionary in which the keys are the objects we want to replace and the values are what we want to replace them with.
Let's take a look.
type_codes = {0:'na', 1:'cert', 2:'asc', 3:'bach', 4:'grad_only'}
colscd['type'] = colscd['type'].replace(type_codes)
own_codes = {1:'Public', 2:'Private nonprofit', 3:'Private profit'}
colscd['ownership'] = colscd['ownership'].replace(own_codes)
colscd.head()
name | state | type | ownership | sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | AL | bach | Public | 475.0 | 460.0 | 414.0 | 0.0000 | 0.0658 | 0.1210 | 0.1690 | 0.0249 | 0.0000 | 0.1637 | 39059.0 | 31000.0 |
1 | University of Alabama at Birmingham | AL | bach | Public | 637.0 | 644.0 | NaN | 0.0333 | 0.0229 | 0.0559 | 0.1125 | 0.0239 | 0.2088 | 0.2159 | 50907.0 | 22300.0 |
2 | Amridge University | AL | bach | Private nonprofit | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 37730.0 | 32189.0 |
3 | University of Alabama in Huntsville | AL | bach | Public | 640.0 | 655.0 | NaN | 0.0140 | 0.0692 | 0.3028 | 0.0504 | 0.0127 | 0.1341 | 0.1930 | 56901.0 | 20705.0 |
4 | Alabama State University | AL | bach | Public | 488.0 | 457.0 | NaN | 0.0923 | 0.0530 | 0.0138 | 0.0786 | 0.0196 | 0.0904 | 0.1513 | 34018.0 | 31000.0 |
Set the index to the university name. How does the UW look?
colscd.set_index('name', inplace=True)
colscd.loc['University of Wisconsin-Madison']
state WI
type bach
ownership Public
sat_read_med 680.0
sat_math_med 730.0
sat_write_med 640.0
sh_com 0.0616
sh_cs 0.0586
sh_eng 0.0909
sh_bio 0.1025
sh_ss 0.0982
sh_health 0.0732
sh_bus 0.1478
earn_10 70586.0
debt_at_grad 20484.0
Name: University of Wisconsin-Madison, dtype: object
We are a public university that predominantly gives out bachelor's degrees. The largest major is business, followed by biology and engineering. Median earnings 10 years after graduation are about $70k, and median debt at graduation is about $20k.
Split-apply-combine
The data are in good shape. Let's explore .groupby().
We look at each step separately first, to see what it does, then we will chain them all together to split-apply-combine in one line of code.
1. Split: groupby( )
We pass .groupby() a key which tells the method which variable to, well, group by. We will group by state. This is the split step.
What is colscd_grouped?
colscd_grouped = colscd.groupby('state')
print(type(colscd_grouped))
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
A DataFrameGroupBy object. This is basically a DataFrame + the grouping information.
What does it look like? A DataFrameGroupBy is an iterable object. It returns subsets of the original DataFrame by group. In our case, the groups are defined by state.
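Since the grouped object is iterable, we can loop over it directly; each pass yields a (key, sub-DataFrame) pair. A sketch on a small made-up stand-in for colscd:

```python
import pandas as pd

# Invented data: a few institutions and their states
toy = pd.DataFrame({'state':   ['WI', 'WI', 'MN'],
                    'earn_10': [50000, 45000, 48000]})

# Iterating over a DataFrameGroupBy yields (key, sub-DataFrame) pairs,
# one per group; groups come out sorted by key by default
for state, group in toy.groupby('state'):
    print(state, len(group))
# MN 1
# WI 2
```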
The .get_group() method returns a single group as a DataFrame.
colscd_grouped.get_group('WI').sort_index()
# colscd_grouped.get_group('AK').sort_index()
state | type | ownership | sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||||||
Academy of Beauty Professionals | WI | cert | Private profit | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 23064.0 | 6333.0 |
Academy of Beauty Professionals | WI | cert | Private profit | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 23064.0 | 6333.0 |
Academy of Beauty Professionals | WI | cert | Private profit | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 23064.0 | 6333.0 |
Advanced College of Cosmetology | WI | cert | Private profit | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | NaN |
Alverno College | WI | bach | Private nonprofit | NaN | NaN | 498.0 | 0.0284 | 0.0000 | 0.0 | 0.0426 | 0.0106 | 0.4397 | 0.0816 | 48547.0 | 27000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Waukesha County Technical College | WI | cert | Public | NaN | NaN | NaN | 0.0000 | 0.0467 | 0.0 | 0.0000 | 0.0000 | 0.4476 | 0.1053 | 43520.0 | 12000.0 |
Western Technical College | WI | asc | Public | NaN | NaN | NaN | 0.0165 | 0.0330 | 0.0 | 0.0000 | 0.0000 | 0.4258 | 0.1940 | 42215.0 | 11500.0 |
Wisconsin Lutheran College | WI | bach | Private nonprofit | 540.0 | 550.0 | NaN | 0.0356 | 0.0222 | 0.0 | 0.0978 | 0.0311 | 0.1156 | 0.3022 | 51291.0 | 26000.0 |
Wisconsin School of Professional Psychology | WI | grad_only | Private nonprofit | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Wright Graduate University for the Realization of Human Potential | WI | grad_only | Private nonprofit | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
93 rows × 15 columns
...and there is every degree-granting institution in Wisconsin from Academy of Beauty Professionals to the Wright Graduate University for the Realization of Human Potential.
2. + 3. Apply and combine
A major use of groupby is to perform some kind of aggregation. This is the apply and combine step. Let's take the grouped data and compute some means.
# apply the mean operator to the grouped data
all_means = colscd_grouped.mean(numeric_only=True)
# what do we get back?
print(type(all_means))
<class 'pandas.core.frame.DataFrame'>
# Ah, a DataFrame. We know what to do with that.
all_means.head(10)
sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
state | ||||||||||||
AK | 620.000000 | 615.000000 | 482.500000 | 0.002011 | 0.023656 | 0.012044 | 0.010633 | 0.012256 | 0.217022 | 0.164744 | 43604.500000 | 18097.800000 |
AL | 551.444444 | 545.111111 | 498.000000 | 0.013615 | 0.023825 | 0.014719 | 0.026375 | 0.010513 | 0.226493 | 0.122677 | 38225.338028 | 19492.215385 |
AR | 546.285714 | 538.714286 | 464.200000 | 0.007577 | 0.016767 | 0.004451 | 0.015688 | 0.009674 | 0.203054 | 0.091672 | 32972.202703 | 14005.735294 |
AS | NaN | NaN | NaN | 0.000000 | 0.022700 | 0.051100 | 0.000000 | 0.005700 | 0.108000 | 0.130700 | 28524.000000 | NaN |
AZ | 560.000000 | 572.500000 | 505.000000 | 0.004801 | 0.020099 | 0.006194 | 0.006886 | 0.008667 | 0.251198 | 0.060630 | 40854.298851 | 12736.276596 |
CA | 639.181818 | 638.757576 | 570.540000 | 0.013127 | 0.024717 | 0.006771 | 0.014038 | 0.029237 | 0.265019 | 0.084106 | 44461.812371 | 14126.306843 |
CO | 582.000000 | 577.000000 | 570.000000 | 0.020616 | 0.036166 | 0.019154 | 0.015472 | 0.018813 | 0.200251 | 0.089144 | 41578.523077 | 15259.739726 |
CT | 620.000000 | 616.812500 | 532.083333 | 0.018118 | 0.015522 | 0.026929 | 0.023099 | 0.038351 | 0.229379 | 0.090135 | 50771.413793 | 15401.830769 |
DC | 642.800000 | 634.600000 | 605.000000 | 0.023706 | 0.052100 | 0.016976 | 0.022359 | 0.087912 | 0.186541 | 0.246371 | 51325.470588 | 19868.222222 |
DE | 628.000000 | 620.000000 | 505.000000 | 0.009580 | 0.033613 | 0.008400 | 0.010433 | 0.019440 | 0.218613 | 0.125167 | 40319.153846 | 15387.500000 |
When we used .mean() on the grouped data, it applied the mean method to each group, which creates one number per group (for each column). It then combined the means into a DataFrame, one number per group per column. Nice.
Notice that the string columns (type and ownership) have been dropped. .mean() doesn't know what to do with string data.
1. + 2. + 3. Split-apply-combine
Computing the grouped data first helped us understand what was happening, but we typically do the whole split-apply-combine in one step. One simple line of code.
all_means = colscd.groupby('state').mean(numeric_only=True)
all_means.head(10)
sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
state | ||||||||||||
AK | 620.000000 | 615.000000 | 482.500000 | 0.002011 | 0.023656 | 0.012044 | 0.010633 | 0.012256 | 0.217022 | 0.164744 | 43604.500000 | 18097.800000 |
AL | 551.444444 | 545.111111 | 498.000000 | 0.013615 | 0.023825 | 0.014719 | 0.026375 | 0.010513 | 0.226493 | 0.122677 | 38225.338028 | 19492.215385 |
AR | 546.285714 | 538.714286 | 464.200000 | 0.007577 | 0.016767 | 0.004451 | 0.015688 | 0.009674 | 0.203054 | 0.091672 | 32972.202703 | 14005.735294 |
AS | NaN | NaN | NaN | 0.000000 | 0.022700 | 0.051100 | 0.000000 | 0.005700 | 0.108000 | 0.130700 | 28524.000000 | NaN |
AZ | 560.000000 | 572.500000 | 505.000000 | 0.004801 | 0.020099 | 0.006194 | 0.006886 | 0.008667 | 0.251198 | 0.060630 | 40854.298851 | 12736.276596 |
CA | 639.181818 | 638.757576 | 570.540000 | 0.013127 | 0.024717 | 0.006771 | 0.014038 | 0.029237 | 0.265019 | 0.084106 | 44461.812371 | 14126.306843 |
CO | 582.000000 | 577.000000 | 570.000000 | 0.020616 | 0.036166 | 0.019154 | 0.015472 | 0.018813 | 0.200251 | 0.089144 | 41578.523077 | 15259.739726 |
CT | 620.000000 | 616.812500 | 532.083333 | 0.018118 | 0.015522 | 0.026929 | 0.023099 | 0.038351 | 0.229379 | 0.090135 | 50771.413793 | 15401.830769 |
DC | 642.800000 | 634.600000 | 605.000000 | 0.023706 | 0.052100 | 0.016976 | 0.022359 | 0.087912 | 0.186541 | 0.246371 | 51325.470588 | 19868.222222 |
DE | 628.000000 | 620.000000 | 505.000000 | 0.009580 | 0.033613 | 0.008400 | 0.010433 | 0.019440 | 0.218613 | 0.125167 | 40319.153846 | 15387.500000 |
Aggregation methods
Some common aggregation methods include .mean(), .sum(), .std(), .describe(), .min(), and .max(), but there are many more. Any function that returns a scalar will work.
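For instance, any user-defined function that maps a column to a single number can serve as an aggregator (the data below are invented for illustration):

```python
import pandas as pd

# Invented data
toy = pd.DataFrame({'type': ['bach', 'bach', 'cert'],
                    'earn': [50.0, 70.0, 30.0]})

# A user-defined scalar function: the range of the data
def spread(x):
    return x.max() - x.min()

by_type = toy.groupby('type')['earn'].agg(spread)
print(by_type)
# type
# bach    20.0
# cert     0.0
```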
groupby( ) on a subset of columns
We may not care about all the columns in our dataset for a particular groupby. We can subset our DataFrame as usual and then compute a groupby. This saves computing statistics we do not want. For a large data set, this can save time.
Let's focus on the median SAT scores. We will group by the 'ownership' variable.
# Grab the cols we want from the df before using the groupby. Remember to keep the grouping variable, too.
sat_medians_1 = colscd[['sat_read_med', 'sat_math_med', 'sat_write_med', 'ownership']].groupby('ownership').median()
sat_medians_1
sat_read_med | sat_math_med | sat_write_med | |
---|---|---|---|
ownership | |||
Private nonprofit | 580.0 | 570.0 | 509.0 |
Private profit | 630.0 | 622.0 | 475.0 |
Public | 565.0 | 551.0 | 495.0 |
Practice
- Create a dataset with only the public institutions from the colscd DataFrame. Name it pub.
The public institutions are the ones with ownership equal to 'Public'.
pub = colscd[colscd['ownership']=='Public']
The .quantile() method (docs) computes quantiles from the data. (e.g., .quantile(0.5) computes the median, the 0.5 quantile.)
- Let's look at a measure of the earnings variation for different institution types:
a. Compute the 75th percentile of 'earn_10' for each type. Name the new DataFrame 'q75'.
b. Compute the 25th percentile of 'earn_10' for each type. Name the new DataFrame 'q25'.
You should have two new DataFrames, each containing one of the quantile statistics.
q75 = pub[['earn_10', 'type']].groupby('type').quantile(0.75)
q25 = pub[['earn_10', 'type']].groupby('type').quantile(0.25)
2d. For each type, compute the difference between the 75th percentile and the 25th percentile.
This is sometimes called the inter-quartile range. It is a measure of the variability of a variable. It is less sensitive to outliers than the standard deviation.
iqr = q75 - q25
iqr
earn_10 | |
---|---|
type | |
asc | 6556.00 |
bach | 14397.75 |
cert | 8625.50 |
grad_only | 3720.50 |
na | 24715.00 |
Understanding the variation in income from different school types could be an interesting project. Let's practice some more.
- How do SAT reading scores and graduate earnings correlate?
a. Compute the median SAT reading score and median 10-year earnings by state.
sat_rw_med = pub[['sat_read_med', 'earn_10', 'state']].groupby('state').median()
sat_rw_med.head()
sat_read_med | earn_10 | |
---|---|---|
state | ||
AK | 620.0 | 46085.0 |
AL | 541.5 | 37250.0 |
AR | 550.0 | 33405.0 |
AS | NaN | 28524.0 |
AZ | 620.0 | 43959.0 |
3b. Create a scatter plot with the median SAT reading score on the x axis and median 10-year income on the y axis.
If you read the seaborn notebook from week 5, you can try a .regplot().
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(sat_rw_med['sat_read_med'], sat_rw_med['earn_10'], color='black', marker='*')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xlabel('median SAT reading score')
ax.set_ylabel('median 10-year earnings')
plt.show()
import seaborn as sns
fig, ax = plt.subplots(figsize=(10,6))
sns.regplot(x=sat_rw_med['sat_read_med'], y=sat_rw_med['earn_10'], color='black', marker='*')
ax.set_xlabel('median SAT reading score')
ax.set_ylabel('median 10-year earnings')
sns.despine()
plt.show()
Several statistics at once
Once we have grouped our data, we have been applying methods that compute a single statistic: mean(), count(), ...
We now introduce the .agg() method, which lets us compute several statistics at once; you can even pass it a user-defined function.
# This is the same as groupby('type').count()
colscd[['earn_10', 'type']].groupby('type').agg('count')
#colscd[['earn_10', 'type']].groupby('type').count()
earn_10 | |
---|---|
type | |
asc | 985 |
bach | 1815 |
cert | 1906 |
grad_only | 21 |
na | 438 |
# But agg() lets us compute many stats at once
colscd[['earn_10', 'sat_read_med', 'type']].groupby('type').agg(['count', 'mean', 'median', 'std', 'max'])
earn_10 | sat_read_med | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
count | mean | median | std | max | count | mean | median | std | max | |
type | ||||||||||
asc | 985 | 39093.609137 | 37765.0 | 9991.976462 | 107401.0 | 16 | 533.375000 | 534.5 | 64.785672 | 720.0 |
bach | 1815 | 52838.200551 | 50273.0 | 15950.906184 | 129442.0 | 963 | 581.755971 | 575.0 | 65.794944 | 760.0 |
cert | 1906 | 30573.251312 | 28400.5 | 10887.433084 | 95079.0 | 3 | 508.666667 | 493.0 | 29.771351 | 543.0 |
grad_only | 21 | 63038.333333 | 58068.0 | 20185.906540 | 100141.0 | 0 | NaN | NaN | NaN | NaN |
na | 438 | 42066.890411 | 42432.0 | 13807.471374 | 99012.0 | 0 | NaN | NaN | NaN | NaN |
Now we have a MultiIndex on the columns.
We can use .agg() to pass a user-defined function, too. We will work on that in the practice problems.
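To pull one statistic out of MultiIndexed columns, index with a (variable, statistic) tuple. A sketch on invented data:

```python
import pandas as pd

# Invented data
toy = pd.DataFrame({'type': ['bach', 'cert', 'bach'],
                    'earn': [50.0, 30.0, 70.0]})

stats = toy.groupby('type').agg(['mean', 'max'])
# The columns are now a MultiIndex: ('earn', 'mean') and ('earn', 'max')

# Select one (variable, statistic) pair by indexing with a tuple
means = stats[('earn', 'mean')]
print(means)
# type
# bach    60.0
# cert    30.0
```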
groupby( ) with many keys
Can we group by several keys? You know we can. Let's compute the medians of the DataFrame this time.
all_medians = colscd.groupby(['ownership','type']).median(numeric_only=True)
all_medians
sat_read_med | sat_math_med | sat_write_med | sh_com | sh_cs | sh_eng | sh_bio | sh_ss | sh_health | sh_bus | earn_10 | debt_at_grad | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ownership | type | ||||||||||||
Private nonprofit | asc | 534.0 | 503.0 | 455.0 | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.13220 | 0.00000 | 38461.0 | 16000.0 |
bach | 580.0 | 570.0 | 510.0 | 0.01080 | 0.00930 | 0.00000 | 0.0419 | 0.01190 | 0.03520 | 0.15140 | 51312.0 | 25000.0 | |
cert | 493.0 | 465.0 | 505.0 | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.51185 | 0.00000 | 29012.0 | 12000.0 | |
grad_only | NaN | NaN | NaN | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 58068.0 | 20000.0 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 44642.0 | 23879.0 | |
Private profit | asc | 540.0 | 505.0 | NaN | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.57960 | 0.00000 | 34922.0 | 18668.0 |
bach | 641.5 | 627.0 | 475.0 | 0.00000 | 0.01350 | 0.00000 | 0.0000 | 0.00000 | 0.00370 | 0.06980 | 42990.0 | 26123.0 | |
cert | NaN | NaN | NaN | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 25408.0 | 9500.0 | |
grad_only | NaN | NaN | NaN | 0.00000 | 0.00000 | 0.00000 | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 72945.0 | 32946.0 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 37168.0 | 18668.0 | |
Public | asc | 527.5 | 497.5 | 430.5 | 0.00000 | 0.02740 | 0.00000 | 0.0000 | 0.00000 | 0.18480 | 0.09750 | 38011.0 | 10500.0 |
bach | 565.0 | 555.0 | 495.0 | 0.03575 | 0.03270 | 0.00835 | 0.0515 | 0.04255 | 0.09500 | 0.16585 | 50715.5 | 20705.0 | |
cert | NaN | NaN | 391.0 | 0.00000 | 0.01445 | 0.00000 | 0.0000 | 0.00000 | 0.30915 | 0.02995 | 35473.0 | 9500.0 | |
grad_only | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 68595.5 | 26814.0 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 42689.0 | 17500.0 |
Now we have a MultiIndexed DataFrame with the summary statistics, this time the medians.
The three ownership types (private nonprofit, private profit, and public) all have institutions that predominantly offer bachelor's degrees. How do the median SAT scores compare across public and private institutions?
Let's grab that set of statistics, for some quick practice with a MultiIndex.
bach_sat_med = all_medians[['sat_read_med', 'sat_math_med', 'sat_write_med']].xs('bach', level='type') # xs() indexes data from a MultiIndex
bach_sat_med
sat_read_med | sat_math_med | sat_write_med | |
---|---|---|---|
ownership | |||
Private nonprofit | 580.0 | 570.0 | 510.0 |
Private profit | 641.5 | 627.0 | 475.0 |
Public | 565.0 | 555.0 | 495.0 |
Practice
- Below is a function that returns the average of the 5 largest elements of a Series (a column of a DataFrame).
The input, x, will be a column of a DataFrame. The output is a single number.
def avg5(x):
top5 = x.sort_values().iloc[-5:] # sort x, then grab the five largest.
return top5.mean() # return the mean
- Test the function on column 'a' of the DataFrame defined below. The answer should be 8.
test = pd.DataFrame({'a':[1, 4, 6, 9, 10, 3, 7, 8], 'b':[2, 3, 4, 5, 6, 7, 8, 10] })
The command would look like:
avg5(test['a'])
test = pd.DataFrame({'a':[1, 4, 6, 9, 10, 3, 7, 8], 'b':[2, 3, 4, 5, 6, 7, 8, 10] })
test
a | b | |
---|---|---|
0 | 1 | 2 |
1 | 4 | 3 |
2 | 6 | 4 |
3 | 9 | 5 |
4 | 10 | 6 |
5 | 3 | 7 |
6 | 7 | 8 |
7 | 8 | 10 |
avg5(test['a'])
8.0
Now return to colscd
- Drop any observation that has 'debt_at_grad' == NaN. Use the .dropna() method (docs).
colscd.dropna(subset=['debt_at_grad'], inplace=True)
- Compute the mean, median, and avg5 of 'debt_at_grad' by 'ownership'. Compute them all at once using .agg(). Since avg5 is user-defined, we pass the name of the function without quotes, e.g., .agg(['mean', avg5]).
colscd.dropna(subset=['debt_at_grad'] , inplace=True) # Drop the missing values
colscd.groupby('ownership')['debt_at_grad'].agg(['mean', 'median', avg5]) # Compute the summary stats
mean | median | avg5 | |
---|---|---|---|
ownership | |||
Private nonprofit | 22270.035165 | 24000.0 | 39597.2 |
Private profit | 12646.573190 | 9500.0 | 42794.6 |
Public | 14655.012642 | 12500.0 | 32100.0 |
Extra practice
If you want to practice some more, try writing two functions: one that returns the 25th percentile and one that returns the 75th percentile.
Then redo question 2 from the first practice, but using only one groupby and the .agg() method.
- Let's look at a measure of the earnings spread for different institution types:
- Compute the 75th percentile of 'earn_10' for each 'type'.
- Compute the 25th percentile of 'earn_10' for each 'type'.
You should have only one new DataFrame containing all of the quantile statistics.
def qt75(x):
return x.quantile(0.75)
def qt25(x):
return x.quantile(0.25)
qts = colscd[['earn_10', 'type']].groupby('type').agg([qt25, qt75])
qts
earn_10 | ||
---|---|---|
qt25 | qt75 | |
type | ||
asc | 35115.25 | 42925.0 |
bach | 43634.00 | 60030.5 |
cert | 23476.00 | 35856.0 |
grad_only | 58068.00 | 68306.0 |
na | 36470.00 | 44919.0 |
2d. For each type, compute the difference between the 75th percentile and the 25th percentile.
This is sometimes called the inter-quartile range. It is a measure of the variability of a variable. It is less sensitive to outliers than the standard deviation.
qts[('earn_10', 'iqr')] = qts[('earn_10', 'qt75')] - qts[('earn_10', 'qt25')]
qts
earn_10 | |||
---|---|---|---|
qt25 | qt75 | iqr | |
type | |||
asc | 35115.25 | 42925.0 | 7809.75 |
bach | 43634.00 | 60030.5 | 16396.50 |
cert | 23476.00 | 35856.0 | 12380.00 |
grad_only | 58068.00 | 68306.0 | 10238.00 |
na | 36470.00 | 44919.0 | 8449.00 |
[The simplest way to do this is to use lambda functions rather than write our own two functions. We have not covered these in class, but they would not be difficult to learn on your own. Try searching for 'pandas agg pass argument.']
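A hedged sketch of the lambda approach on invented data, using named aggregation so the result columns get readable names:

```python
import pandas as pd

# Invented data for illustration
toy = pd.DataFrame({'type':    ['bach', 'bach', 'cert', 'cert'],
                    'earn_10': [40000.0, 60000.0, 20000.0, 30000.0]})

# Named aggregation: each keyword becomes a column in the result,
# and the lambdas carry the quantile argument we want
qts = toy.groupby('type')['earn_10'].agg(
    qt25=lambda x: x.quantile(0.25),
    qt75=lambda x: x.quantile(0.75))

qts['iqr'] = qts['qt75'] - qts['qt25']
print(qts)
```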