Difference in differences
Let's practice our regression in python by replicating Card and Krueger (1994), "Minimum wages and employment," one of the most famous difference-in-differences regressions in the literature.
The data are available from Card's website. Download the zip file and unzip it into your working directory.
The paper is also at Card's website.
There are several people on the web who have attempted this replication. I found Aaron Mamula's work useful.
Background
How does the minimum wage affect employment? Theory predicts that a binding minimum wage should decrease employment in occupations or worker types whose equilibrium wage would be less than the minimum wage. This paper looks at New Jersey's 1992 increase in the minimum wage from $4.25 to $5.05. Pennsylvania, a neighboring state, did not change its minimum wage. So the first difference is before versus after the 1992 change, and the second difference is NJ versus PA.
Card and Krueger collected data from fast-food restaurants before and after the policy change. Fast food is a good industry to study because restaurants hire a lot of low-wage workers, including teenagers, who are the most likely to be affected by the change in policy.
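At its simplest, the difference-in-differences estimate is just arithmetic on four group means. A quick sketch using the full-time-equivalent employment means reported in Card and Krueger's table 3:

```python
# Difference-in-differences by hand, using the mean FTE employment
# figures from Card and Krueger's table 3.
nj_before, nj_after = 20.44, 21.03   # NJ mean FTE employment
pa_before, pa_after = 23.33, 21.17   # PA mean FTE employment

# first difference: before vs. after; second difference: NJ vs. PA
did = (nj_after - nj_before) - (pa_after - pa_before)
print(round(did, 2))  # 2.75
```

Employment rose in NJ and fell in PA, so the simple difference-in-differences is positive.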
import pandas as pd # for data handling
import numpy as np # for numerical methods and data structures
from pandas import option_context
import statsmodels.formula.api as smf
Loading the data
Card and Krueger were using SAS. Getting the data into a DataFrame was a bit of a hassle. I dug out the variable names from the codebook and created the list below.
For practice, you could try reading in the codebook and using readline()
and the like to extract the variable names and create the list.
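Here is a minimal sketch of that readline() approach. The layout below is a stand-in (one variable per line, name first, description after), not the actual format of the codebook, so the parsing would need adjusting for the real file.

```python
# Extract variable names line by line with readline(). The fake codebook
# below is an assumption about the layout, not the real file's format.
from io import StringIO

fake_codebook = StringIO(
    "SHEET     sheet number (unique store id)\n"
    "CHAIN     restaurant chain code\n"
    "CO_OWNED  1 if company owned\n"
)

names = []
line = fake_codebook.readline()
while line:                           # readline() returns '' at end of file
    names.append(line.split()[0])     # first token on each line is the name
    line = fake_codebook.readline()

print(names)  # ['SHEET', 'CHAIN', 'CO_OWNED']
```

For the real codebook you would open the file with `open(...)` and skip any header or section lines before the variable list.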
c = ['SHEET', 'CHAIN', 'CO_OWNED', 'STATE', 'SOUTHJ','CENTRALJ','NORTHJ','PA1','PA2','SHORE','NCALLS','EMPFT','EMPPT',
'NMGRS' ,'WAGE_ST','INCTIME','FIRSTINC','BONUS','PCTAFF','MEALS','OPEN','HRSOPEN','PSODA','PFRY','PENTREE','NREGS',
'NREGS11','TYPE2','STATUS2','DATE2','NCALLS2','EMPFT2','EMPPT2','NMGRS2','WAGE_ST2','INCTIME2','FIRSTIN2','SPECIAL2',
'MEALS2','OPEN2R','HRSOPEN2','PSODA2','PFRY2','PENTREE2','NREGS2','NREGS112']
The data file is in fixed-width format. I'm using pd.read_fwf() to read in the file. Pandas will guess at the layout (which character columns define each variable), and it got it right this time. If pandas cannot figure it out, you can pass it a column spec. (docs)
I'm only keeping the columns I need. Then I'm recoding the state variable so I don't have to remember that NJ = 1.
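In case the guess had been wrong, here is what passing a column spec might look like. The file and boundaries below are made up for illustration; they are not the layout of public.dat.

```python
# Passing explicit column boundaries to read_fwf on a toy fixed-width file.
# The (start, end) positions below are invented for this example.
import pandas as pd
from io import StringIO

toy = StringIO(
    " 46 1 30.0\n"
    " 49 0  6.5\n"
)

# each tuple is (start, end) in character positions, end-exclusive
df = pd.read_fwf(toy, colspecs=[(0, 3), (4, 5), (6, 10)],
                 header=None, names=['SHEET', 'STATE', 'EMPFT'])
print(df['SHEET'].tolist())  # [46, 49]
```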
ck = pd.read_fwf('njmin/public.dat', header=None, na_values='.', names=c)
ck = ck[['SHEET', 'STATE', 'EMPFT', 'EMPPT', 'EMPFT2', 'EMPPT2', 'NMGRS', 'NMGRS2',
'STATUS2', 'WAGE_ST', 'WAGE_ST2', 'CO_OWNED', 'CHAIN',
'SOUTHJ', 'CENTRALJ', 'NORTHJ', 'PA1', 'PA2', 'SHORE']]
ck['STATE'] = ck['STATE'].replace({1:'NJ', 0:'PA'})
ck.head(3)
 | SHEET | STATE | EMPFT | EMPPT | EMPFT2 | EMPPT2 | NMGRS | NMGRS2 | STATUS2 | WAGE_ST | WAGE_ST2 | CO_OWNED | CHAIN | SOUTHJ | CENTRALJ | NORTHJ | PA1 | PA2 | SHORE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 46 | PA | 30.0 | 15.0 | 3.5 | 35.0 | 3.0 | 3.0 | 1 | NaN | 4.30 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0
1 | 49 | PA | 6.5 | 6.5 | 0.0 | 15.0 | 4.0 | 4.0 | 1 | NaN | 4.45 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0
2 | 506 | PA | 3.0 | 7.0 | 3.0 | 7.0 | 2.0 | 4.0 | 1 | NaN | 5.00 | 1 | 2 | 0 | 0 | 0 | 1 | 0 | 0
Create the full-time equivalent variable
The total employment variable is full-time employees, plus managers, plus one-half the part-time employees.
ck['FTE_before'] = ck['EMPFT'] + ck['NMGRS'] + 0.5*ck['EMPPT']
ck['FTE_after'] = ck['EMPFT2'] + ck['NMGRS2'] + 0.5*ck['EMPPT2']
CK's Table 3
Here is the northwest corner of table 3. Let's recreate it.
 | PA | NJ | NJ-PA
---|---|---|---
FTE before | 23.33 | 20.44 | -2.89
 | (1.35) | (0.51) | (1.44)
FTE after | 21.17 | 21.03 | -0.14
 | (0.94) | (0.52) | (1.07)
t3 = ck[['STATE', 'FTE_before', 'FTE_after']].groupby('STATE').agg(['mean', 'sem']).transpose()
t3 = t3[['PA', 'NJ']]
t3['NJ-PA'] = t3['NJ'] - t3['PA']
# Fix the standard error of the mean differences
t3.loc[('FTE_before', 'sem'), 'NJ-PA'] = (t3.loc[('FTE_before', 'sem'), 'PA']**2 + t3.loc[('FTE_before', 'sem'), 'NJ']**2)**(0.5)
t3.loc[('FTE_after', 'sem'), 'NJ-PA'] = (t3.loc[('FTE_after', 'sem'), 'PA']**2 + t3.loc[('FTE_after', 'sem'), 'NJ']**2)**(0.5)
with option_context('display.precision', 2):
display(t3)
 | | PA | NJ | NJ-PA
---|---|---|---|---
FTE_before | mean | 23.33 | 20.41 | -2.92
 | sem | 1.35 | 0.51 | 1.44
FTE_after | mean | 21.17 | 21.03 | -0.14
 | sem | 0.94 | 0.52 | 1.08
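The NJ-PA column above combines the two standard errors with the independent-samples formula, se(diff) = sqrt(se_PA² + se_NJ²). A quick check of that arithmetic on toy numbers:

```python
# Standard error of a difference in means for two independent samples.
# The series below are stand-in numbers, not the CK data.
import pandas as pd

a = pd.Series([20.0, 22.0, 24.0, 26.0])   # stand-in for PA FTE employment
b = pd.Series([18.0, 19.0, 21.0, 22.0])   # stand-in for NJ FTE employment

# sem() is std/sqrt(n); combine the two under independence
se_diff = (a.sem() ** 2 + b.sem() ** 2) ** 0.5
print(round(se_diff, 3))  # 1.581
```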
The difference between the PA and NJ stores actually shrinks after NJ raises its minimum wage. NJ stores are, on average, hiring more people and PA stores are hiring fewer people.
Interesting first cut of the data. The next step is to control for some other variables and see if the result holds up.
CK's Table 4, model 1
Here goes the first regression:

\[\Delta E_i = \alpha + \beta \, NJ_i + \varepsilon_i\]

where \(\Delta E_i\) is the change in employment (in store \(i\)) before and after the minimum wage change and \(NJ_i\) is equal to 1 if the store is in New Jersey and 0 otherwise.
# I'm trying to follow the sample selection given in the notes of table 4. I have 6 too few stores. Not sure why.
# employment > 0 before AND after, and reported wages before AND after
rd = ck[ ck['FTE_after'].notna() & ck['FTE_before'].notna() & ck['WAGE_ST'].notna() & ck['WAGE_ST2'].notna() ].copy()
rd['Emp_diff'] = rd['FTE_after'] - rd['FTE_before']
res1 = smf.ols("Emp_diff ~ C(STATE, Treatment(reference='PA'))", data=rd).fit()
ser1 = res1.scale**0.5
print(res1.summary())
print(f'\nThe standard error of the regression is {ser1:.2f}')
OLS Regression Results
==============================================================================
Dep. Variable: Emp_diff R-squared: 0.011
Model: OLS Adj. R-squared: 0.008
Method: Least Squares F-statistic: 3.773
Date: Thu, 07 Mar 2024 Prob (F-statistic): 0.0529
Time: 10:18:11 Log-Likelihood: -1256.9
No. Observations: 351 AIC: 2518.
Df Residuals: 349 BIC: 2526.
Df Model: 1
Covariance Type: nonrobust
=============================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------
Intercept -1.8788 1.073 -1.752 0.081 -3.988 0.231
C(STATE, Treatment(reference='PA'))[T.NJ] 2.3119 1.190 1.942 0.053 -0.029 4.653
==============================================================================
Omnibus: 39.407 Durbin-Watson: 2.128
Prob(Omnibus): 0.000 Jarque-Bera (JB): 103.223
Skew: -0.527 Prob(JB): 3.85e-23
Kurtosis: 5.439 Cond. No. 4.41
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The standard error of the regression is 8.71
Increasing the minimum wage did not cause employment to fall. That result has earned this paper over 5,000 Google Scholar citations.
My results are close, but not identical to those in the published paper. I don't have the sample selection correct. I suspect I could run the SAS code and figure it out...
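One way to read this regression: with a single 0/1 regressor, the OLS slope is exactly the difference in group means, so the 2.3119 above is mean Emp_diff in NJ minus mean Emp_diff in PA. A toy check (synthetic data, not the CK sample):

```python
# With one dummy regressor, the OLS slope equals the difference in group
# means. The data below are synthetic, invented for this check.
import pandas as pd
import statsmodels.formula.api as smf

toy = pd.DataFrame({
    'Emp_diff': [1.0, 3.0, 2.0, -1.0, 0.0, -2.0],
    'STATE':    ['NJ'] * 3 + ['PA'] * 3,
})

fit = smf.ols("Emp_diff ~ C(STATE, Treatment(reference='PA'))", data=toy).fit()
slope = [v for k, v in fit.params.items() if 'NJ' in k][0]

means = toy.groupby('STATE')['Emp_diff'].mean()
gap = means['NJ'] - means['PA']
print(round(slope, 6), round(gap, 6))  # 3.0 3.0
```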
Practice
- Try some of the other regressions in table 4. For models (iii)–(v) you will need to create the gap measure; it is defined on page 779. You likely won't get the coefficients exactly right (I haven't been able to match the sample correctly), but your numbers should be in the ballpark.
- Once you have estimated a couple of models, try putting the results into a table similar to table 4.
- CHAIN is the control for the restaurant chain; it takes values 1, 2, 3, or 4. CO_OWNED is the control for company ownership (rather than a franchise); it takes values 0 or 1. WAGE_ST is the wage in the "before" period. SOUTHJ, CENTRALJ, NORTHJ, PA1, PA2, and SHORE are the controls for region; each takes values 0 or 1.
Model 2
res2 = smf.ols("Emp_diff ~ C(STATE, Treatment(reference='PA')) + C(CHAIN) + CO_OWNED", data=rd).fit()
ser2 = res2.scale**0.5
print(res2.summary())
print(f'\nThe standard error of the regression is {ser2:.2f}')
OLS Regression Results
==============================================================================
Dep. Variable: Emp_diff R-squared: 0.021
Model: OLS Adj. R-squared: 0.007
Method: Least Squares F-statistic: 1.483
Date: Thu, 07 Mar 2024 Prob (F-statistic): 0.195
Time: 10:18:11 Log-Likelihood: -1255.1
No. Observations: 351 AIC: 2522.
Df Residuals: 345 BIC: 2545.
Df Model: 5
Covariance Type: nonrobust
=============================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------
Intercept -1.4080 1.209 -1.164 0.245 -3.787 0.971
C(STATE, Treatment(reference='PA'))[T.NJ] 2.3194 1.197 1.938 0.053 -0.034 4.673
C(CHAIN)[T.2] 0.1696 1.296 0.131 0.896 -2.380 2.719
C(CHAIN)[T.3] -2.1435 1.321 -1.623 0.106 -4.741 0.454
C(CHAIN)[T.4] -0.8225 1.491 -0.552 0.581 -3.754 2.109
CO_OWNED 0.3514 1.098 0.320 0.749 -1.809 2.512
==============================================================================
Omnibus: 42.947 Durbin-Watson: 2.134
Prob(Omnibus): 0.000 Jarque-Bera (JB): 113.927
Skew: -0.573 Prob(JB): 1.82e-25
Kurtosis: 5.545 Cond. No. 5.75
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The standard error of the regression is 8.72
Model 3
rd['gap'] = 0
rd.loc[ (rd['WAGE_ST']<5.05) & (rd['STATE']=='NJ'), 'gap'] = (5.05-rd['WAGE_ST'])/rd['WAGE_ST']
res3 = smf.ols("Emp_diff ~ gap", data=rd).fit()
ser3 = res3.scale**0.5
print(res3.summary())
print(f'\nThe standard error of the regression is {ser3:.2f}')
OLS Regression Results
==============================================================================
Dep. Variable: Emp_diff R-squared: 0.021
Model: OLS Adj. R-squared: 0.019
Method: Least Squares F-statistic: 7.638
Date: Thu, 07 Mar 2024 Prob (F-statistic): 0.00602
Time: 10:18:11 Log-Likelihood: -1255.0
No. Observations: 351 AIC: 2514.
Df Residuals: 349 BIC: 2522.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1.4309 0.694 -2.062 0.040 -2.796 -0.066
gap 16.8358 6.092 2.764 0.006 4.854 28.817
==============================================================================
Omnibus: 39.631 Durbin-Watson: 2.149
Prob(Omnibus): 0.000 Jarque-Bera (JB): 104.133
Skew: -0.529 Prob(JB): 2.44e-23
Kurtosis: 5.450 Cond. No. 13.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The standard error of the regression is 8.67
Model 4
res4 = smf.ols("Emp_diff ~ gap + C(CHAIN) + CO_OWNED", data=rd).fit()
ser4 = res4.scale**0.5
print(res4.summary())
print(f'\nThe standard error of the regression is {ser4:.2f}')
OLS Regression Results
==============================================================================
Dep. Variable: Emp_diff R-squared: 0.029
Model: OLS Adj. R-squared: 0.015
Method: Least Squares F-statistic: 2.065
Date: Thu, 07 Mar 2024 Prob (F-statistic): 0.0693
Time: 10:18:11 Log-Likelihood: -1253.6
No. Observations: 351 AIC: 2519.
Df Residuals: 345 BIC: 2542.
Df Model: 5
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -1.1310 0.959 -1.180 0.239 -3.016 0.755
C(CHAIN)[T.2] 0.3326 1.287 0.258 0.796 -2.199 2.864
C(CHAIN)[T.3] -1.7904 1.317 -1.360 0.175 -4.380 0.799
C(CHAIN)[T.4] -0.2497 1.505 -0.166 0.868 -3.211 2.711
gap 16.0698 6.238 2.576 0.010 3.800 28.340
CO_OWNED 0.4834 1.096 0.441 0.660 -1.673 2.640
==============================================================================
Omnibus: 42.491 Durbin-Watson: 2.150
Prob(Omnibus): 0.000 Jarque-Bera (JB): 112.637
Skew: -0.567 Prob(JB): 3.48e-25
Kurtosis: 5.533 Cond. No. 15.5
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The standard error of the regression is 8.68
Model 5
res5 = smf.ols("Emp_diff ~ gap + C(CHAIN) + CO_OWNED + SOUTHJ+CENTRALJ+NORTHJ+PA1+PA2+SHORE", data=rd).fit()
ser5 = res5.scale**0.5
print(res5.summary())
print(f'\nThe standard error of the regression is {ser5:.2f}')
OLS Regression Results
==============================================================================
Dep. Variable: Emp_diff R-squared: 0.050
Model: OLS Adj. R-squared: 0.022
Method: Least Squares F-statistic: 1.786
Date: Thu, 07 Mar 2024 Prob (F-statistic): 0.0619
Time: 10:18:11 Log-Likelihood: -1249.8
No. Observations: 351 AIC: 2522.
Df Residuals: 340 BIC: 2564.
Df Model: 10
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -0.8557 0.828 -1.034 0.302 -2.484 0.772
C(CHAIN)[T.2] 0.4422 1.303 0.339 0.734 -2.120 3.005
C(CHAIN)[T.3] -1.9907 1.332 -1.495 0.136 -4.610 0.629
C(CHAIN)[T.4] 0.1452 1.520 0.096 0.924 -2.844 3.135
gap 13.0286 7.431 1.753 0.080 -1.587 27.644
CO_OWNED 0.1365 1.114 0.123 0.902 -2.054 2.327
SOUTHJ 1.0201 1.072 0.952 0.342 -1.088 3.128
CENTRALJ -0.4718 1.168 -0.404 0.686 -2.768 1.825
NORTHJ 0.6560 0.827 0.793 0.428 -0.971 2.283
PA1 -3.1320 1.496 -2.093 0.037 -6.075 -0.189
PA2 1.0719 1.312 0.817 0.415 -1.510 3.654
SHORE -2.6303 1.692 -1.555 0.121 -5.958 0.698
==============================================================================
Omnibus: 43.744 Durbin-Watson: 2.180
Prob(Omnibus): 0.000 Jarque-Bera (JB): 115.730
Skew: -0.586 Prob(JB): 7.41e-26
Kurtosis: 5.558 Cond. No. 1.12e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.45e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
The standard error of the regression is 8.65
Putting the coefficients into one table
r1 = [ r.params.loc["C(STATE, Treatment(reference='PA'))[T.NJ]"] for r in [res1, res2]] + [np.nan]*3
r2 = [np.nan]*2 + [ r.params.loc["gap"] for r in [res3, res4, res5]]
r3 = ['no', 'yes', 'no', 'yes', 'yes']
r4 = ['no']*4+['yes']
r5 = [r.scale**0.5 for r in [res1, res2, res3, res4, res5]]
res_table = pd.DataFrame([r1, r2, r3, r4, r5],
index=['New Jersey dummy', 'Initial wage gap', 'Controls for chain and ownership', 'Controls for region', 'Standard error of regression'],
columns = ['(i)', '(ii)', '(iii)', '(iv)', '(v)'])
res_table.style.format(precision=2)
 | (i) | (ii) | (iii) | (iv) | (v)
---|---|---|---|---|---
New Jersey dummy | 2.31 | 2.32 | nan | nan | nan
Initial wage gap | nan | nan | 16.84 | 16.07 | 13.03
Controls for chain and ownership | no | yes | no | yes | yes
Controls for region | no | no | no | no | yes
Standard error of regression | 8.71 | 8.72 | 8.67 | 8.68 | 8.65
An alternative formulation
You may be more familiar with a different (but equivalent) formulation of the regression. It uses the level of the dependent variable and a "double" fixed effect to compute the difference. In our case, this would be:

\[E_{it} = \alpha + \gamma \, NJ_i + \lambda \, d_t + \delta \, (NJ_i \times d_t) + \varepsilon_{it}\]

where \(t\) is equal to before or after, since there are only two periods in our data, and \(d_t\) is a dummy equal to 1 in the after period. The difference-in-differences estimate is \(\delta\), the coefficient on the interaction term.
To implement this, we need to reshape the data into a panel, then specify the new regression.
This idea, and the solution, is the work of Haotian Luo, MS 2024, lightly edited.
# Transform the dataframe
df = rd[['SHEET','STATE','FTE_before','FTE_after']]
df = df.set_index(['STATE','SHEET'])
df = df.stack().reset_index()
df = df.rename(columns={'level_2': 'time', 0: 'employment'})
# Create a time variable
df['time'] = df['time'].replace({'FTE_before' : 'before', 'FTE_after':'after' })
df.head()
 | STATE | SHEET | time | employment
---|---|---|---|---
0 | PA | 56 | before | 34.0
1 | PA | 56 | after | 20.0
2 | PA | 61 | before | 24.0
3 | PA | 61 | after | 35.5
4 | PA | 445 | before | 70.5
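As an aside, the same long format can be built with pd.melt() instead of set_index() and stack(). A sketch on a couple of rows:

```python
# Reshaping wide-to-long with pd.melt: each FTE_* column becomes a row,
# tagged by a 'time' variable. The two rows below are stand-in data.
import pandas as pd

wide = pd.DataFrame({
    'SHEET': [56, 61],
    'STATE': ['PA', 'PA'],
    'FTE_before': [34.0, 24.0],
    'FTE_after': [20.0, 35.5],
})

long = wide.melt(id_vars=['SHEET', 'STATE'],
                 value_vars=['FTE_before', 'FTE_after'],
                 var_name='time', value_name='employment')
long['time'] = long['time'].str.replace('FTE_', '', regex=False)
print(long['time'].tolist())  # ['before', 'before', 'after', 'after']
```

Note that melt stacks all 'before' rows, then all 'after' rows, rather than interleaving by store, but the regression does not care about row order.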
# Regression
result = smf.ols("employment ~ C(time,Treatment(reference='before'))*C(STATE, Treatment(reference='PA'))", data=df).fit()
print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: employment R-squared: 0.009
Model: OLS Adj. R-squared: 0.004
Method: Least Squares F-statistic: 2.045
Date: Thu, 07 Mar 2024 Prob (F-statistic): 0.106
Time: 10:18:12 Log-Likelihood: -2560.6
No. Observations: 702 AIC: 5129.
Df Residuals: 698 BIC: 5147.
Df Model: 3
Covariance Type: nonrobust
=============================================================================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Intercept 23.7045 1.146 20.676 0.000 21.454 25.956
C(time, Treatment(reference='before'))[T.after] -1.8788 1.621 -1.159 0.247 -5.062 1.305
C(STATE, Treatment(reference='PA'))[T.NJ] -3.0614 1.272 -2.406 0.016 -5.559 -0.563
C(time, Treatment(reference='before'))[T.after]:C(STATE, Treatment(reference='PA'))[T.NJ] 2.3119 1.799 1.285 0.199 -1.221 5.845
==============================================================================
Omnibus: 218.752 Durbin-Watson: 1.352
Prob(Omnibus): 0.000 Jarque-Bera (JB): 879.029
Skew: 1.400 Prob(JB): 1.32e-191
Kurtosis: 7.713 Cond. No. 11.5
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Notice that the coefficient on C(time, Treatment(reference='before'))[T.after]:C(STATE, Treatment(reference='PA'))[T.NJ] is 2.3119, the same as the coefficient on C(STATE, Treatment(reference='PA'))[T.NJ] in the first model. Likewise, the fixed effect for C(time, Treatment(reference='before'))[T.after] is -1.8788, the same as the intercept in the first model.
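To see why, note that in the saturated two-period, two-group model the interaction coefficient is algebraically the difference-in-differences of the four cell means. A check on synthetic data (not the CK sample):

```python
# In a 2x2 design, the OLS interaction coefficient equals the DiD of the
# four cell means. The panel below is synthetic, invented for this check.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.DataFrame({
    'employment': [20.0, 24.0, 18.0, 25.0, 22.0, 21.0, 21.0, 25.0],
    'STATE': ['PA'] * 4 + ['NJ'] * 4,
    'time':  ['before', 'before', 'after', 'after'] * 2,
})

fit = smf.ols("employment ~ C(time, Treatment(reference='before'))"
              "*C(STATE, Treatment(reference='PA'))", data=panel).fit()
inter = [v for k, v in fit.params.items() if ':' in k][0]  # interaction term

cell = panel.groupby(['STATE', 'time'])['employment'].mean()
did = (cell['NJ', 'after'] - cell['NJ', 'before']) \
    - (cell['PA', 'after'] - cell['PA', 'before'])
print(round(inter, 6), round(did, 6))  # 2.0 2.0
```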