
Difference in differences

Let's practice regression in Python by replicating Card and Krueger (1994), "Minimum wages and employment," one of the most famous difference-in-differences regressions in the literature.

The data are available from Card's website. Download the zip file and unzip it into your working directory.

The paper is also at Card's website.

There are several people on the web who have attempted this replication. I found Aaron Mamula's work useful.

Background

How does the minimum wage affect employment? Theory predicts that a binding minimum wage should decrease employment in occupations or worker types whose equilibrium wage would be less than the minimum wage. This paper studies New Jersey's 1992 increase in its minimum wage from $4.25 to $5.05. Pennsylvania, a neighboring state, did not change its minimum wage. The first difference is before versus after the 1992 change; the second difference is NJ versus PA.

Card and Krueger collected data from fast-food restaurants before and after the policy change. Fast food is a good industry to study because it hires a lot of low-wage workers, including teenagers, who are the most likely to be affected by the change in policy.

import pandas as pd                    # for data handling
import numpy as np                     # for numerical methods and data structures
from pandas import option_context

import statsmodels.formula.api as smf 

Loading the data

Card and Krueger were using SAS, so getting the data into a DataFrame took a bit of work. I dug the variable names out of the codebook and created the list below.

For practice, you could try reading in the codebook and using readline() and the like to extract the variable names and create the list.
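If you try that, a minimal sketch might look like the following. The lines below are made up to mimic a codebook-style layout (the real file almost certainly differs), and the parsing rule, taking the first whitespace-delimited token on each line, is an assumption you would need to check against the actual codebook.

```python
# Made-up lines mimicking a codebook layout -- the real file's format may differ.
codebook_lines = [
    "SHEET      sheet number (unique store id)",
    "CHAIN      chain identifier, 1-4",
    "CO_OWNED   1 if company owned",
]

# Assumed rule: the variable name is the first whitespace-delimited token.
names = [line.split()[0] for line in codebook_lines if line.strip()]
print(names)  # ['SHEET', 'CHAIN', 'CO_OWNED']
```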

c = ['SHEET', 'CHAIN', 'CO_OWNED', 'STATE', 'SOUTHJ','CENTRALJ','NORTHJ','PA1','PA2','SHORE','NCALLS','EMPFT','EMPPT',
'NMGRS' ,'WAGE_ST','INCTIME','FIRSTINC','BONUS','PCTAFF','MEALS','OPEN','HRSOPEN','PSODA','PFRY','PENTREE','NREGS',
'NREGS11','TYPE2','STATUS2','DATE2','NCALLS2','EMPFT2','EMPPT2','NMGRS2','WAGE_ST2','INCTIME2','FIRSTIN2','SPECIAL2',
'MEALS2','OPEN2R','HRSOPEN2','PSODA2','PFRY2','PENTREE2','NREGS2','NREGS112']

The data file is in fixed-width format. I'm using pd.read_fwf() to read in the file. Pandas will guess at the layout (how many columns define a variable), and it got it right this time. If pandas cannot figure it out, you can pass it a column spec. (docs)
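If you ever do need to spell out the layout yourself, a column spec is a list of (start, end) character intervals, one per column. Here is a toy sketch; the intervals and column names below are invented for this example, not the layout of public.dat.

```python
import io
import pandas as pd

# Toy fixed-width text: each column occupies fixed character positions.
raw = "  46 1 30.0\n  49 0  6.5\n"

# (start, end) half-open character intervals -- invented for this toy file.
colspecs = [(0, 4), (5, 6), (7, 11)]
toy = pd.read_fwf(io.StringIO(raw), colspecs=colspecs,
                  header=None, names=['SHEET', 'STATE', 'EMPFT'])
print(toy)
```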

I'm only keeping the columns I need. Then I'm recoding the state variable so I don't have to remember that NJ = 1.

ck = pd.read_fwf('njmin/public.dat', header=None, na_values='.', names=c)
ck = ck[['SHEET', 'STATE', 'EMPFT', 'EMPPT', 'EMPFT2', 'EMPPT2', 'NMGRS', 'NMGRS2', 
         'STATUS2', 'WAGE_ST', 'WAGE_ST2', 'CO_OWNED', 'CHAIN', 
         'SOUTHJ', 'CENTRALJ', 'NORTHJ', 'PA1', 'PA2', 'SHORE']]
ck['STATE'] = ck['STATE'].replace({1:'NJ', 0:'PA'})

ck.head(3)
SHEET STATE EMPFT EMPPT EMPFT2 EMPPT2 NMGRS NMGRS2 STATUS2 WAGE_ST WAGE_ST2 CO_OWNED CHAIN SOUTHJ CENTRALJ NORTHJ PA1 PA2 SHORE
0 46 PA 30.0 15.0 3.5 35.0 3.0 3.0 1 NaN 4.30 0 1 0 0 0 1 0 0
1 49 PA 6.5 6.5 0.0 15.0 4.0 4.0 1 NaN 4.45 0 2 0 0 0 1 0 0
2 506 PA 3.0 7.0 3.0 7.0 2.0 4.0 1 NaN 5.00 1 2 0 0 0 1 0 0

Create the full-time equivalent variable

The total employment variable is full-time employees, plus managers, plus one-half the part-time employees.

ck['FTE_before'] = ck['EMPFT']  + ck['NMGRS']  + 0.5*ck['EMPPT']
ck['FTE_after']  = ck['EMPFT2'] + ck['NMGRS2'] + 0.5*ck['EMPPT2']
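One thing to keep in mind: if any component of the sum is missing, the total is missing too, so a store with, say, no reported EMPFT gets a NaN FTE. A quick illustration on made-up numbers:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'EMPFT': [30.0, np.nan],
                    'NMGRS': [3.0, 4.0],
                    'EMPPT': [15.0, 6.5]})

# A NaN in any component propagates to the total.
toy['FTE'] = toy['EMPFT'] + toy['NMGRS'] + 0.5 * toy['EMPPT']
print(toy['FTE'].tolist())  # [40.5, nan]
```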

CK's Table 3

Here is the northwest corner of table 3. Let's recreate it.

             PA       NJ      NJ-PA
FTE before   23.33    20.44   -2.89
             (1.35)   (0.51)  (1.44)
FTE after    21.17    21.03   -0.14
             (0.94)   (0.52)  (1.07)
t3 = ck[['STATE', 'FTE_before', 'FTE_after']].groupby('STATE').agg(['mean', 'sem']).transpose()
t3 = t3[['PA', 'NJ']]
t3['NJ-PA'] = t3['NJ'] - t3['PA']

# Fix the standard error of the mean differences
t3.loc[('FTE_before', 'sem'), 'NJ-PA'] = (t3.loc[('FTE_before', 'sem'), 'PA']**2 + t3.loc[('FTE_before', 'sem'), 'NJ']**2)**(0.5)
t3.loc[('FTE_after', 'sem'),  'NJ-PA'] = (t3.loc[('FTE_after', 'sem'),  'PA']**2 + t3.loc[('FTE_after',  'sem'), 'NJ']**2)**(0.5)

with option_context('display.precision', 2):
    display(t3)
STATE PA NJ NJ-PA
FTE_before mean 23.33 20.41 -2.92
sem 1.35 0.51 1.44
FTE_after mean 21.17 21.03 -0.14
sem 0.94 0.52 1.08
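The fix applied in the code above is the usual formula for the standard error of a difference of two independent means, \(\sqrt{se_1^2 + se_2^2}\). Checking it by hand with the FTE-before values:

```python
# SE of a difference of independent means: sqrt(se_PA**2 + se_NJ**2).
se_pa, se_nj = 1.35, 0.51          # Table 3 "FTE before" standard errors
se_diff = (se_pa**2 + se_nj**2) ** 0.5
print(round(se_diff, 2))  # 1.44
```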

The difference between the PA and NJ stores actually shrinks after NJ raises its minimum wage. NJ stores are, on average, hiring more people and PA stores are hiring fewer people.

Interesting first cut of the data. The next step is to control for some other variables and see if the result holds up.

CK's Table 4, model 1

Here goes the first regression.

$$\Delta E_i = a + b \text{NJ}_i+\epsilon_i$$

where \(\Delta E_i\) is the change in employment at store \(i\) from before to after the minimum wage change, and \(\text{NJ}_i\) is equal to 1 if the store is in New Jersey and 0 otherwise.

# I'm trying to follow the sample selection given in the notes of table 4. I have 6 too few stores. Not sure why.
# employment>0 before AND employment>0 after AND reported wages before AND reported wages after
rd = ck[ ck['FTE_after'].notna() & ck['FTE_before'].notna() &
         ck['WAGE_ST'].notna() & ck['WAGE_ST2'].notna() ].copy()

rd['Emp_diff'] = rd['FTE_after'] - rd['FTE_before']
res1 = smf.ols("Emp_diff ~ C(STATE, Treatment(reference='PA'))", data=rd).fit()
ser1 = res1.scale**0.5
print(res1.summary())
print(f'\nThe standard error of the regression is {ser1:.2f}')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Emp_diff   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     3.773
Date:                Thu, 07 Mar 2024   Prob (F-statistic):             0.0529
Time:                        10:18:11   Log-Likelihood:                -1256.9
No. Observations:                 351   AIC:                             2518.
Df Residuals:                     349   BIC:                             2526.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=============================================================================================================
                                                coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------
Intercept                                    -1.8788      1.073     -1.752      0.081      -3.988       0.231
C(STATE, Treatment(reference='PA'))[T.NJ]     2.3119      1.190      1.942      0.053      -0.029       4.653
==============================================================================
Omnibus:                       39.407   Durbin-Watson:                   2.128
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              103.223
Skew:                          -0.527   Prob(JB):                     3.85e-23
Kurtosis:                       5.439   Cond. No.                         4.41
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The standard error of the regression is 8.71

Increasing the minimum wage did not cause employment to fall. That result has earned this paper some 5,000 Google Scholar citations.

My results are close, but not identical to those in the published paper. I don't have the sample selection correct. I suspect I could run the SAS code and figure it out...

Practice

  1. Try some of the other regressions in table 4. For models (iii)–(v) you will need to create the gap measure, which is defined on page 779. You likely won't get the coefficients exactly right (I haven't been able to match the sample correctly), but your numbers should be in the ballpark.
  2. Once you have estimated a couple of models, try putting the results into a table similar to table 4.

Some variables you will need:

  * CHAIN identifies the chain; it takes the values 1, 2, 3, or 4.
  * CO_OWNED is 1 if the store is company owned (rather than a franchise) and 0 otherwise.
  * WAGE_ST is the starting wage in the "before" period.
  * SOUTHJ, CENTRALJ, NORTHJ, PA1, PA2, and SHORE are region dummies; each takes the value 0 or 1.
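One way to sketch the gap measure is with np.where on a toy frame; check the exact definition on page 779 before relying on this.

```python
import numpy as np
import pandas as pd

# Toy data: an NJ store below the new minimum, one above it, and a PA store.
toy = pd.DataFrame({'STATE': ['NJ', 'NJ', 'PA'],
                    'WAGE_ST': [4.25, 5.25, 4.25]})

# gap = proportional raise needed to reach $5.05; zero for PA stores
# and for stores already at or above the new minimum.
toy['gap'] = np.where((toy['STATE'] == 'NJ') & (toy['WAGE_ST'] < 5.05),
                      (5.05 - toy['WAGE_ST']) / toy['WAGE_ST'], 0.0)
print(toy['gap'].round(3).tolist())  # [0.188, 0.0, 0.0]
```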

Model 2

res2 = smf.ols("Emp_diff ~ C(STATE, Treatment(reference='PA')) + C(CHAIN) + CO_OWNED", data=rd).fit()
ser2 = res2.scale**0.5
print(res2.summary())
print(f'\nThe standard error of the regression is {ser2:.2f}')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Emp_diff   R-squared:                       0.021
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     1.483
Date:                Thu, 07 Mar 2024   Prob (F-statistic):              0.195
Time:                        10:18:11   Log-Likelihood:                -1255.1
No. Observations:                 351   AIC:                             2522.
Df Residuals:                     345   BIC:                             2545.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=============================================================================================================
                                                coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------
Intercept                                    -1.4080      1.209     -1.164      0.245      -3.787       0.971
C(STATE, Treatment(reference='PA'))[T.NJ]     2.3194      1.197      1.938      0.053      -0.034       4.673
C(CHAIN)[T.2]                                 0.1696      1.296      0.131      0.896      -2.380       2.719
C(CHAIN)[T.3]                                -2.1435      1.321     -1.623      0.106      -4.741       0.454
C(CHAIN)[T.4]                                -0.8225      1.491     -0.552      0.581      -3.754       2.109
CO_OWNED                                      0.3514      1.098      0.320      0.749      -1.809       2.512
==============================================================================
Omnibus:                       42.947   Durbin-Watson:                   2.134
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              113.927
Skew:                          -0.573   Prob(JB):                     1.82e-25
Kurtosis:                       5.545   Cond. No.                         5.75
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The standard error of the regression is 8.72

Model 3

rd['gap'] = 0
rd.loc[ (rd['WAGE_ST']<5.05) & (rd['STATE']=='NJ'), 'gap'] = (5.05-rd['WAGE_ST'])/rd['WAGE_ST']
res3 = smf.ols("Emp_diff ~ gap", data=rd).fit()
ser3 = res3.scale**0.5
print(res3.summary())
print(f'\nThe standard error of the regression is {ser3:.2f}')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Emp_diff   R-squared:                       0.021
Model:                            OLS   Adj. R-squared:                  0.019
Method:                 Least Squares   F-statistic:                     7.638
Date:                Thu, 07 Mar 2024   Prob (F-statistic):            0.00602
Time:                        10:18:11   Log-Likelihood:                -1255.0
No. Observations:                 351   AIC:                             2514.
Df Residuals:                     349   BIC:                             2522.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.4309      0.694     -2.062      0.040      -2.796      -0.066
gap           16.8358      6.092      2.764      0.006       4.854      28.817
==============================================================================
Omnibus:                       39.631   Durbin-Watson:                   2.149
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              104.133
Skew:                          -0.529   Prob(JB):                     2.44e-23
Kurtosis:                       5.450   Cond. No.                         13.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The standard error of the regression is 8.67

Model 4

res4 = smf.ols("Emp_diff ~ gap + C(CHAIN) + CO_OWNED", data=rd).fit()
ser4 = res4.scale**0.5
print(res4.summary())
print(f'\nThe standard error of the regression is {ser4:.2f}')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Emp_diff   R-squared:                       0.029
Model:                            OLS   Adj. R-squared:                  0.015
Method:                 Least Squares   F-statistic:                     2.065
Date:                Thu, 07 Mar 2024   Prob (F-statistic):             0.0693
Time:                        10:18:11   Log-Likelihood:                -1253.6
No. Observations:                 351   AIC:                             2519.
Df Residuals:                     345   BIC:                             2542.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -1.1310      0.959     -1.180      0.239      -3.016       0.755
C(CHAIN)[T.2]     0.3326      1.287      0.258      0.796      -2.199       2.864
C(CHAIN)[T.3]    -1.7904      1.317     -1.360      0.175      -4.380       0.799
C(CHAIN)[T.4]    -0.2497      1.505     -0.166      0.868      -3.211       2.711
gap              16.0698      6.238      2.576      0.010       3.800      28.340
CO_OWNED          0.4834      1.096      0.441      0.660      -1.673       2.640
==============================================================================
Omnibus:                       42.491   Durbin-Watson:                   2.150
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              112.637
Skew:                          -0.567   Prob(JB):                     3.48e-25
Kurtosis:                       5.533   Cond. No.                         15.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The standard error of the regression is 8.68

Model 5

res5 = smf.ols("Emp_diff ~ gap + C(CHAIN) + CO_OWNED + SOUTHJ+CENTRALJ+NORTHJ+PA1+PA2+SHORE", data=rd).fit()
ser5 = res5.scale**0.5
print(res5.summary())
print(f'\nThe standard error of the regression is {ser5:.2f}')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Emp_diff   R-squared:                       0.050
Model:                            OLS   Adj. R-squared:                  0.022
Method:                 Least Squares   F-statistic:                     1.786
Date:                Thu, 07 Mar 2024   Prob (F-statistic):             0.0619
Time:                        10:18:11   Log-Likelihood:                -1249.8
No. Observations:                 351   AIC:                             2522.
Df Residuals:                     340   BIC:                             2564.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -0.8557      0.828     -1.034      0.302      -2.484       0.772
C(CHAIN)[T.2]     0.4422      1.303      0.339      0.734      -2.120       3.005
C(CHAIN)[T.3]    -1.9907      1.332     -1.495      0.136      -4.610       0.629
C(CHAIN)[T.4]     0.1452      1.520      0.096      0.924      -2.844       3.135
gap              13.0286      7.431      1.753      0.080      -1.587      27.644
CO_OWNED          0.1365      1.114      0.123      0.902      -2.054       2.327
SOUTHJ            1.0201      1.072      0.952      0.342      -1.088       3.128
CENTRALJ         -0.4718      1.168     -0.404      0.686      -2.768       1.825
NORTHJ            0.6560      0.827      0.793      0.428      -0.971       2.283
PA1              -3.1320      1.496     -2.093      0.037      -6.075      -0.189
PA2               1.0719      1.312      0.817      0.415      -1.510       3.654
SHORE            -2.6303      1.692     -1.555      0.121      -5.958       0.698
==============================================================================
Omnibus:                       43.744   Durbin-Watson:                   2.180
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              115.730
Skew:                          -0.586   Prob(JB):                     7.41e-26
Kurtosis:                       5.558   Cond. No.                     1.12e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.45e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

The standard error of the regression is 8.65

Putting the coefficients into one table

r1 = [ r.params.loc["C(STATE, Treatment(reference='PA'))[T.NJ]"] for r in [res1, res2]] + [np.nan]*3
r2 = [np.nan]*2 + [ r.params.loc["gap"] for r in [res3, res4, res5]] 
r3 = ['no', 'yes', 'no', 'yes', 'yes']
r4 = ['no']*4+['yes']
r5 = [r.scale**0.5 for r in [res1, res2, res3, res4, res5]]

res_table = pd.DataFrame([r1, r2, r3, r4, r5], 
             index=['New Jersey dummy', 'Initial wage gap', 'Controls for chain and ownership', 'Controls for region', 'Standard error of regression'],
             columns = ['(i)', '(ii)', '(iii)', '(iv)', '(v)'])

res_table.style.format(precision=2)
  (i) (ii) (iii) (iv) (v)
New Jersey dummy 2.31 2.32 nan nan nan
Initial wage gap nan nan 16.84 16.07 13.03
Controls for chain and ownership no yes no yes yes
Controls for region no no no no yes
Standard error of regression 8.71 8.72 8.67 8.68 8.65

An alternative formulation

You may be more familiar with a different (but equivalent) formulation of the regression. It uses the level of the dependent variable and an interaction of time and state dummies (a "double" fixed effect) to compute the difference. In our case, this would be:

$$\text{employment}_{it} = a + b_1 \, \text{I}_{t=\text{after}} + b_2 \, \text{I}_{\text{state}=\text{NJ}} + \beta \; \text{I}_{t=\text{after}} \times \text{I}_{\text{state}=\text{NJ}} + \epsilon_{it},$$

where \(t\) takes the values "before" and "after," since there are only two periods in our data. The difference-in-differences estimate is \(\beta\), the coefficient on the interaction term.

To implement this, we need to reshape the data into a panel, then specify the new regression.

This idea, and the solution, are the work of Haotian Luo, MS 2024, lightly edited.

# Transform the dataframe 
df = rd[['SHEET','STATE','FTE_before','FTE_after']]
df = df.set_index(['STATE','SHEET'])
df = df.stack().reset_index()
df.rename(columns = {'level_2' : 'time', 0 : 'employment'}, inplace = True) 

# Create a time variable
df['time'] = df['time'].replace({'FTE_before' : 'before', 'FTE_after':'after' })
df.head()
STATE SHEET time employment
0 PA 56 before 34.0
1 PA 56 after 20.0
2 PA 61 before 24.0
3 PA 61 after 35.5
4 PA 445 before 70.5
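An equivalent way to do the wide-to-long reshape is pd.melt, which skips the set_index/stack dance. A sketch on a toy frame with the same column names:

```python
import pandas as pd

toy = pd.DataFrame({'SHEET': [56, 61], 'STATE': ['PA', 'PA'],
                    'FTE_before': [34.0, 24.0], 'FTE_after': [20.0, 35.5]})

# melt stacks the two FTE columns into (variable, value) pairs.
long = toy.melt(id_vars=['SHEET', 'STATE'],
                value_vars=['FTE_before', 'FTE_after'],
                var_name='time', value_name='employment')
long['time'] = long['time'].str.replace('FTE_', '', regex=False)
print(long)
```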
# Regression
result = smf.ols("employment ~ C(time,Treatment(reference='before'))*C(STATE, Treatment(reference='PA'))", data=df).fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             employment   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     2.045
Date:                Thu, 07 Mar 2024   Prob (F-statistic):              0.106
Time:                        10:18:12   Log-Likelihood:                -2560.6
No. Observations:                 702   AIC:                             5129.
Df Residuals:                     698   BIC:                             5147.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=============================================================================================================================================================
                                                                                                coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Intercept                                                                                    23.7045      1.146     20.676      0.000      21.454      25.956
C(time, Treatment(reference='before'))[T.after]                                              -1.8788      1.621     -1.159      0.247      -5.062       1.305
C(STATE, Treatment(reference='PA'))[T.NJ]                                                    -3.0614      1.272     -2.406      0.016      -5.559      -0.563
C(time, Treatment(reference='before'))[T.after]:C(STATE, Treatment(reference='PA'))[T.NJ]     2.3119      1.799      1.285      0.199      -1.221       5.845
==============================================================================
Omnibus:                      218.752   Durbin-Watson:                   1.352
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              879.029
Skew:                           1.400   Prob(JB):                    1.32e-191
Kurtosis:                       7.713   Cond. No.                         11.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Notice that the coefficient on C(time, Treatment(reference='before'))[T.after]:C(STATE, Treatment(reference='PA'))[T.NJ] is 2.3119, the same as the coefficient on C(STATE, Treatment(reference='PA'))[T.NJ] in the first model. Likewise, the coefficient on C(time, Treatment(reference='before'))[T.after] (-1.8788) is the same as the intercept in the first model.
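With only two groups and two periods, the interaction coefficient is just the difference-in-differences of the four group means. You can check that by hand with the (rounded) group means implied by the regression output above:

```python
# Group means implied by the coefficients above, rounded to 2 decimals.
means = {('PA', 'before'): 23.70, ('PA', 'after'): 21.83,
         ('NJ', 'before'): 20.64, ('NJ', 'after'): 21.08}

# (change in NJ) minus (change in PA)
did = ((means[('NJ', 'after')] - means[('NJ', 'before')])
       - (means[('PA', 'after')] - means[('PA', 'before')]))
print(round(did, 2))  # 2.31, matching the interaction coefficient up to rounding
```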