Difference in differences

Let's practice regression in Python by replicating Card and Krueger (1994), "Minimum wages and employment," one of the most famous difference-in-differences studies in the literature.

The data are available from Card's website. Download the zip file and unzip it into your working directory.

The paper is also at Card's website.

There are several people on the web who have attempted this replication. I found Aaron Mamula's work useful.

Background

How does the minimum wage affect employment? Theory predicts that a binding minimum wage should decrease employment in occupations or among worker types whose equilibrium wage would be below the minimum wage. This paper looks at an increase in New Jersey's minimum wage from $4.25 to $5.05 in 1992. Pennsylvania, a neighboring state, did not change its minimum wage. So the first difference is before vs. after the 1992 increase, and the second difference is NJ vs. PA.

Card and Krueger collected data from fast-food restaurants before and after the policy change. Fast food is a good industry to study because it hires a lot of low-wage workers, including teenagers, who are the most likely to be affected by the change in policy.

import pandas as pd                    # for data handling
import numpy as np                     # for numerical methods and data structures
from pandas import option_context

import statsmodels.formula.api as smf

Loading the data

Card and Krueger were using SAS. Getting the data into a DataFrame took a bit of hassle. I dug out the variable names from the codebook and created the list below.

For practice, you could try reading in the codebook and using readline() and the like to extract the variable names and create the list.
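One possible way to approach that exercise is sketched below. It assumes each variable is described on a line that starts with an upper-case name followed by start and end column numbers. That layout, and the excerpt in the string, are hypothetical stand-ins, so check them against the real codebook before reusing the pattern.

```python
import io
import re

# Hypothetical codebook excerpt; the real file's layout may differ.
codebook = io.StringIO(
    "Column Location\n"
    "Name:        Start   End   Format   Explanation\n"
    "\n"
    "SHEET            1     3   3.0      sheet number\n"
    "CHAIN            5     5   1.0      chain code\n"
    "  some continuation text 1 2\n"
    "CO_OWNED         7     7   1.0      1 if company owned\n"
)

# A variable line: NAME at the start of the line, then two integers.
pattern = re.compile(r'^([A-Z][A-Z0-9_]*)\s+\d+\s+\d+\s')

names = []
for line in codebook:          # iterating a file object yields one line at a time
    match = pattern.match(line)
    if match:
        names.append(match.group(1))

print(names)
```

With a real file, replace the `io.StringIO(...)` object with `open(...)` on the codebook path.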

c = ['SHEET', 'CHAIN', 'CO_OWNED', 'STATE', 'SOUTHJ','CENTRALJ','NORTHJ','PA1','PA2','SHORE','NCALLS','EMPFT','EMPPT',
'NMGRS' ,'WAGE_ST','INCTIME','FIRSTINC','BONUS','PCTAFF','MEALS','OPEN','HRSOPEN','PSODA','PFRY','PENTREE','NREGS',
'NREGS11','TYPE2','STATUS2','DATE2','NCALLS2','EMPFT2','EMPPT2','NMGRS2','WAGE_ST2','INCTIME2','FIRSTIN2','SPECIAL2',
'MEALS2','OPEN2R','HRSOPEN2','PSODA2','PFRY2','PENTREE2','NREGS2','NREGS112']

The data file is in fixed-width format. I'm using pd.read_fwf() to read in the file. Pandas will guess at the layout (how many columns define a variable), and it got it right this time. If pandas cannot figure it out, you can pass it a column spec. (docs)
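If pandas ever guesses the layout wrong, the `colspecs` argument takes a list of half-open `(start, end)` character positions, one pair per column. Here is a minimal sketch on a made-up three-column snippet. The positions and values are hypothetical and are not the layout of public.dat.

```python
import io
import pandas as pd

# Made-up fixed-width rows: store id, state flag, full-time employees.
sample = (
    " 46 1 30.0\n"
    " 49 1    .\n"
    "506 0 12.5\n"
)

# Half-open character intervals [start, end) for each column.
colspecs = [(0, 3), (4, 5), (6, 10)]

demo = pd.read_fwf(io.StringIO(sample), colspecs=colspecs, header=None,
                   names=['SHEET', 'STATE', 'EMPFT'], na_values='.')
print(demo)
```

The lone `.` in the second row is read as a missing value because of `na_values='.'`, the same convention used for the real file.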

I'm only keeping the columns I need. Then I'm recoding the state variable so I don't have to remember that NJ = 1.

ck = pd.read_fwf('njmin/public.dat', header=None, na_values='.', names=c)
ck = ck[['SHEET', 'STATE', 'EMPFT', 'EMPPT', 'EMPFT2', 'EMPPT2', 'NMGRS', 'NMGRS2', 
         'STATUS2', 'WAGE_ST', 'WAGE_ST2', 'CO_OWNED', 'CHAIN', 
         'SOUTHJ', 'CENTRALJ', 'NORTHJ', 'PA1', 'PA2', 'SHORE']]
ck['STATE'] = ck['STATE'].replace({1:'NJ', 0:'PA'})

ck.head(3)

Create the full-time equivalent variable

The total employment variable is full-time employees, plus managers, plus one-half the part-time employees.

ck['FTE_before'] = ck['EMPFT']  + ck['NMGRS']  + 0.5*ck['EMPPT']
ck['FTE_after']  = ck['EMPFT2'] + ck['NMGRS2'] + 0.5*ck['EMPPT2']

CK's Table 3

Here is the northwest corner of table 3. Let's recreate it.

	PA	NJ	NJ-PA
FTE before	23.33	20.44	-2.89
	(1.35)	(0.51)	(1.44)
FTE after	21.17	21.03	-0.14
	(0.94)	(0.52)	(1.07)

# Group means and standard errors of the mean (sem) by state; rows are (variable, statistic)
t3 = ck[['STATE', 'FTE_before', 'FTE_after']].groupby('STATE').agg(['mean', 'sem']).transpose()
t3 = t3[['PA', 'NJ']]
t3['NJ-PA'] = t3['NJ'] - t3['PA']

# Fix the standard error of the mean differences
t3.loc[('FTE_before', 'sem'), 'NJ-PA'] = (t3.loc[('FTE_before', 'sem'), 'PA']**2 + t3.loc[('FTE_before', 'sem'), 'NJ']**2)**(0.5)
t3.loc[('FTE_after', 'sem'),  'NJ-PA'] = (t3.loc[('FTE_after', 'sem'),  'PA']**2 + t3.loc[('FTE_after',  'sem'), 'NJ']**2)**(0.5)

with option_context('display.precision', 2):
    display(t3)
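As a quick check on the standard-error fix, the same formula can be applied to the published (rounded) standard errors from Table 3. The formula treats the PA and NJ samples as independent, and the helper name `se_diff` is mine, not something from the paper.

```python
import math

def se_diff(se_a, se_b):
    """Standard error of the difference between two independent means."""
    return math.sqrt(se_a**2 + se_b**2)

# Published standard errors from Table 3, in the order (PA, NJ)
se_before = se_diff(1.35, 0.51)
se_after = se_diff(0.94, 0.52)

print(f"before: {se_before:.2f}, after: {se_after:.2f}")
```

Rounded to two decimals these give 1.44 and 1.07, matching the NJ-PA column of the table.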

The employment gap between NJ and PA stores actually shrinks after NJ raises its minimum wage, from -2.89 to -0.14 FTE. NJ stores are, on average, employing more people and PA stores are employing fewer.

Interesting first cut of the data. The next step is to control for some other variables and see if the result holds up.
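Before adding controls, it may help to see the difference-in-differences arithmetic spelled out. This sketch uses only the published, rounded means from Table 3, so the last digit can differ slightly from a figure computed from unrounded data.

```python
# Published (rounded) mean FTE employment from Table 3
pa_before, nj_before = 23.33, 20.44
pa_after, nj_after = 21.17, 21.03

change_nj = nj_after - nj_before   # first difference within NJ
change_pa = pa_after - pa_before   # first difference within PA
did = change_nj - change_pa        # difference in differences

print(f"NJ change: {change_nj:.2f}, PA change: {change_pa:.2f}, DiD: {did:.2f}")
```

The NJ change is about 0.59 and the PA change about -2.16, so the difference in differences is about 2.75 FTE in favor of NJ.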

CK's Table 4, model 1

Here goes the first regression.

$$\Delta E_i = a + b \text{NJ}_i+\epsilon_i$$

where $\Delta E_i$ is the change in employment (in store $i$) before and after the minimum wage change and NJ is equal to 1 if the store is in New Jersey and 0 otherwise.
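Because the only regressor is a dummy, the OLS estimates have a closed form in terms of group means. Writing $\overline{\Delta E}_{\text{PA}}$ and $\overline{\Delta E}_{\text{NJ}}$ for the average change in employment among the PA and NJ stores in the regression sample (notation introduced here, not taken from the paper):

$$\hat{a} = \overline{\Delta E}_{\text{PA}}, \qquad \hat{b} = \overline{\Delta E}_{\text{NJ}} - \overline{\Delta E}_{\text{PA}}$$

So $\hat{b}$ is the difference-in-differences estimate, computed on whichever stores survive the sample selection.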

# I'm trying to follow the sample selection given in the notes of table 4. I have 6 too few stores. Not sure why.
# Keep stores with non-missing employment before AND after, and reported wages before AND after.
rd = ck.dropna(subset=['FTE_before', 'FTE_after', 'WAGE_ST', 'WAGE_ST2']).copy()

rd['Emp_diff'] = rd['FTE_after'] - rd['FTE_before']

res1 = smf.ols("Emp_diff ~ C(STATE, Treatment(reference='PA'))", data=rd).fit()
ser1 = res1.scale**0.5
print(res1.summary())
print(f'\nThe standard error of the regression is {ser1:.2f}')

The coefficient on NJ is positive: there is no evidence that increasing the minimum wage caused employment to fall. That result earned this paper 5k Google Scholar citations.

My results are close, but not identical to those in the published paper. I don't have the sample selection correct. I suspect I could run the SAS code and figure it out...

Practice

Try some of the other regressions in table 4. For models (iii)–(v) you will need to create the gap measure. It is defined on page 779.
You likely won't get the coefficients exactly right, since I haven't been able to match the sample correctly. But your numbers should be in the ballpark.
Once you have estimated a couple models, try putting the results into a table similar to table 4.
CHAIN is the control for chain stores. It takes values 1, 2, 3, or 4.
CO_OWNED is the control for company ownership (rather than franchise). It takes 0 or 1 as values.
WAGE_ST is the wage in the "before" period.
SOUTHJ, CENTRALJ, NORTHJ, PA1, PA2, SHORE are the controls for regions. Each takes 0 or 1 as values.
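For the gap measure, here is a minimal sketch on made-up rows. It follows my reading of the paper's definition: GAP is zero for PA stores and for NJ stores whose starting wage is already at or above $5.05, and otherwise it is the proportional raise needed to reach $5.05. Check it against page 779 before relying on it. The toy DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

NEW_MIN = 5.05  # New Jersey minimum wage after the increase

# Made-up stores, using the same column names as the ck DataFrame.
toy = pd.DataFrame({
    'STATE':   ['NJ', 'NJ', 'PA', 'NJ'],
    'WAGE_ST': [4.25, 5.25, 4.25, 5.05],
})

# Only NJ stores paying below the new minimum have a positive gap.
needs_raise = (toy['STATE'] == 'NJ') & (toy['WAGE_ST'] < NEW_MIN)
toy['GAP'] = np.where(needs_raise,
                      (NEW_MIN - toy['WAGE_ST']) / toy['WAGE_ST'],
                      0.0)
print(toy)
```

Note that a missing `WAGE_ST` would get a gap of zero here, so apply this after dropping stores with missing wages, as the regression sample above already does.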

Model 2

Model 3

Model 4

Model 5

Putting the coefficients into one table

An alternative formulation

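You may be more familiar with a different (but equivalent) formulation of the regression. It uses the level of the dependent variable, with a dummy for the period, a dummy for the state, and their interaction (the "double" fixed effect) to compute the difference. In our case, this would be:

$$\text{employment}_{it} = a + \gamma \; \text{I}_{t=\text{after}} + \delta \; \text{I}_{\text{state}=\text{NJ}} + \beta \; \text{I}_{t=\text{after}} \times \text{I}_{\text{state}=\text{NJ}} + \epsilon_{it},$$

where $t \in \{\text{before}, \text{after}\}$, since there are only two periods in our data, and $\beta$ is the difference-in-differences coefficient.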

To implement this, we need to reshape the data into a panel, then specify the new regression.

This idea, and the solution, is the work of Haotian Luo, MS 2024, lightly edited.

# Transform the dataframe 
df = rd[['SHEET','STATE','FTE_before','FTE_after']]
df = df.set_index(['STATE','SHEET'])
df = df.stack().reset_index()
df.rename(columns = {'level_2' : 'time', 0 : 'employment'}, inplace = True) 

# Create a time variable
df['time'] = df['time'].replace({'FTE_before' : 'before', 'FTE_after':'after' })
df.head()

# Regression
result = smf.ols("employment ~ C(time,Treatment(reference='before'))*C(STATE, Treatment(reference='PA'))", data=df).fit()
print(result.summary())

Notice that the coefficient for C(time, Treatment(reference='before'))[T.after]:C(STATE, Treatment(reference='PA'))[T.NJ] is 2.3119, the same as the coefficient for C(STATE, Treatment(reference='PA'))[T.NJ] in the first model. The fixed effect for C(time, Treatment(reference='before'))[T.after] is the same as the intercept in the first model.
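This equivalence does not depend on the Card and Krueger data. The sketch below checks it on synthetic numbers with plain NumPy least squares instead of statsmodels. The variable names and the data-generating process are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
nj = rng.integers(0, 2, size=n)                   # 1 = "NJ", 0 = "PA" (synthetic)
before = 20 - 2 * nj + 3 * rng.standard_normal(n)
after = before - 2 + 2.5 * nj + rng.standard_normal(n)

# Formulation 1: regress the change on a constant and the state dummy.
X_diff = np.column_stack([np.ones(n), nj])
a_hat, b_hat = np.linalg.lstsq(X_diff, after - before, rcond=None)[0]

# Formulation 2: stack the two periods and interact time with state.
y = np.concatenate([before, after])
t = np.concatenate([np.zeros(n), np.ones(n)])     # 1 = after
s = np.concatenate([nj, nj])
X_panel = np.column_stack([np.ones(2 * n), t, s, t * s])
coef = np.linalg.lstsq(X_panel, y, rcond=None)[0]

print(f"differenced: intercept={a_hat:.4f}, NJ={b_hat:.4f}")
print(f"panel:       after={coef[1]:.4f}, after x NJ={coef[3]:.4f}")
```

The two lines print matching pairs of numbers. The match relies on every store appearing in both periods (a balanced panel), which the sample selection above guarantees.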