
Linear regression

files needed = ('sleep75.dta', 'wage1.dta')

This notebook introduces us to the statsmodels package (docs), which provides functions for formulating and estimating statistical models. This notebook will not address the models, per se, but will focus on how to put econometrics to work in python.

Many of you have used STATA before. STATA is a great package for econometrics. Python can do most of what STATA can do, but STATA will have more specialized routines available. As python's popularity grows, the kinds of models you can estimate in it grow, too.

If STATA is your thing, this page on Rob Hicks' website is a nice STATA to python concordance.

import pandas as pd                    # for data handling
import numpy as np                     # for numerical methods and data structures
import matplotlib.pyplot as plt        # for plotting


# The new packages...
import patsy                           # provides a syntax for specifying models  
import statsmodels.api as sm           # provides statistical models like ols, gmm, anova, etc...
import statsmodels.formula.api as smf  # provides a way to directly spec models from formulas

Reading Stata data files

Jeff Wooldridge's econometrics textbooks are an academic staple. Our plan today is to work through some of the problems in the Wooldridge textbook as a way to introduce regression in python.

On the plus side, the data that correspond to the Wooldridge problems are available to download and they are ALREADY CLEANED. [I contemplated adding some junk to the files to make it more interesting...]

On the minus side, the files are in STATA's .dta format.

Lucky for us, pandas has a method that reads stata files. It also has methods for SQL, SAS, JSON,...

# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
sleep = pd.read_stata('sleep75.dta')

# Take a look...so clean!
sleep.head()
sleep.info()

Directly specifying and estimating models with the formula.api

The statsmodels package provides a formula syntax for defining models as strings. The basic syntax is

y ~ x1 + x2

which describes the model

\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon\).

Notice that I did not specify the constant. Statsmodels takes care of that automatically.
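If you ever want to leave the constant out, patsy's formula language can drop it:

y ~ x1 + x2 - 1

estimates the model without \(\beta_0\) (appending + 0 does the same).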

The workflow is:

  1. Specify the regression: sort out the dependent and independent variables
  2. Create the model with statsmodels
  3. Fit the model and obtain results

To do this, we use the statsmodels.formula.api methods, which we imported as smf.

1. Specify the regression

How do hours of sleep vary with working? Do we trade off sleep for work? We control for education and age.

$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \epsilon. $$

[This is problem 3 in chapter 3 of Wooldridge.]

2. Create the model

Using the statsmodel syntax, we have

sleep ~ totwrk + educ + age

Remember, the constant is automatically added.

We use the ols() method from the formula API (smf). This is the ordinary least squares model.

sleep_model = smf.ols('sleep ~ totwrk + educ + age', data=sleep)
type(sleep_model)

The model object contains information about the regression model. Things like:

  • sleep_model.exog_names
  • sleep_model.endog_names
  • sleep_model.nobs

Check the documentation or try sleep_model. and then TAB.

sleep_model.exog_names
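The other attributes listed above work the same way. For example:

print(sleep_model.endog_names)   # the dependent (left-hand-side) variable
print(sleep_model.nobs)          # the number of observations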

3. Estimate the model

Step #2 set up the model, but did not estimate the coefficients. To estimate the model, we use the .fit() method of the OLS model object.

results = sleep_model.fit()
type(results)

Another object! This time, a RegressionResultsWrapper. This object holds all the, well, results. Try results. and TAB again to see what lives in there.

print(results.summary())

The more you work, the less you sleep. I feel better already.

We can retrieve individual results from the RegressionResultsWrapper object.

print('The parameters are: \n', results.params, '\n')
print('The confidence intervals are:\n', results.conf_int(), '\n')
print('The r-squared is:', results.rsquared)
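Fitted and predicted values live there, too. A minimal sketch; the new-data values below are made up for illustration:

# Fitted values for the estimation sample.
print(results.fittedvalues.head())

# Predicted sleep for a hypothetical person (made-up values).
new_obs = pd.DataFrame({'totwrk': [2000], 'educ': [12], 'age': [30]})
print(results.predict(new_obs))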

Doing it all at once

We can chain together the three steps into one line of code.

res = smf.ols('sleep ~ totwrk + educ + age', data=sleep).fit()
print(res.summary())

Transforming data

logs

Logs get used a lot in econometrics. Regressing the log of one variable on the log of another means the estimated coefficient is interpreted as an elasticity. When only one side is logged, as in the model below, the coefficient is a semi-elasticity.

We use the np.log() function directly in the regression specification syntax. Note that we loaded the numpy package above as np. Let's modify our model to use the logarithm of age:

$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 \log(age) + \epsilon. $$

The notation is

sleep ~ totwrk + educ + np.log(age)
res = smf.ols('sleep ~ totwrk + educ + np.log(age)', data=sleep).fit()
print(res.summary())

Practice

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.

Wooldridge problem C2 in chapter 6.

  1. Load wage1.dta
  2. Scatter plot log(wage) against educ.

  3. Estimate

    $$ \log(wage) = \beta_0 + \beta_1 educ + \epsilon$$

Name your results object prac_results.
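If you want the notebook to run straight through, here is one possible solution sketch (the cells below assume wage1 and prac_results exist):

# 1. Load the data.
wage1 = pd.read_stata('wage1.dta')

# 2. Scatter plot log(wage) against educ.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(wage1['educ'], np.log(wage1['wage']), alpha=0.5)
ax.set_xlabel('education', fontsize=14)
ax.set_ylabel('log wage', fontsize=14)

# 3. Estimate the model and store the results.
prac_results = smf.ols('np.log(wage) ~ educ', data=wage1).fit()
print(prac_results.summary())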

More data transformations

Statsmodels can handle common (and less common) regression tasks. Here are a few more. You can always check the documentation for more!

Interacting variables

Use * to interact two variables. Patsy will also include the two variables individually. Let's interact education and age. The model is now

$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \beta_4 age\times educ + \epsilon. $$

which is

'sleep ~ totwrk + educ*age'

Notice that I did not include education and age separately.

res = smf.ols('sleep ~ totwrk + educ*age', data=sleep).fit()
print(res.summary())
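To see exactly what * expands to, we can ask patsy (imported above) for the design matrix columns; a quick check:

# 'educ*age' expands to educ + age + educ:age, where ':' marks the interaction.
dm = patsy.dmatrix('totwrk + educ*age', data=sleep)
print(dm.design_info.column_names)

If you want only the interaction term without the main effects, use educ:age instead.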

Fixed effects

When I was a kid, we called these dummy variables. Gender is coded {0,1} in the variable 'male'. Statsmodels uses the syntax C(), where the C is a mnemonic for 'categorical.'

# Take a peek at 'male'. It's zeros and ones. 
sleep['male'].head()

By adding a fixed effect for gender, the model is

$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \beta_4 I(\text{male}=1) + \epsilon. $$

where \(I(\text{male}=1)\) is an indicator function that is equal to one if 'male' is equal to one. The model syntax is

sleep ~ totwrk + educ + age + C(male)
res = smf.ols('sleep ~ totwrk + educ + age + C(male)', data=sleep).fit()
print(res.summary())

The 'T.1.0' notation is a bit confusing in this context. It labels the coefficient on the 'male'=1.0 category. Since we have included a constant, one of the categories gets dropped; in this case, the intercept corresponds to the 'male'=0 case.

To see things more clearly, let's recode the male variable.

sleep['gender'] = sleep['male'].replace({1.0:'male', 0.0:'female'})

res = smf.ols('sleep ~ totwrk + educ + age + C(gender)', data=sleep).fit()
print(res.summary())
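Patsy can also choose the omitted category for us, so the recoding step is optional. A sketch using Treatment coding to make 'male' the reference group instead:

# Treatment(reference=...) sets the category the intercept absorbs.
res = smf.ols('sleep ~ totwrk + educ + age + C(gender, Treatment(reference="male"))',
              data=sleep).fit()
print(res.params)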

Practice

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.

We will continue with the example from before (Wooldridge problem C2 in chapter 6).

  1. Scatter plot the residuals (y axis) against education (x axis). The residuals are in the results object from the fit method.

In part 3 of the earlier exercise, I used

prac_results = smf.ols('np.log(wage) ~ educ', data=wage1).fit()

so the residuals are in prac_results.resid. Try inspecting prac_results.resid first. What kind of object is it?
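A minimal sketch of the plot, assuming wage1 and prac_results are defined as above:

# The residuals come back as a pandas Series aligned with wage1's index.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(wage1['educ'], prac_results.resid, alpha=0.5)
ax.axhline(0, color='black', linewidth=1)
ax.set_xlabel('education', fontsize=14)
ax.set_ylabel('residual', fontsize=14)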

  2. Looks heteroskedastic. We can change the covariance matrix type (which will correct the standard error calculation) using the cov_type parameter (docs). The types of covariance matrices are described in the docs.

Try 'HC3' for your covariance matrix type.

Note: If you have not had econometrics yet, this probably doesn't make much sense to you. That's okay—just skip it.
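For those trying it, a minimal sketch; we just pass cov_type to fit():

# Refit with heteroskedasticity-robust (HC3) standard errors.
prac_hc3 = smf.ols('np.log(wage) ~ educ', data=wage1).fit(cov_type='HC3')
print(prac_hc3.summary())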

  3. Let's add some more regressors. Estimate
    $$ \log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4I_m + \epsilon$$

where \(I_m\) is an indicator variable equal to 1 if the worker is married.
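One possible sketch, assuming the wage1 columns exper and married (as in Wooldridge's data); patsy's I() lets us square a variable inside the formula:

# I(exper**2) creates the squared experience term inside the formula.
res3 = smf.ols('np.log(wage) ~ educ + exper + I(exper**2) + married', data=wage1).fit()
print(res3.summary())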

Plotting the regression line

Recall that seaborn gives us an easy way to draw a regplot, but it will not give us the coefficients of the regression line. Here, we explore how to compute the regression and plot the line ourselves. The pros: we know exactly what regression we are using, and we can control it. The cons: it's more coding. This is the usual tradeoff of doing more work to ensure that we understand exactly what is being done.

In a two-variable regression (or in a scatter plot) we can plot the regression line. We use the data and the estimated coefficients, which live in the results object's params attribute.

If you finish early, try these next two problems on your own (don't look at my code). We will go through them in class together.

  1. Scatter plot the data (this is the same as part 2) and add the regression line for the model
$$ \log(wage) = \beta_0 + \beta_1 educ + \epsilon. $$

Again, this is the regression from part 3. You might still have the results in prac_results.

To plot the regression line you will need to create some x data and then apply the parameters. I used something like this

y = [p.Intercept + p.educ*i for i in x]

where p holds the parameters from my results and x holds a few x data points.

# Just in case we changed some stuff above, re-run the regression.
prac_results = smf.ols('np.log(wage) ~ educ', data=wage1).fit()
# Create the line of best fit to plot
p = prac_results.params                                # params from the model fit
print(p)
print(type(p))
# I'm drawing a line, so I only need two points. 
x = [wage1['educ'].min(), wage1['educ'].max()]

y = [p.Intercept + p.educ*i for i in x]    
fig, ax = plt.subplots(figsize=(10,6))

# Plot the data
ax.scatter(wage1['educ'], wage1['lwage'], marker='o', alpha = 0.5 )

# Plot the regression line.
ax.plot(x,y, color='black')

ax.set_xlabel('education', fontsize=14)
ax.set_ylabel('log wage', fontsize=14)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
  2. Add the equation for the regression line to the figure.
fig, ax = plt.subplots(figsize=(10,6))

# Plot the data
ax.scatter(wage1['educ'], wage1['lwage'], marker='o', alpha = 0.5 )
# Plot the regression line.
ax.plot(x,y, color='black')

# build the string
text = 'regression line: \nlog(w) = {0:.2} + {1:.2} educ'.format(p.Intercept, p.educ)

ax.text(0.5, 2.5, text, fontsize=14)

ax.set_xlabel('education', fontsize=14)
ax.set_ylabel('log wage', fontsize=14)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)