Linear regression
files needed = ('sleep75.dta', 'wage1.dta')
This notebook introduces us to the statsmodels package (docs), which provides functions for formulating and estimating statistical models. This notebook will not address the models, per se, but will focus on how to put econometrics to work in python.
Many of you have used STATA before. STATA is a great package for econometrics. Python can do most of what STATA can do, but STATA will have more specialized routines available. As python's popularity grows, the kinds of models you can estimate in python grow, too.
If STATA is your thing, this page on Rob Hicks' website is a nice STATA to python concordance.
import pandas as pd # for data handling
import numpy as np # for numerical methods and data structures
import matplotlib.pyplot as plt # for plotting
# The new packages...
import patsy # provides a syntax for specifying models
import statsmodels.api as sm # provides statistical models like ols, gmm, anova, etc...
import statsmodels.formula.api as smf # provides a way to directly spec models from formulas
Reading Stata data files
Jeff Wooldridge's econometrics textbooks are an academic staple. Our plan today is to work through some of the problems in the Wooldridge textbook as a way to introduce regression in python.
On the plus side, the data that correspond to the Wooldridge problems are available to download and they are ALREADY CLEANED. [I contemplated adding some junk to the files to make it more interesting...]
On the minus side, the files are in STATA's .dta format.
Lucky for us, pandas has a method that reads stata files. It also has methods for SQL, SAS, JSON,...
# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
sleep = pd.read_stata('sleep75.dta')
# Take a look...so clean!
sleep.head()
age | black | case | clerical | construc | educ | earns74 | gdhlth | inlf | leis1 | ... | spwrk75 | totwrk | union | worknrm | workscnd | exper | yngkid | yrsmarr | hrwage | agesq | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 32.0 | 0.0 | 1.0 | 0.0 | 0.0 | 12.0 | 0.0 | 0.0 | 1.0 | 3529.0 | ... | 0.0 | 3438.0 | 0.0 | 3438.0 | 0.0 | 14.0 | 0.0 | 13.0 | 7.070004 | 1024.0 |
1 | 31.0 | 0.0 | 2.0 | 0.0 | 0.0 | 14.0 | 9500.0 | 1.0 | 1.0 | 2140.0 | ... | 0.0 | 5020.0 | 0.0 | 5020.0 | 0.0 | 11.0 | 0.0 | 0.0 | 1.429999 | 961.0 |
2 | 44.0 | 0.0 | 3.0 | 0.0 | 0.0 | 17.0 | 42500.0 | 1.0 | 1.0 | 4595.0 | ... | 1.0 | 2815.0 | 0.0 | 2815.0 | 0.0 | 21.0 | 0.0 | 0.0 | 20.530001 | 1936.0 |
3 | 30.0 | 0.0 | 4.0 | 0.0 | 0.0 | 12.0 | 42500.0 | 1.0 | 1.0 | 3211.0 | ... | 1.0 | 3786.0 | 0.0 | 3786.0 | 0.0 | 12.0 | 0.0 | 12.0 | 9.619998 | 900.0 |
4 | 64.0 | 0.0 | 5.0 | 0.0 | 0.0 | 14.0 | 2500.0 | 1.0 | 1.0 | 4052.0 | ... | 1.0 | 2580.0 | 0.0 | 2580.0 | 0.0 | 44.0 | 0.0 | 33.0 | 2.750000 | 4096.0 |
5 rows × 34 columns
sleep.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 706 entries, 0 to 705
Data columns (total 34 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 706 non-null float32
1 black 706 non-null float32
2 case 706 non-null float32
3 clerical 706 non-null float32
4 construc 706 non-null float32
5 educ 706 non-null float32
6 earns74 706 non-null float32
7 gdhlth 706 non-null float32
8 inlf 706 non-null float32
9 leis1 706 non-null float32
10 leis2 706 non-null float32
11 leis3 706 non-null float32
12 smsa 706 non-null float32
13 lhrwage 532 non-null float32
14 lothinc 706 non-null float32
15 male 706 non-null float32
16 marr 706 non-null float32
17 prot 706 non-null float32
18 rlxall 706 non-null float32
19 selfe 706 non-null float32
20 sleep 706 non-null float32
21 slpnaps 706 non-null float32
22 south 706 non-null float32
23 spsepay 706 non-null float32
24 spwrk75 706 non-null float32
25 totwrk 706 non-null float32
26 union 706 non-null float32
27 worknrm 706 non-null float32
28 workscnd 706 non-null float32
29 exper 706 non-null float32
30 yngkid 706 non-null float32
31 yrsmarr 706 non-null float32
32 hrwage 532 non-null float32
33 agesq 706 non-null float32
dtypes: float32(34)
memory usage: 99.3 KB
Directly specifying and estimating models with the formula.api
The statsmodels package provides us with a formulaic syntax for defining models that uses strings. The basic syntax is
y ~ x1 + x2
which describes the model
\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon\).
Notice that I did not specify the constant. Statsmodels takes care of that automatically.
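As an aside, we can peek at what the formula machinery builds. Patsy (imported above, but not used directly so far) turns a formula string and a DataFrame into the response and the design matrix, and it is patsy that slips in the constant. A quick sketch using the sleep data loaded above:
# Build the design matrix from a formula. Note the 'Intercept' column of ones:
# that is the automatically added constant.
y, X = patsy.dmatrices('sleep ~ totwrk + educ', data=sleep, return_type='dataframe')
X.head()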
The work flow is:
1. Specify the regression: sort out the dependent and independent variables
2. Create the model with statsmodels
3. Fit the model and obtain results
To do this, we use the statsmodels.formula.api methods, which we imported as smf.
1. Specify the regression
How do hours of sleep vary with working? Do we trade off sleep for work? We control for education and age.
[This is problem 3 in chapter 3 of Wooldridge.]
2. Create the model
Using the statsmodel syntax, we have
sleep ~ totwrk + educ + age
Remember, the constant is automatically added.
We use the .ols() method of statsmodels. This is the ordinary least squares model.
sleep_model = smf.ols('sleep ~ totwrk + educ + age', data=sleep)
type(sleep_model)
statsmodels.regression.linear_model.OLS
The model object contains information about the regression model. Things like:
- sleep_model.exog_names
- sleep_model.endog_names
- sleep_model.nobs
Check the documentation, or try sleep_model. and then TAB.
sleep_model.exog_names
['Intercept', 'totwrk', 'educ', 'age']
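The other attributes listed above work the same way. For example:
# The dependent variable name and the number of observations.
print(sleep_model.endog_names)
print(sleep_model.nobs)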
3. Estimate the model
Step #2 set up the model, but did not estimate the coefficients. To estimate the model, we use the .fit() method of the OLS model object.
results = sleep_model.fit()
type(results)
statsmodels.regression.linear_model.RegressionResultsWrapper
Another object! This time, a RegressionResultsWrapper. This object holds all the, well, results. Try results. and TAB again to see what lives in there.
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.113
Model: OLS Adj. R-squared: 0.110
Method: Least Squares F-statistic: 29.92
Date: Sun, 27 Nov 2022 Prob (F-statistic): 3.28e-18
Time: 14:41:36 Log-Likelihood: -5263.1
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 702 BIC: 1.055e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3638.2453 112.275 32.405 0.000 3417.810 3858.681
totwrk -0.1484 0.017 -8.888 0.000 -0.181 -0.116
educ -11.1338 5.885 -1.892 0.059 -22.687 0.420
age 2.1999 1.446 1.522 0.129 -0.639 5.038
==============================================================================
Omnibus: 68.731 Durbin-Watson: 1.943
Prob(Omnibus): 0.000 Jarque-Bera (JB): 185.551
Skew: -0.496 Prob(JB): 5.11e-41
Kurtosis: 5.308 Cond. No. 1.66e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
The more you work, the less you sleep. I feel better already.
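For a sense of magnitude: the totwrk coefficient of about \(-0.148\) means that an extra 60 minutes of work is associated with roughly \(60 \times 0.148 \approx 9\) fewer minutes of sleep, holding education and age fixed.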
We can retrieve individual results from the RegressionResultsWrapper object.
print('The parameters are: \n', results.params, '\n')
print('The confidence intervals are:\n', results.conf_int(), '\n')
print('The r-squared is:', results.rsquared)
The parameters are:
Intercept 3638.245312
totwrk -0.148373
educ -11.133813
age 2.199885
dtype: float64
The confidence intervals are:
0 1
Intercept 3417.810077 3858.680546
totwrk -0.181149 -0.115598
educ -22.687288 0.419661
age -0.638561 5.038331
The r-squared is: 0.11336395596679805
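Since results.params is a pandas Series (and results.conf_int() returns a DataFrame), individual pieces can be pulled out by name. For example:
# Grab a single coefficient and its confidence interval by name.
print(results.params['totwrk'])
print(results.conf_int().loc['totwrk'])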
Doing it all at once
We can chain together the three steps into one line of code.
res = smf.ols('sleep ~ totwrk + educ + age', data=sleep).fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.113
Model: OLS Adj. R-squared: 0.110
Method: Least Squares F-statistic: 29.92
Date: Sun, 27 Nov 2022 Prob (F-statistic): 3.28e-18
Time: 14:41:36 Log-Likelihood: -5263.1
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 702 BIC: 1.055e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3638.2453 112.275 32.405 0.000 3417.810 3858.681
totwrk -0.1484 0.017 -8.888 0.000 -0.181 -0.116
educ -11.1338 5.885 -1.892 0.059 -22.687 0.420
age 2.1999 1.446 1.522 0.129 -0.639 5.038
==============================================================================
Omnibus: 68.731 Durbin-Watson: 1.943
Prob(Omnibus): 0.000 Jarque-Bera (JB): 185.551
Skew: -0.496 Prob(JB): 5.11e-41
Kurtosis: 5.308 Cond. No. 1.66e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Transforming data
logs
Logs get used a lot in econometrics. Regressing the log of a variable on the log of another variable means that the estimated coefficient is interpreted as an elasticity.
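To see why, write the log-log model as
$$ \log y = \beta_0 + \beta_1 \log x + \epsilon $$
so that \(\beta_1 = d\log y/d\log x \approx \%\Delta y / \%\Delta x\), the elasticity of \(y\) with respect to \(x\).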
We use the np.log() method directly in the regression specification syntax. Note that we loaded the numpy package above as np. Let's modify our model and use the logarithm of age.
The notation is
sleep ~ totwrk + educ + np.log(age)
res = smf.ols('sleep ~ totwrk + educ + np.log(age)', data=sleep).fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.113
Model: OLS Adj. R-squared: 0.109
Method: Least Squares F-statistic: 29.79
Date: Sun, 27 Nov 2022 Prob (F-statistic): 3.91e-18
Time: 14:41:36 Log-Likelihood: -5263.3
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 702 BIC: 1.055e+04
Df Model: 3
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 3440.9308 239.448 14.370 0.000 2970.811 3911.050
totwrk -0.1489 0.017 -8.924 0.000 -0.182 -0.116
educ -11.3493 5.881 -1.930 0.054 -22.896 0.197
np.log(age) 79.2414 56.612 1.400 0.162 -31.908 190.391
==============================================================================
Omnibus: 68.734 Durbin-Watson: 1.944
Prob(Omnibus): 0.000 Jarque-Bera (JB): 184.854
Skew: -0.497 Prob(JB): 7.24e-41
Kurtosis: 5.301 Cond. No. 3.61e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.61e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Practice
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.
Wooldridge problem C2 in chapter 6.
- Load wage1.dta
- Scatter plot log(wage) against educ.
wage1 = pd.read_stata('wage1.dta')
wage1.head()
wage | educ | exper | tenure | nonwhite | female | married | numdep | smsa | northcen | ... | trcommpu | trade | services | profserv | profocc | clerocc | servocc | lwage | expersq | tenursq | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.10 | 11.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.131402 | 4.0 | 0.0 |
1 | 3.24 | 12.0 | 22.0 | 2.0 | 0.0 | 1.0 | 1.0 | 3.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.175573 | 484.0 | 4.0 |
2 | 3.00 | 11.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.098612 | 4.0 | 0.0 |
3 | 6.00 | 8.0 | 44.0 | 28.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.791759 | 1936.0 | 784.0 |
4 | 5.30 | 12.0 | 7.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.667707 | 49.0 | 4.0 |
5 rows × 24 columns
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(wage1['educ'], wage1['lwage'], marker='o', alpha = 0.5 )
ax.set_xlabel('education', fontsize=16)
ax.set_ylabel('log wage', fontsize=16)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
- Estimate
$$ \log(wage) = \beta_0 + \beta_1 educ + \epsilon$$
Name your results object prac_results.
prac_results = smf.ols('np.log(wage) ~ educ', data=wage1).fit()
print(prac_results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(wage) R-squared: 0.186
Model: OLS Adj. R-squared: 0.184
Method: Least Squares F-statistic: 119.6
Date: Sun, 27 Nov 2022 Prob (F-statistic): 3.27e-25
Time: 14:41:36 Log-Likelihood: -359.38
No. Observations: 526 AIC: 722.8
Df Residuals: 524 BIC: 731.3
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5838 0.097 5.998 0.000 0.393 0.775
educ 0.0827 0.008 10.935 0.000 0.068 0.098
==============================================================================
Omnibus: 11.804 Durbin-Watson: 1.801
Prob(Omnibus): 0.003 Jarque-Bera (JB): 13.811
Skew: 0.268 Prob(JB): 0.00100
Kurtosis: 3.586 Cond. No. 60.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
More data transformations
Statsmodels can handle common (and less common) regression tasks. Here are a few more. You can always check the documentation for more!
Interacting variables
Use * to interact two variables. Patsy will also include the two variables individually. Let's interact education and age. The model is now
\(sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \beta_4 (educ \times age) + \epsilon\)
which is
'sleep ~ totwrk + educ*age'
Notice that I did not include education and age separately.
res = smf.ols('sleep ~ totwrk + educ*age', data=sleep).fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.119
Model: OLS Adj. R-squared: 0.114
Method: Least Squares F-statistic: 23.67
Date: Sun, 27 Nov 2022 Prob (F-statistic): 2.24e-18
Time: 14:41:36 Log-Likelihood: -5260.9
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 701 BIC: 1.055e+04
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3081.4242 286.426 10.758 0.000 2519.069 3643.780
totwrk -0.1444 0.017 -8.618 0.000 -0.177 -0.112
educ 31.5246 21.032 1.499 0.134 -9.769 72.818
age 14.9293 6.197 2.409 0.016 2.763 27.096
educ:age -1.0065 0.477 -2.112 0.035 -1.942 -0.071
==============================================================================
Omnibus: 66.086 Durbin-Watson: 1.933
Prob(Omnibus): 0.000 Jarque-Bera (JB): 178.298
Skew: -0.475 Prob(JB): 1.92e-39
Kurtosis: 5.271 Cond. No. 4.32e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.32e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
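With the interaction term, the effect of education now depends on age: \(\partial \, sleep/\partial \, educ = \beta_{educ} + \beta_{educ:age} \, age\). A quick way to see the implied effect at the average age (my own addition, reusing the params attribute):
# The implied effect of one more year of education, evaluated at the mean age.
b = res.params
age_bar = sleep['age'].mean()
print(b['educ'] + b['educ:age']*age_bar)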
Fixed effects
When I was a kid, we called these dummy variables. Gender is coded {0,1} in the variable 'male'. Statsmodels uses the syntax C(), where the C is a mnemonic for 'categorical.'
# Take a peek at 'male'. It's zeros and ones.
sleep['male'].head()
0 1.0
1 1.0
2 1.0
3 0.0
4 1.0
Name: male, dtype: float32
By adding a fixed effect for gender, the model is
\(sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \beta_4 I(\text{male}=1) + \epsilon\)
where \(I(\text{male}=1)\) is an indicator function that is equal to one if 'male' is equal to one. The model syntax is
sleep ~ totwrk + educ + age + C(male)
res = smf.ols('sleep ~ totwrk + educ + age + C(male)', data=sleep).fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.122
Model: OLS Adj. R-squared: 0.117
Method: Least Squares F-statistic: 24.26
Date: Sun, 27 Nov 2022 Prob (F-statistic): 8.02e-19
Time: 14:41:36 Log-Likelihood: -5259.8
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 701 BIC: 1.055e+04
Df Model: 4
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 3642.4666 111.844 32.567 0.000 3422.877 3862.056
C(male)[T.1.0] 87.9933 34.323 2.564 0.011 20.604 155.382
totwrk -0.1658 0.018 -9.230 0.000 -0.201 -0.131
educ -11.7561 5.866 -2.004 0.045 -23.274 -0.238
age 1.9643 1.443 1.361 0.174 -0.869 4.797
==============================================================================
Omnibus: 65.308 Durbin-Watson: 1.942
Prob(Omnibus): 0.000 Jarque-Bera (JB): 174.107
Skew: -0.473 Prob(JB): 1.56e-38
Kurtosis: 5.241 Cond. No. 1.66e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
The 'T.1.0' notation is a bit confusing in this context. It means the reported coefficient is for the 'male'=1.0 category. Since we have included a constant, one of the categories gets dropped. In this case, the intercept covers the 'male'=0 case.
To see things more clearly, let's recode the male variable.
sleep['gender'] = sleep['male'].replace({1.0:'male', 0.0:'female'})
res = smf.ols('sleep ~ totwrk + educ + age + C(gender)', data=sleep).fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.122
Model: OLS Adj. R-squared: 0.117
Method: Least Squares F-statistic: 24.26
Date: Sun, 27 Nov 2022 Prob (F-statistic): 8.02e-19
Time: 14:41:36 Log-Likelihood: -5259.8
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 701 BIC: 1.055e+04
Df Model: 4
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 3642.4666 111.844 32.567 0.000 3422.877 3862.056
C(gender)[T.male] 87.9933 34.323 2.564 0.011 20.604 155.382
totwrk -0.1658 0.018 -9.230 0.000 -0.201 -0.131
educ -11.7561 5.866 -2.004 0.045 -23.274 -0.238
age 1.9643 1.443 1.361 0.174 -0.869 4.797
==============================================================================
Omnibus: 65.308 Durbin-Watson: 1.942
Prob(Omnibus): 0.000 Jarque-Bera (JB): 174.107
Skew: -0.473 Prob(JB): 1.56e-38
Kurtosis: 5.241 Cond. No. 1.66e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
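As an aside: rather than recoding, patsy can be told which category to treat as the baseline using its Treatment coding. A minimal sketch (not run above):
# Make 'male' the omitted (baseline) category instead of 'female'.
res_alt = smf.ols('sleep ~ totwrk + educ + age + C(gender, Treatment(reference="male"))', data=sleep).fit()
print(res_alt.params)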
Practice
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.
We will continue with the example from before (Wooldridge problem C2 in chapter 6).
- Scatter plot the residuals (y axis) against education (x axis). The residuals are in the results object from the fit method.
In part 3 of the earlier exercise, I used
prac_results = smf.ols('np.log(wage) ~ educ', data=wage1).fit()
so the residuals are in prac_results.resid. Try inspecting prac_results.resid first. What kind of object is it?
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(wage1['educ'], prac_results.resid, marker='o', alpha = 0.5 )
ax.set_xlabel('education', fontsize=16)
ax.set_ylabel('residual', fontsize=16)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
- Looks heteroskedastic. We can change the covariance matrix type (which will correct the standard error calculation) using the cov_type parameter (docs). The types of covariance matrices are described in the docs.
Try 'HC3' for your covariance matrix type.
Note: If you have not had econometrics yet, this probably doesn't make much sense to you. That's okay—just skip it.
prac_results_h3 = smf.ols('np.log(wage) ~ educ', data=wage1).fit(cov_type='HC3')
print(prac_results_h3.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(wage) R-squared: 0.186
Model: OLS Adj. R-squared: 0.184
Method: Least Squares F-statistic: 111.7
Date: Sun, 27 Nov 2022 Prob (F-statistic): 8.48e-24
Time: 14:41:37 Log-Likelihood: -359.38
No. Observations: 526 AIC: 722.8
Df Residuals: 524 BIC: 731.3
Df Model: 1
Covariance Type: HC3
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5838 0.099 5.872 0.000 0.389 0.779
educ 0.0827 0.008 10.569 0.000 0.067 0.098
==============================================================================
Omnibus: 11.804 Durbin-Watson: 1.801
Prob(Omnibus): 0.003 Jarque-Bera (JB): 13.811
Skew: 0.268 Prob(JB): 0.00100
Kurtosis: 3.586 Cond. No. 60.2
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
# How does this differ from the estimation with non-robust (weak?) standard errors?
print(prac_results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(wage) R-squared: 0.186
Model: OLS Adj. R-squared: 0.184
Method: Least Squares F-statistic: 119.6
Date: Sun, 27 Nov 2022 Prob (F-statistic): 3.27e-25
Time: 14:41:37 Log-Likelihood: -359.38
No. Observations: 526 AIC: 722.8
Df Residuals: 524 BIC: 731.3
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5838 0.097 5.998 0.000 0.393 0.775
educ 0.0827 0.008 10.935 0.000 0.068 0.098
==============================================================================
Omnibus: 11.804 Durbin-Watson: 1.801
Prob(Omnibus): 0.003 Jarque-Bera (JB): 13.811
Skew: 0.268 Prob(JB): 0.00100
Kurtosis: 3.586 Cond. No. 60.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
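To see the difference directly, we can line up the two sets of standard errors (they live in the bse attribute of each results object):
# Nonrobust vs. HC3 standard errors for the same point estimates.
se_compare = pd.DataFrame({'nonrobust': prac_results.bse, 'HC3': prac_results_h3.bse})
print(se_compare)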
- Let's add some more regressors. Estimate
$$ \log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4I_m + \epsilon$$
where \(I_m\) is a variable equal to 1 if the worker is married.
prac_results_extended = smf.ols('np.log(wage) ~ educ + exper + np.power(exper,2) + C(married)', data=wage1).fit(cov_type='HC3')
print(prac_results_extended.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(wage) R-squared: 0.310
Model: OLS Adj. R-squared: 0.304
Method: Least Squares F-statistic: 54.50
Date: Sun, 27 Nov 2022 Prob (F-statistic): 2.22e-38
Time: 14:41:37 Log-Likelihood: -315.96
No. Observations: 526 AIC: 641.9
Df Residuals: 521 BIC: 663.3
Df Model: 4
Covariance Type: HC3
======================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 0.1427 0.108 1.316 0.188 -0.070 0.355
C(married)[T.1.0] 0.1192 0.044 2.723 0.006 0.033 0.205
educ 0.0877 0.008 11.019 0.000 0.072 0.103
exper 0.0351 0.005 6.448 0.000 0.024 0.046
np.power(exper, 2) -0.0006 0.000 -5.298 0.000 -0.001 -0.000
==============================================================================
Omnibus: 5.419 Durbin-Watson: 1.792
Prob(Omnibus): 0.067 Jarque-Bera (JB): 7.126
Skew: 0.046 Prob(JB): 0.0284
Kurtosis: 3.563 Cond. No. 4.25e+03
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 4.25e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Plotting the regression line
Recall that seaborn gives us an easy way to compute a regplot but will not give us the coefficients of the regression line. Here, we explore how to compute the regression and plot the line. The pros: we know exactly what regression we are using and we can control it. The cons: it's more coding. This is the usual tradeoff of doing more work to ensure that we understand exactly what is being done.
In a two-variable regression (or in a scatter plot) we can plot the regression line. We use the data and the estimated coefficients. The estimated coefficients are in the params attribute of the results object.
If you finish early, try these next two problems on your own (don't look at my code). We will go through them in class together.
- Scatter plot the data (this is the same as part 2.) and add the regression line for the model
Again, this is the regression from part 3. You might still have the results in prac_results.
To plot the regression line you will need to create some x data and then apply the parameters. I used something like this
y = [p.Intercept + p.educ*i for i in x]
where p holds the parameters from my results and x holds a few x data points.
# Just in case we changed some stuff above, re-run the regression.
prac_results = smf.ols('np.log(wage) ~ educ', data=wage1).fit()
# Create the line of best fit to plot
p = prac_results.params # params from the model fit
print(p)
print(type(p))
Intercept 0.583773
educ 0.082744
dtype: float64
<class 'pandas.core.series.Series'>
# I'm drawing a line, so I only need two points.
x = [wage1['educ'].min(), wage1['educ'].max()]
y = [p.Intercept + p.educ*i for i in x]
fig, ax = plt.subplots(figsize=(10,6))
# Plot the data
ax.scatter(wage1['educ'], wage1['lwage'], marker='o', alpha = 0.5 )
# Plot the regression line.
ax.plot(x,y, color='black')
ax.set_xlabel('education', fontsize=14)
ax.set_ylabel('log wage', fontsize=14)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
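An alternative to building the line by hand: because the model was specified with a formula, the results object can evaluate the fitted line itself through .predict(), given a DataFrame with the regressors named as in the formula. A sketch:
# Let statsmodels apply the estimated coefficients for us.
x_line = pd.DataFrame({'educ': [wage1['educ'].min(), wage1['educ'].max()]})
y_line = prac_results.predict(x_line)   # predicted log(wage) at each educ value
# ax.plot(x_line['educ'], y_line, color='black') would draw the same line.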
- Add the equation for the regression line to the figure.
fig, ax = plt.subplots(figsize=(10,6))
# Plot the data
ax.scatter(wage1['educ'], wage1['lwage'], marker='o', alpha = 0.5 )
# Plot the regression line.
ax.plot(x,y, color='black')
# build the string
text = 'regression line: \nlog(w) = {0:.2} + {1:.2} educ'.format(p.Intercept, p.educ)
ax.text(0.5, 2.5, text, fontsize=14)
ax.set_xlabel('education', fontsize=14)
ax.set_ylabel('log wage', fontsize=14)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)