Regression with discrete dependent variables
files needed = ('pntsprd.dta', 'apple.dta')
We continue to learn about the statsmodels package (docs), which provides functions for formulating and estimating statistical models. This notebook will not address the models, per se, but will focus on how to take what you learned in econometrics class and use it in Python.
In this notebook, we take on models in which the dependent variable is discrete. In the examples below, the dependent variable is binary (which makes it easier to visualize), but many of the techniques we demo here can be extended to dependent variables with a discrete number of values.
Here is a nice overview of the discrete choice models in statsmodels.
import pandas as pd # for data handling
import numpy as np # for numerical methods and data structures
import matplotlib.pyplot as plt # for plotting
import seaborn as sea # advanced plotting
import statsmodels.formula.api as smf # provides a way to directly spec models from formulas
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Reading Stata data files
We will work on a problem from Wooldridge's textbook in econometrics. We read the (mercifully) cleaned data files using the pandas method .read_stata( ), which reads Stata files.
The file 'pntsprd.dta' contains data about Vegas betting. The complete variable list is here. We will use favwin, which is equal to 1 if the favored team won and zero otherwise, and spread, which holds the betting spread. In this context, the spread is the number of points by which the favored team must beat the underdog for a bet on the favorite to pay off.
For example, if Team A is favored by 3 points over Team B, then Team A must win by more than 3 points for the bet on Team A to pay off.
# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
vegas = pd.read_stata('/content/drive/MyDrive/Data Class/6-logit/pntsprd.dta')
# Take a look...so clean!
vegas.head()
|   | favscr | undscr | spread | favhome | neutral | fav25 | und25 | fregion | uregion | scrdiff | sprdcvr | favwin |
|---|--------|--------|--------|---------|---------|-------|-------|---------|---------|---------|---------|--------|
| 0 | 72.0 | 61.0 | 7.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 4.0 | 11.0 | 1.0 | 1.0 |
| 1 | 82.0 | 74.0 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 8.0 | 1.0 | 1.0 |
| 2 | 87.0 | 57.0 | 17.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 30.0 | 1.0 | 1.0 |
| 3 | 69.0 | 70.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | -1.0 | 0.0 | 0.0 |
| 4 | 77.0 | 79.0 | 2.5 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0 | -2.0 | 0.0 | 0.0 |
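As a sanity check on the spread logic described above, here is a small sketch (our addition, not part of the Wooldridge exercise) that assumes scrdiff is the favored team's winning margin and sprdcvr flags whether the favored team covered the spread:

# Reconstruct the spread-covered flag and see how often it matches sprdcvr.
# Pushes (margin exactly equal to the spread) may be coded differently,
# so we report the match rate rather than assert a perfect match.
covered = (vegas['scrdiff'] > vegas['spread']).astype(int)
print((covered == vegas['sprdcvr']).mean())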
vegas.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 553 entries, 0 to 552
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 favscr 553 non-null float32
1 undscr 553 non-null float32
2 spread 553 non-null float32
3 favhome 553 non-null float32
4 neutral 553 non-null float32
5 fav25 553 non-null float32
6 und25 553 non-null float32
7 fregion 553 non-null float32
8 uregion 553 non-null float32
9 scrdiff 553 non-null float32
10 sprdcvr 553 non-null float32
11 favwin 553 non-null float32
dtypes: float32(12)
memory usage: 30.2 KB
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter( vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='blue')
ax.set_ylabel('favored team outcome (win = 1, loss = 0)')
ax.set_xlabel('point spread')
ax.set_title('The data from the point spread dataset')
sea.despine(ax=ax)
OLS
We begin with the linear probability model. The model is

\[favwin = \beta_0 + \beta_1 \, spread + \epsilon,\]

and the fitted value \(\hat{\beta}_0 + \hat{\beta}_1 spread\) is interpreted as the probability that the favored team wins. There is nothing new here technique-wise. We are estimating this with ols; see the ols notebook for a refresher.
# statsmodels adds a constant for us...
res_ols = smf.ols('favwin ~ spread', data=vegas).fit()
print(res_ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: favwin R-squared: 0.111
Model: OLS Adj. R-squared: 0.109
Method: Least Squares F-statistic: 68.57
Date: Wed, 29 Nov 2023 Prob (F-statistic): 9.32e-16
Time: 15:38:46 Log-Likelihood: -279.29
No. Observations: 553 AIC: 562.6
Df Residuals: 551 BIC: 571.2
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5769 0.028 20.434 0.000 0.521 0.632
spread 0.0194 0.002 8.281 0.000 0.015 0.024
==============================================================================
Omnibus: 86.055 Durbin-Watson: 2.112
Prob(Omnibus): 0.000 Jarque-Bera (JB): 94.402
Skew: -0.956 Prob(JB): 3.17e-21
Kurtosis: 2.336 Cond. No. 20.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Linear probability models have some problems. Perhaps the biggest one is that there is no guarantee that the predicted probability lies between zero and one!
We can use the fittedvalues attribute of the results object to recover the fitted values of the y variable. Let's plot them and take a look.
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter(vegas['spread'], res_ols.fittedvalues, facecolors='none', edgecolors='red')
ax.axhline(y=1.0, color='grey', linestyle='--')
ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from an OLS model')
sea.despine(ax=ax, trim=True)
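The picture makes the problem easy to see, and a quick tally makes it concrete. This is a minimal check using the same fittedvalues attribute:

# Count fitted "probabilities" that fall outside the unit interval.
fitted_vegas = res_ols.fittedvalues
out_of_bounds = fitted_vegas[(fitted_vegas > 1) | (fitted_vegas < 0)]
print(f'{len(out_of_bounds)} of {len(fitted_vegas)} fitted values lie outside [0, 1]')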
Logistic regression (logit)
The logistic regression passes the linear model through a non-linear function that constrains the output to lie between zero and one. (These functions are cumulative distribution functions.) In the logistic case, the function looks like

\[\text{prob}(favwin = 1) = \frac{e^{\beta_0 + \beta_1 spread}}{1 + e^{\beta_0 + \beta_1 spread}},\]

and we predict a team wins whenever \(\text{prob} \ge 0.5\).
We estimate the logit model with the logit( ) method from smf in a way similar to ols.
res_log = smf.logit('favwin ~ spread', data=vegas).fit()
print(res_log.summary())
Optimization terminated successfully.
Current function value: 0.477218
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: favwin No. Observations: 553
Model: Logit Df Residuals: 551
Method: MLE Df Model: 1
Date: Wed, 29 Nov 2023 Pseudo R-squ.: 0.1283
Time: 15:38:47 Log-Likelihood: -263.90
converged: True LL-Null: -302.75
Covariance Type: nonrobust LLR p-value: 1.201e-18
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.0712 0.173 -0.411 0.681 -0.411 0.268
spread 0.1632 0.023 7.236 0.000 0.119 0.207
==============================================================================
Interpreting logit coefficients is a bit more complicated. The probability that a team wins is given by the expression

\[\widehat{\text{prob}} = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 spread}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 spread}}.\]

The res_log.fittedvalues attribute holds the fitted value of \(\hat{\beta}_0+\hat{\beta}_1 spread\), not the estimated probability. Let's compute the probability using numpy's exp function.
res_log.fittedvalues
0 1.071426
1 1.071426
2 2.703687
3 1.397878
4 0.336908
...
548 0.336908
549 1.316265
550 2.214008
551 0.663360
552 0.418521
Length: 553, dtype: float64
pred_probs = np.exp(res_log.fittedvalues) / ( 1+np.exp(res_log.fittedvalues) )
pred_probs.describe()
count 553.000000
mean 0.763110
std 0.151419
min 0.523001
25% 0.622491
50% 0.774632
75% 0.901500
max 0.998157
dtype: float64
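As an aside, statsmodels can apply this transformation for us: calling .predict( ) with no arguments on the fitted logit results evaluates the model at the estimation sample and returns probabilities (not the linear index), so it should agree with our pred_probs:

# .predict() returns probabilities for logit models, so this should print True.
print(np.allclose(res_log.predict(), pred_probs))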
Plot the estimated probability of the favored team winning and the actual data.
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter(vegas['spread'], pred_probs, facecolors='none', edgecolors='red', label='predicted')
ax.scatter(vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')
ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from a logit model')
ax.legend(frameon=False, fontsize=16)
sea.despine(ax=ax, trim=True)
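Recall the decision rule from above: predict that the favored team wins whenever the estimated probability is at least 0.5. Here is a minimal sketch of how that rule performs in-sample (this diagnostic is our addition to the exercise):

# Classify each game with the 0.5 cutoff and compare to the actual outcomes.
predicted_win = (pred_probs >= 0.5).astype(int)
accuracy = (predicted_win == vegas['favwin']).mean()
print(f'The 0.5 rule classifies {accuracy:.1%} of games correctly.')

Notice from the describe( ) output above that the smallest predicted probability is about 0.52, so the rule predicts a win in every game; the accuracy it reports is just the share of games the favored team actually won.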
Practice
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.
- Load the data 'apple.dta'. The data dictionary can be found here. The variable ecolbs is purchases of eco-friendly apples (whatever that means).
apples = pd.read_stata('/content/drive/MyDrive/Data Class/6-logit/apple.dta')
apples.head()
|   | id | educ | date | state | regprc | ecoprc | inseason | hhsize | male | faminc | age | reglbs | ecolbs | numlt5 | num5_17 | num18_64 | numgt64 |
|---|----|------|------|-------|--------|--------|----------|--------|------|--------|-----|--------|--------|--------|---------|----------|---------|
| 0 | 10002 | 16 | 111597 | SD | 1.19 | 1.19 | 1 | 4 | 0 | 45 | 43 | 2.0 | 2.000000 | 0 | 1 | 3 | 0 |
| 1 | 10004 | 16 | 121897 | KS | 0.59 | 0.79 | 0 | 1 | 0 | 65 | 37 | 0.0 | 2.000000 | 0 | 0 | 1 | 0 |
| 2 | 10034 | 18 | 111097 | MI | 0.59 | 0.99 | 1 | 3 | 0 | 65 | 44 | 0.0 | 2.666667 | 0 | 2 | 1 | 0 |
| 3 | 10035 | 12 | 111597 | TN | 0.89 | 1.09 | 1 | 2 | 1 | 55 | 55 | 3.0 | 0.000000 | 0 | 0 | 2 | 0 |
| 4 | 10039 | 15 | 122997 | NY | 0.89 | 1.09 | 0 | 1 | 1 | 25 | 22 | 0.0 | 3.000000 | 0 | 0 | 1 | 0 |
- Create a variable named ecobuy that is equal to 1 if the observation has a positive purchase of eco-apples (i.e., ecolbs>0). This will be our left-hand side variable.
Remember that `apples['ecobuy'] = (apples['ecolbs']>0)` will create a variable of `True` and `False`. Then you can use `.astype(int)` to convert your True/False to zero/one.
# The comparison returns a boolean value.
# Convert the boolean to an integer.
# This takes advantage of the fact that bools are True==1 and False==0.
apples['ecobuy'] = (apples['ecolbs'] > 0).astype(int)
apples['ecobuy'].unique()
array([1, 0])
- Estimate a linear probability model relating the probability of purchasing eco-apples to household characteristics.
apple_ols = smf.ols('ecobuy ~ ecoprc + regprc + faminc + hhsize + educ + age', data=apples).fit()
print(apple_ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ecobuy R-squared: 0.110
Model: OLS Adj. R-squared: 0.102
Method: Least Squares F-statistic: 13.43
Date: Wed, 29 Nov 2023 Prob (F-statistic): 2.18e-14
Time: 15:38:48 Log-Likelihood: -419.60
No. Observations: 660 AIC: 853.2
Df Residuals: 653 BIC: 884.6
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.4237 0.165 2.568 0.010 0.100 0.748
ecoprc -0.8026 0.109 -7.336 0.000 -1.017 -0.588
regprc 0.7193 0.132 5.464 0.000 0.461 0.978
faminc 0.0006 0.001 1.042 0.298 -0.000 0.002
hhsize 0.0238 0.013 1.902 0.058 -0.001 0.048
educ 0.0248 0.008 2.960 0.003 0.008 0.041
age -0.0005 0.001 -0.401 0.689 -0.003 0.002
==============================================================================
Omnibus: 4015.360 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 69.344
Skew: -0.411 Prob(JB): 8.75e-16
Kurtosis: 1.641 Cond. No. 724.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- How many estimated probabilities are negative? How many are greater than one?
fitted = apple_ols.fittedvalues # store the fitted values
fitted[(fitted>1) | (fitted<0)] # greater than 1 or less than zero
167 1.070860
493 1.054372
dtype: float64
- Now estimate the model as a probit. A probit is similar to a logit in that we are passing the linear model through a nonlinear function. In this case, the nonlinear function is the cumulative distribution function of the normal distribution,

\[\text{prob}(ecobuy = 1) = \Phi(\beta_0 + \beta_1 ecoprc + \beta_2 regprc + \cdots),\]

where \(\Phi(\ )\) is the CDF of the standard normal distribution. Try smf.probit( ).
apple_pro = smf.probit('ecobuy ~ ecoprc + regprc + faminc + hhsize + educ + age', data=apples).fit()
print(apple_pro.summary())
Optimization terminated successfully.
Current function value: 0.604599
Iterations 5
Probit Regression Results
==============================================================================
Dep. Variable: ecobuy No. Observations: 660
Model: Probit Df Residuals: 653
Method: MLE Df Model: 6
Date: Wed, 29 Nov 2023 Pseudo R-squ.: 0.08664
Time: 15:38:48 Log-Likelihood: -399.04
converged: True LL-Null: -436.89
Covariance Type: nonrobust LLR p-value: 2.751e-14
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.2438 0.474 -0.514 0.607 -1.173 0.685
ecoprc -2.2669 0.321 -7.052 0.000 -2.897 -1.637
regprc 2.0302 0.382 5.318 0.000 1.282 2.778
faminc 0.0014 0.002 0.932 0.351 -0.002 0.004
hhsize 0.0691 0.037 1.893 0.058 -0.002 0.141
educ 0.0714 0.024 2.939 0.003 0.024 0.119
age -0.0012 0.004 -0.340 0.734 -0.008 0.006
==============================================================================
- Compute the marginal effects of the coefficients at the means and print them out using summary(). You can get the marginal effects from the results object using .get_margeff(at='mean') (docs).
The marginal effect tells us the marginal change in predicted probability as the independent variables change. What does the marginal effect on ecoprc (the price of eco apples) tell us?
probit_marg = apple_pro.get_margeff(at='mean')
print(probit_marg.summary())
Probit Marginal Effects
=====================================
Dep. Variable: ecobuy
Method: dydx
At: mean
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ecoprc -0.8508 0.120 -7.087 0.000 -1.086 -0.615
regprc 0.7619 0.143 5.334 0.000 0.482 1.042
faminc 0.0005 0.001 0.932 0.351 -0.001 0.002
hhsize 0.0259 0.014 1.894 0.058 -0.001 0.053
educ 0.0268 0.009 2.941 0.003 0.009 0.045
age -0.0005 0.001 -0.340 0.734 -0.003 0.002
==============================================================================
# The marginal effect is the slope of the predictive function.
# Not surprisingly, the marginal effect is negative. As the price of eco apples increases, the probability that a household
# buys them decreases.
Miniproject
Nowcasting is using high-frequency data to learn about current economic conditions. Since official economic data comes at a lag, we don't know what, say, current-quarter GDP growth will be until the quarter is long over.
The Atlanta FRB does something like this with its GDP Now measure.
Let's see if we can "nowcast" recessions.
- The NBER recession indicator is a binary variable equal to 1 if the US is in a recession. This is what we will nowcast. Fred code: USRECD.
- A measure of the term spread is the difference in the yield on a 10-year treasury and a 2-year treasury. Fred code: T10Y2Y.
- A measure of the credit spread is the yield on BAA corporate debt minus the yield on AAA corporate debt. Fred codes: DBAA and DAAA.
- Download the data from FRED. The data start in 1985.
- Compute the credit spread variable by subtracting the AAA yield from the BAA yield.
- Plot the recession indicator, the term spread, and the credit spread.
import pandas_datareader.data as web
rec = web.DataReader(['USRECD', 'DBAA', 'DAAA', 'T10Y2Y'], 'fred', start = '1985-01-01')
rec['credit_spread'] = rec['DBAA'] - rec['DAAA']
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(rec.index, rec['T10Y2Y'], color = 'red', label='Term spread')
ax.plot(rec.index, rec['credit_spread'], color='blue', label='Credit spread')
ax2 = ax.twinx()
ax2.plot(rec.index, rec['USRECD'], color = 'black', label='Recession\n(right axis)')
ax.set_ylabel('yield (%)', fontsize=16)
ax2.set_ylabel('recession = 1', fontsize=16)
ax.legend(frameon=False, fontsize=16, bbox_to_anchor=(0.65, 1))
ax2.legend(frameon=False, fontsize=16, loc='upper left')
for a in [ax, ax2]:
a.tick_params(axis='both', labelsize=14)
plt.show()
- Estimate a discrete model. I used 'USRECD ~ credit_spread + T10Y2Y'.
- Compute the predicted probabilities.
- Plot the predicted probabilities and the recession indicator.
- How did your model do?
res_rec = smf.logit('USRECD ~ credit_spread + T10Y2Y', data=rec).fit()
rec_probs = np.exp(res_rec.fittedvalues) / ( 1+np.exp(res_rec.fittedvalues) )
Optimization terminated successfully.
Current function value: 0.194769
Iterations 8
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(rec_probs.index, rec_probs, color = 'blue', label='Recession\nprob.')
ax2 = ax.twinx()
ax2.plot(rec.index, rec['USRECD'], color = 'black', label='Recession\n(right axis)')
ax.set_ylabel('yield (%)', fontsize=16)
ax2.set_ylabel('recession = 1', fontsize=16)
ax.legend(frameon=False, fontsize=16, bbox_to_anchor=(0.18, 0.9))
ax2.legend(frameon=False, fontsize=16, loc='upper left')
for a in [ax, ax2]:
a.tick_params(axis='both', labelsize=14)
plt.show()
If you finish early, experiment with different models. Or, add more indicators. VIX? S&P 500 growth rate?
You could try to forecast the recessions, too. In that case, you want to regress the future value of the recession variable on the indicators.
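Here is a minimal sketch of the forecasting idea. The 90-row lead is an arbitrary horizon choice (roughly one quarter of daily observations), not part of the original exercise:

# Pair today's spreads with the recession state about 90 rows ahead.
# A careful treatment would shift by a fixed calendar offset instead.
rec['USRECD_lead'] = rec['USRECD'].shift(-90)

# The formula interface drops the rows with missing values created by the shift.
res_fcast = smf.logit('USRECD_lead ~ credit_spread + T10Y2Y', data=rec).fit()
print(res_fcast.summary())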