download notebook
view notebook w/ solutions

Regression with discrete dependent variables

files needed = ('pntsprd.dta', 'apple.dta')

We continue to learn about the statsmodels package (docs), which provides functions for formulating and estimating statistical models. This notebook will not address the models, per se, but will focus on how to take what you learned in econometrics class and use it in python.

In this notebook, we take on models in which the dependent variable is discrete. In the examples below, the dependent variable is binary (which makes it easier to visualize), but many of the techniques we demo here can be extended to dependent variables with a discrete number of values.

Here is a nice overview of the discrete choice models in statsmodels.

import pandas as pd                    # for data handling
import numpy as np                     # for numerical methods and data structures
import matplotlib.pyplot as plt        # for plotting
import seaborn as sea                  # advanced plotting

import statsmodels.formula.api as smf  # provides a way to directly spec models from formulas

from google.colab import drive
drive.mount('/content/drive')

Reading Stata data files

We will work on a problem from Wooldridge's textbook in econometrics. We read the (mercifully) cleaned data files using the pandas method .read_stata( ) that reads stata files.

The file 'pntsprd.dta' contains data about Vegas betting. The complete variable list is here. We will use favwin which is equal to 1 if the favored team won and zero otherwise and spread which holds the betting spread. In this context, a spread is the number of points that the favored team must beat the unfavored team by in order to be counted as a win by the favored team.

For example, if Team A if favored by 3 points over Team B, then Team A must win by more than 3 points for the bet on Team A to pay off.

# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
vegas = pd.read_stata('/content/drive/MyDrive/Data Class/6-logit/pntsprd.dta')

# Take a look...so clean!
vegas.head()

vegas.info()

fig, ax = plt.subplots(figsize=(15,6))

ax.scatter( vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='blue')

ax.set_ylabel('favored team outcome (win = 1, loss = 0)')
ax.set_xlabel('point spread')
ax.set_title('The data from the point spread dataset')

sea.despine(ax=ax)

OLS

We begin with the linear probability model. The model is

$$\text{Pr}(favwin=1 \mid spread) = \beta_0 + \beta_1 spread + \epsilon .$$

There is nothing new here technique-wise. We are estimating this with ols, see the ols notebook for a refresher.

# statsmodels adds a constant for us...
res_ols = smf.ols('favwin ~ spread', data=vegas).fit()

print(res_ols.summary())

Linear probability models have some problems. Perhaps the biggest one is that there is no guarantee that the predicted probability lies between zero and one!

We can use the fittedvalues attribute of the results object to recover the fitted values of the y variables. Let's plot them and take a look.

fig, ax = plt.subplots(figsize=(15,6))

ax.scatter(vegas['spread'], res_ols.fittedvalues,  facecolors='none', edgecolors='red')
ax.axhline(y=1.0, color='grey', linestyle='--')

ax.set_ylabel('pedict probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from an OLS model')

sea.despine(ax=ax, trim=True)

Logistic regression (logit)

The logistic regression passes the linear model through a non-linear function that constrains the output to lie between zero and one. (These functions are cumulative distribution functions.) In the logistic case, the function looks like

$$\text{prob} = \frac{\exp \left({\beta_0+\beta_1 spread}\right)}{1+\exp \left({\beta_0+\beta_1 spread}\right)},$$

and we predict a team wins when ever $\text{prob} \ge 0.5$.

We estimate the logit model with the logit( ) method from smf in a way similar to ols.

res_log = smf.logit('favwin ~ spread', data=vegas).fit()
print(res_log.summary())

Interpreting logit coefficients is bit more complicated. The probability that a team wins is given by the expression

$$\text{prob} = \frac{\exp \left({\hat{\beta}_0+\hat{\beta}_1 spread}\right)}{1+\exp \left({\hat{\beta}_0+\hat{\beta}_1 spread}\right)}$$

The res_log.fittedvalues holds the fitted value of $\hat{\beta}_0+\hat{\beta}_1 spread$ and not the estimated probability. Let's compute it using the exp method of numpy.

res_log.fittedvalues

pred_probs = np.exp(res_log.fittedvalues) / ( 1+np.exp(res_log.fittedvalues) )
pred_probs.describe()

Plot the estimated probabilty of the favored team winning and the actual data.

fig, ax = plt.subplots(figsize=(15,6))

ax.scatter(vegas['spread'], pred_probs,  facecolors='none', edgecolors='red', label='predicted')
ax.scatter(vegas['spread'], vegas['favwin'],  facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')

ax.set_ylabel('pedict probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from a logit model')

ax.legend(frameon=False, fontsize=16)
sea.despine(ax=ax, trim=True)

Practice

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.

Load the data 'apple.dta'. The data dictionary can be found here. The variable ecolbs is purchases of eco-friendly apples (whatever that means).
Create a variable named ecobuy that is equal to 1 if the observation has a positive purchase of eco-apples (i.e., ecolbs>0). This will be our left-hand side variable.

Remember that

apples['ecobuy'] = (apples['ecolbs']>0)

Will create a variable of `True` and `False`. Then you can use `.astype(int)` to convert your True/False to zero/one.

Estimate a linear probability model relating the probability of purchasing eco-apples to household characteristics.

$$\text{ecobuy} = \beta_0 + \beta_1 \text{ecoprc} + \beta_2 \text{regprc} + \beta_3 \text{faminc} + \beta_4 \text{hhsize} + \beta_5 \text{educ} + \beta_6 \text{age} + \epsilon$$

How many estimated probabilities are negative? Are greater than one?
Now estimate the model as a probit. A probit is similar to a logit in that we are passing the linear model through a nonlinear function. In this case, the nonlinear function is the cumulative density function of the normal distribution.

$$\text{Pr}(\text{ecobuy}=1 \mid X) = \Phi \left(\beta_0 + \beta_1 \text{ecoprc} + \beta_2 \text{regprc} + \beta_3 \text{faminc} + \beta_4 \text{hhsize} + \beta_5 \text{educ} + \beta_6 \text{age} \right),$$

where $\Phi( )$ is the CDF of the normal distribution. Try smf.probit( ).

Compute the marginal effects of the coefficients at the means and print them out using summary(). You can get the marginal effects from the results object using .get_margeff(at='mean) (docs).

The marginal effect tells us the marginal change in predicted probability as the independent variables change. What does the marginal effect on ecoprc (the price of eco apples) tell us?

Miniproject

Nowcasting is using high-frequency data to learn about current economic conditions. Since official economic data comes at a lag, we don't know what, say, current-quarter GDP growth will be until the quarter is long over.

The Atlanta FRB does something like this with its GDP Now measure.

Let's see if we can "nowcast" recessions.

The NBER recession indicator is a binary variable equal to 1 if the US is in a recession. This is what we will nowcast. Fred code: USRECD.
A measure of the term spread is the difference in the yield on a 10-year treasury and a 2-year treasury. Fred code: T10Y2Y.
A measure of the credit spread is the yield on BAA corporate debt minus the yield on AAA corporate debt. Fed codes: DBAA and DAAA.
Download the data from FRED. The data start in 1985.
Compute the credit spread variable by subtracting the AAA yield from the BAA yield.
Plot the recession indicator, the term spread, and the credit spread.
Estimate a discrete model. I used

'USRECD ~ credit_spread + T10Y2Y'

Compute the predicted probabilities.
Plot the predicted probabilities and the recession indicator.
How did your model do?

If you finish early, experiment with different models. Or, add more indicators. VIX? SP500 growth rate?

You could try to forecast the recessions, too. In that case, you want to regress the future value of the recession variables on the indicators.