
Exporting regression results

files needed = ('sleep75.dta', 'wage1.dta')

You've run some regressions. Now you need to get the results of your regressions into a file for publication. This usually means a table of results. Tables are no different from figures: Your tables should help the reader understand the points you are trying to make and they should display graphical excellence.

The biggest enemy of graphical excellence in a table is including too much, especially if the table is part of the main text.

import pandas as pd                    # for data handling
import numpy as np                     # for numerical methods and data structures
import statsmodels.formula.api as smf  # provides a way to directly spec models from formulas

Regression 1

How do hours of sleep vary with working? Do we trade off sleep for work? We control for education and age.

$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \epsilon. $$

[This is in problem 3, chapter 3 in Wooldridge.]

sleep = pd.read_stata('sleep75.dta')
results1 = smf.ols('sleep ~ totwrk + educ + age', data=sleep).fit()
print(results1.summary2())
                  Results: Ordinary least squares
===================================================================
Model:              OLS              Adj. R-squared:     0.110     
Dependent Variable: sleep            AIC:                10534.2126
Date:               2020-11-23 14:18 BIC:                10552.4511
No. Observations:   706              Log-Likelihood:     -5263.1   
Df Model:           3                F-statistic:        29.92     
Df Residuals:       702              Prob (F-statistic): 3.28e-18  
R-squared:          0.113            Scale:              1.7586e+05
-------------------------------------------------------------------
                Coef.   Std.Err.    t    P>|t|    [0.025    0.975] 
-------------------------------------------------------------------
Intercept     3638.2453 112.2751 32.4047 0.0000 3417.8101 3858.6805
totwrk          -0.1484   0.0167 -8.8881 0.0000   -0.1811   -0.1156
educ           -11.1338   5.8846 -1.8920 0.0589  -22.6873    0.4197
age              2.1999   1.4457  1.5217 0.1285   -0.6386    5.0383
-------------------------------------------------------------------
Omnibus:               68.731       Durbin-Watson:          1.943  
Prob(Omnibus):         0.000        Jarque-Bera (JB):       185.551
Skew:                  -0.496       Prob(JB):               0.000  
Kurtosis:              5.308        Condition No.:          16553  
===================================================================
* The condition number is large (2e+04). This might indicate
strong multicollinearity or other numerical problems.

Well, here is a table. One approach would be to take a screenshot of the table and insert it as a figure into a document. Do not do this. Why?

  1. This table contains too much information, especially for a non-academic audience. Should you include the Jarque-Bera statistic? The Omnibus statistic? Will your reader know what these are?

  2. Your screenshot will suffer from the same problem that png plots do: it becomes blurry when zoomed.

The correct approach is to export the results you want as text and include them in your document. How should we do that? We could copy over the numbers manually, which is prone to error and tedious. Or, we could use some methods from statsmodels to generate the table for us. That sounds like a better approach.
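Before turning to the table helpers, it helps to see that a fitted results object already exposes its numbers as pandas objects, so nothing needs to be retyped by hand. The sketch below is a minimal illustration on synthetic data; the data frame `demo`, its coefficients, and the name `small` are made up for the example and are not from sleep75.dta.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption; the real data are in sleep75.dta).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'totwrk': rng.normal(2000, 900, 200),
    'educ': rng.integers(8, 18, 200),
})
demo['sleep'] = (3600 - 0.15 * demo['totwrk'] - 10 * demo['educ']
                 + rng.normal(0, 400, 200))

res = smf.ols('sleep ~ totwrk + educ', data=demo).fit()

# Keep only what a reader needs: estimate, standard error, p-value.
small = pd.DataFrame({
    'coef': res.params,
    'std err': res.bse,
    'p-value': res.pvalues,
}).round(3)

print(small)
```

This gives full control over what is reported, but you have to build the journal-style layout yourself.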

Regression 2

Let's add a second regression so we can build a table with two sets of results. This is commonly found in academic literature.

results2 = smf.ols('sleep ~ totwrk + educ + np.log(age)', data=sleep).fit()
print(results2.summary2())
                  Results: Ordinary least squares
===================================================================
Model:              OLS              Adj. R-squared:     0.109     
Dependent Variable: sleep            AIC:                10534.5698
Date:               2020-11-23 14:18 BIC:                10552.8082
No. Observations:   706              Log-Likelihood:     -5263.3   
Df Model:           3                F-statistic:        29.79     
Df Residuals:       702              Prob (F-statistic): 3.91e-18  
R-squared:          0.113            Scale:              1.7595e+05
-------------------------------------------------------------------
                Coef.   Std.Err.    t    P>|t|    [0.025    0.975] 
-------------------------------------------------------------------
Intercept     3440.9308 239.4477 14.3703 0.0000 2970.8113 3911.0503
totwrk          -0.1489   0.0167 -8.9239 0.0000   -0.1817   -0.1162
educ           -11.3493   5.8811 -1.9298 0.0540  -22.8961    0.1974
np.log(age)     79.2414  56.6123  1.3997 0.1620  -31.9083  190.3910
-------------------------------------------------------------------
Omnibus:               68.734       Durbin-Watson:          1.944  
Prob(Omnibus):         0.000        Jarque-Bera (JB):       184.854
Skew:                  -0.497       Prob(JB):               0.000  
Kurtosis:              5.301        Condition No.:          36127  
===================================================================
* The condition number is large (4e+04). This might indicate
strong multicollinearity or other numerical problems.

Exporting tables

statsmodels has an iolib module with useful methods for getting your results out of python and into a word processor. It is not very well documented.

from statsmodels.iolib.summary2 import summary_col

# docs: https://www.kite.com/python/docs/statsmodels.iolib.summary2.summary_col

summary_col()

summary_col() takes several model results and turns them into the kind of table you see in journals. The syntax:

  • The first input is a list with the results objects from the regressions. The order they are listed is the order in the table.
  • model_names is the header text above each object.
  • stars=True gives levels of significance stars
  • regressor_order is a list of the regressor names to determine their order. Would you like the intercept first, or last?
  • If drop_omitted=True, then any regressor not in regressor_order will be dropped. Don't want the intercept at all? Set drop_omitted=True and leave it out of regressor_order.
  • float_format controls the format of the floats. The format string follows the standard syntax.
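To make the interaction between regressor_order and drop_omitted concrete, here is a minimal sketch on synthetic data. The variables `x1`, `x2`, `y` and the model names 'Short' and 'Long' are invented for the illustration. The intercept is left out of regressor_order, so with drop_omitted=True it should not appear in the table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

# Synthetic stand-in data (an assumption, not the sleep data).
rng = np.random.default_rng(1)
demo = pd.DataFrame({'x1': rng.normal(size=300), 'x2': rng.normal(size=300)})
demo['y'] = 1.0 + 2.0 * demo['x1'] - 0.5 * demo['x2'] + rng.normal(size=300)

res_a = smf.ols('y ~ x1', data=demo).fit()
res_b = smf.ols('y ~ x1 + x2', data=demo).fit()

demo_table = summary_col(
    [res_a, res_b],
    model_names=['Short', 'Long'],
    stars=True,
    regressor_order=['x1', 'x2'],   # Intercept deliberately left out...
    drop_omitted=True,              # ...so it is dropped from the table
    float_format='%0.3f',           # three decimals instead of two
)

demo_text = demo_table.as_text()
print(demo_text)
```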

After creating the table I use the add_title() and add_text() methods to add a title and any additional notes to the table.

I print out the table as text using the as_text() method.

table = summary_col(
    [results1, results2], 
    model_names = ['Sleep on age', 'Sleep on log(age)'],
    stars=True, 
    regressor_order = ['Intercept', 'totwrk', 'educ', 'age',  'np.log(age)'],
    float_format='%0.2f',
    drop_omitted = True
    )

table.add_title('Sleep and work')
table.add_text('More text to append.')

print(table.as_text())
               Sleep and work
=============================================
               Sleep on age Sleep on log(age)
---------------------------------------------
Intercept      3638.25***   3440.93***       
               (112.28)     (239.45)         
totwrk         -0.15***     -0.15***         
               (0.02)       (0.02)           
educ           -11.13*      -11.35*          
               (5.88)       (5.88)           
age            2.20                          
               (1.45)                        
np.log(age)                 79.24            
                            (56.61)          
R-squared      0.11         0.11             
R-squared Adj. 0.11         0.11             
=============================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
More text to append.

This looks pretty good.

In my installation, I have a known bug. The output contains the \(R^2\) and Adjusted \(R^2\). It is not supposed to do this! When I tested this on winstat I did not get this bug. Winstat must be running a different version than I am. The good news for me is that I can just erase those two lines of text in the output and not worry about it. Still, I am looking forward to getting the bug fixed.
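Whether the \(R^2\) rows appear seems to depend on the statsmodels version. As far as I can tell, some releases of summary_col() accept an include_r2 argument that switches these rows off. Treat that as an assumption and check your own installation. The sketch below, on synthetic data, only passes the argument if the installed summary_col() actually has it.

```python
import inspect

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

# Synthetic stand-in data (an assumption, not the sleep data).
rng = np.random.default_rng(2)
demo = pd.DataFrame({'x': rng.normal(size=200)})
demo['y'] = 0.5 + 1.5 * demo['x'] + rng.normal(size=200)
res = smf.ols('y ~ x', data=demo).fit()

# Only pass include_r2 if this statsmodels version knows about it.
supports_flag = 'include_r2' in inspect.signature(summary_col).parameters
extra = {'include_r2': False} if supports_flag else {}

no_r2_text = summary_col([res], model_names=['Demo'], **extra).as_text()
print(no_r2_text)
```

If `supports_flag` is False on your machine, erasing the two lines by hand remains the fallback.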

How do I get the text out of python and into a file? We use python's standard open() and write() methods to write to a text file.

# The 'w' is for 'write'.
# The with block closes the file for us, even if write() raises an error.
with open('table1.txt', 'w') as fout:
    fout.write(table.as_text())

Go take a look in your cwd (or whatever folder you saved the file to) and open it up.
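If you are not sure where your cwd is, python can tell you. The sketch below writes a placeholder string (standing in for table.as_text()) to a hypothetical file name, prints where it landed, and reads it back.

```python
from pathlib import Path

# Stand-in for table.as_text(); the real table comes from summary_col().
demo_text = "Sleep and work\n(placeholder for the table text)\n"

out_path = Path('table1_demo.txt')   # hypothetical file name
out_path.write_text(demo_text)

print(Path.cwd())              # the folder python is writing to
print(out_path.resolve())      # full path of the file just written
print(out_path.read_text())    # read it back to confirm the contents
```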

File formats

We can export the table as text, latex, or html. We have already exported text using the .as_text() method of the table object returned by summary_col().

Latex

The title becomes the \caption{} in the latex code and the label argument of as_latex() becomes the \label{}.

# Create latex table. 

table = summary_col(
    [results1, results2], 
    model_names = ['Sleep on age', 'Sleep on log(age)'],
    stars=True, 
    regressor_order = ['Intercept', 'totwrk', 'educ', 'age',  'np.log(age)'],
    float_format='%0.2f',
    drop_omitted = True,
    )

table.add_title('Sleep and work')
table.add_text('More text to append.')

with open('table1.tex', 'w') as fout:
    fout.write(table.as_latex(label='tab:reg1'))

Go take a look at the output in your cwd. It looks pretty good, but the extra text at the bottom is missing. It seems that statsmodels is still rough around the edges compared to some of the advanced table packages in Stata.

HTML

This would be useful if posting to a blog or webpage.

# Create html table. 

table = summary_col(
    [results1, results2], 
    model_names = ['Sleep on age', 'Sleep on log(age)'],
    stars=True, 
    regressor_order = ['Intercept', 'totwrk', 'educ', 'age',  'np.log(age)'],
    float_format='%0.2f',
    drop_omitted = True
    )

table.add_title('Sleep and work')
table.add_text('More text to append.')

with open('table1.html', 'w') as fout:
    fout.write(table.as_html())

Post export

Once you have the tables exported as text, you can paste them into your latex document. You will likely need to tweak the code to get the table exactly right. For example, I use the booktabs package to make my tables so I would need to modify the way that horizontal lines are drawn. (Never use vertical lines in a table.)
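The booktabs tweak can be scripted rather than done by hand. The sketch below assumes the exported tabular draws its rules with \hline on lines of their own, so check that against your own .tex file. The helper `to_booktabs` and the sample string `demo_latex` are made up for the illustration.

```python
def to_booktabs(latex: str) -> str:
    """Swap \\hline rules for booktabs rules: first -> top, last -> bottom."""
    lines = latex.splitlines()
    rule_rows = [i for i, line in enumerate(lines) if line.strip() == r'\hline']
    for pos, i in enumerate(rule_rows):
        if pos == 0:
            lines[i] = r'\toprule'
        elif pos == len(rule_rows) - 1:
            lines[i] = r'\bottomrule'
        else:
            lines[i] = r'\midrule'
    return '\n'.join(lines)


# Hand-written stand-in for the exported tabular (an assumption about its shape).
demo_latex = r"""\begin{tabular}{lll}
\hline
          & Sleep on age & Sleep on log(age) \\
\hline
totwrk    & -0.15***     & -0.15***          \\
          & (0.02)       & (0.02)            \\
\hline
\end{tabular}"""

booktabs_latex = to_booktabs(demo_latex)
print(booktabs_latex)
```

The result needs \usepackage{booktabs} in the preamble of your latex document.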

If you are using word, the simplest approach seems to be to paste the latex code into a word document, highlight the text, and use the 'convert text to table' button to make the table. Specify that the columns are split on the '&' and enter the appropriate number of columns. Then you can start formatting the table. This approach is more labor intensive than latex, which is why most academic STEM types write in latex.
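A possible alternative for word users is to skip latex and go through a spreadsheet. This assumes that the object returned by summary_col() keeps its underlying pandas DataFrame in a list called tables. That holds in the versions I am aware of, but it is not prominently documented, so treat it as an assumption. The sketch uses synthetic data and made-up names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

# Synthetic stand-in data (an assumption, not the sleep data).
rng = np.random.default_rng(3)
demo = pd.DataFrame({'x': rng.normal(size=200)})
demo['y'] = 0.5 + 1.5 * demo['x'] + rng.normal(size=200)
res = smf.ols('y ~ x', data=demo).fit()

demo_table = summary_col([res], model_names=['Demo'], float_format='%0.2f')

# Assumption: the formatted table is stored as the first DataFrame in .tables.
frame = demo_table.tables[0]

# to_csv() with no path returns the csv as a string;
# pass a file name instead to write a file a spreadsheet can open.
csv_text = frame.to_csv()
print(csv_text)
```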

Overall, the statsmodels iolib is a good start at exporting tables but still has a long way to go before it is as well-developed as Stata's table export packages. This is a place where python is behind...