Introduction

You just estimated a regression, got results, and now either your model or your colleague suggests you change or elaborate on the specification. Repeat that a few times and reporting results quickly becomes a headache. It simply is not practical or palatable to report every single full regression table. You begin to sweat at the brow in yearning for some relief.

Smoke, mirrors, and a caffeinated and bespectacled mouse drops into your room. The mouse exclaims, “I have the aid you seek, you sweaty beast!”

Rarely will I offer kind words about STATA (or SAS) but there is one rather nifty functionality it offers that R, Python, and a few other open source tools still seem to lack: ability to run, organize, and report estimates from multiple regressions in a relatively high level yet flexible framework. By no means do I try to recreate STATA’s framework but I do exert effort to lighten your burden in case you have similar needs.

In this notebook, I mostly present the use of the functions but if you are curious about the source code, it’s available on GitHub

A non-confidential example

We’ll use the 2015 American Housing Survey subset of Raleigh, NC observations ( sample was uploaded to GitHub ) Suppose we are curious whether housing units deemed severely inadequate by the US Housing Authority fetch for lower rent. I estimate two specifications:

  • with structural characteristics

  • with structural and occupant characteristics

Specifically, for structural variables I include STORIES, BEDROOMS, and GARAGE while for occupant characteristics I include lengthTenure, NUMADULTS, HHAGE, and HHSEX. The variable of interest is severelyInaquate.

Suppose you have already loaded the functions and modules below then, we simply do the below

# independent variables
structural = ['BEDROOMS','GARAGE','STORIES','severelyInadequate']
occupant = ['HHAGE','NUMADULTS','lengthTenure']

# make two specifications in Patsy format
xList = ['+'.join(occupant)+"+"+"+".join(structural),
         '+'.join(structural)]

# estimate the regressions and get output
runRegressions('RENT',xList,key_vars=['severelyInadequate'], data = df,
                            specification_names = ['demographics and structure','structure'])
R^2 adj cond. num nObs severelyInadequate
specification
demographics and structure 0.149 365.514 711 -95.0526 (0.602)
structure 0.123 27.6683 711 -86.4810 (0.656)

Note that in the above case, I specified a name for each model specification but I don’t have to:

# estimate the regressions and get output
runRegressions('RENT',xList,key_vars=['severelyInadequate'], data = df)

R^2 adj cond. num nObs severelyInadequate
specification
0 0.149 365.514 711 -95.0526 (0.602)
1 0.123 27.6683 711 -86.4810 (0.656)
help(runRegressions)
Help on function runRegressions in module __main__:

runRegressions(y, xList, key_vars, data, specification_names=None)
    y - dependent variable (string)
    xList - list of explanatory variable strings (list of Patsy compatible formulas)
    key_vars - list of independent variables you wish to see results for
    df - a Pandas DataFrame that contains the data (list)
    specification_names - list of regression names (list of strings)
    
    Description: runRegressions estimates multiple regression specifications
    and returns coefficent estimate for key_vars, p-value, and other stats for each
    specification in a single table.

Full Table Yikes

Here is what a full statsmodels regression summary table looks like. Imagine asking yourself or your readers to read through many of these - especially if you are only interested in a small subset of the variables reported below. (scream face)

OLS Regression Results
Dep. Variable: RENT R-squared: 0.158
Model: OLS Adj. R-squared: 0.149
Method: Least Squares F-statistic: 12.01
Date: Wed, 10 Jul 2019 Prob (F-statistic): 1.61e-14
Time: 09:31:52 Log-Likelihood: -5442.8
No. Observations: 711 AIC: 1.090e+04
Df Residuals: 703 BIC: 1.094e+04
Df Model: 7
Covariance Type: HC0
coef std err z P>|z| [0.025 0.975]
Intercept -1.6921 129.803 -0.013 0.990 -256.102 252.718
GARAGE[T.Yes] 94.5023 53.770 1.758 0.079 -10.884 199.889
HHAGE 4.7165 2.014 2.342 0.019 0.770 8.663
NUMADULTS 91.5411 31.927 2.867 0.004 28.964 154.118
lengthTenure -13.3735 6.220 -2.150 0.032 -25.564 -1.183
BEDROOMS 134.4728 32.704 4.112 0.000 70.375 198.571
STORIES 139.3267 26.671 5.224 0.000 87.052 191.601
severelyInadequate -95.0526 182.032 -0.522 0.602 -451.830 261.724
Omnibus: 414.521 Durbin-Watson: 1.930
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4054.431
Skew: 2.477 Prob(JB): 0.00
Kurtosis: 13.598 Cond. No. 366.



Warnings:
[1] Standard Errors are heteroscedasticity robust (HC0)

The Code

Loading the data

#  load sample data (2015 AHS subsample for Raleigh, NC renters) from github
data_url = 'https://raw.githubusercontent.com/kiwiPhrases/multiple-regressions/master/AHS_2015_National_Metropolitan_renters_raleigh.csv'
df = pd.read_csv(data_url)

# drop columns that are all null
allNulls = df.columns[df.isnull().all()]
df.drop(allNulls, axis=1, inplace=True)

Functions

This isn’t a package yet so we can’t just import it like a regular module (yet)

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def runRegression(y,x,data, cov_type='HC0'):
    print("Covariance type: %s" %cov_type )
    form = '{0} ~ {1}'.format(y,x)
    mod = smf.ols(formula=form, data=data)
    res = mod.fit(cov_type=cov_type)
    return(res)

def runQuantRegression(y,x,data,quant=.5):
    form = '{0} ~ {1}'.format(y,x)
    mod = smf.quantreg(formula=form, data=data)
    res = mod.fit(quant=.5)
    return(res)

def runRegressions(y, xList, key_vars, data, specification_names=None, cov_type='HC0', pvalues=True):
    """
    y - dependent variable (string)
    xList - list of explanatory variable strings (list of Patsy compatible formulas)
    key_vars - list of independent variables you wish to see results for
    df - a Pandas DataFrame that contains the data (list)
    specification_names - list of regression names (list of strings)
    
    Description: runRegressions estimates multiple regression specifications
    and returns coefficent and p-values estimates for key_vars and regressions
    stats for each specification in a single table. 
    """
    # if no custom names are given, use numbers
    if specification_names is None:
        specification_names = [str(i) for i in range(len(xList))]
    if len(specification_names) != len(xList):
        print('specification names not the same length as independent variables list')
        return(None)
        
    # run regressions    
    regDict = {}
    for x, name in zip(xList, specification_names):
        print("Estimating specification: %s" %name)
        regDict[name] = runRegression(y,x,data, cov_type='HC0')
    print("Estimation done")   
    
    # extract estimates:
    print("Extracting results")
    outDict = {}    
    for key in regDict.keys():
        # get regression statistics
        outDict[key] = {'nObs':regDict[key].nobs,'R^2 adj':regDict[key].rsquared_adj.round(3), 'cond. num':regDict[key].condition_number}
        
        # loop over variables of interest
        if pvalues:
            for var in key_vars:
                outDict[key][var] = "%.04f (%.03f)" %(regDict[key].params[var], regDict[key].pvalues[var])
        if pvalues == False:
            outDict[key] = regDict[key].params.round(3).astype(str) + pd.cut(regDict[key].pvalues,bins=[0,.01,.05,.1,1], labels=['***','**','*','']).astype(str)
            
        
    outdf =pd.DataFrame(outDict).transpose()
    outdf.index.name='specification'
    return(outdf)