Introduction

You just estimated a regression, got results, and now either your model or your colleague suggests you change or elaborate on the specification. Repeat that a few times and reporting results quickly becomes a headache. It simply is not practical or palatable to report every single full regression table. You begin to sweat at the brow in yearning for some relief.

Smoke, mirrors, and a caffeinated and bespectacled mouse drops into your room. The mouse exclaims, “I have the aid you seek, you sweaty beast!”

Rarely will I offer kind words about STATA (or SAS) but there is one rather nifty functionality it offers that R, Python, and a few other open source tools still seem to lack: ability to run, organize, and report estimates from multiple regressions in a relatively high level yet flexible framework. By no means do I try to recreate STATA’s framework but I do exert effort to lighten your burden in case you have similar needs.

In this notebook, I mostly present the use of the functions but if you are curious about the source code, it’s available on GitHub

A non-confidential example

We’ll use the 2015 American Housing Survey subset of Raleigh, NC observations ( sample was uploaded to GitHub ) Suppose we are curious whether housing units deemed severely inadequate by the US Housing Authority fetch for lower rent. I estimate two specifications:

with structural characteristics
with structural and occupant characteristics

Specifically, for structural variables I include STORIES, BEDROOMS, and GARAGE while for occupant characteristics I include lengthTenure, NUMADULTS, HHAGE, and HHSEX. The variable of interest is severelyInaquate.

Suppose you have already loaded the functions and modules below then, we simply do the below

# independent variables
structural = ['BEDROOMS','GARAGE','STORIES','severelyInadequate']
occupant = ['HHAGE','NUMADULTS','lengthTenure']

# make two specifications in Patsy format
xList = ['+'.join(occupant)+"+"+"+".join(structural),
         '+'.join(structural)]

# estimate the regressions and get output
runRegressions('RENT',xList,key_vars=['severelyInadequate'], data = df,
                            specification_names = ['demographics and structure','structure'])

	R^2 adj	cond. num	nObs	severelyInadequate
specification
demographics and structure	0.149	365.514	711	-95.0526 (0.602)
structure	0.123	27.6683	711	-86.4810 (0.656)

Note that in the above case, I specified a name for each model specification but I don’t have to:

# estimate the regressions and get output
runRegressions('RENT',xList,key_vars=['severelyInadequate'], data = df)

	R^2 adj	cond. num	nObs	severelyInadequate
specification
0	0.149	365.514	711	-95.0526 (0.602)
1	0.123	27.6683	711	-86.4810 (0.656)

help(runRegressions)

Help on function runRegressions in module __main__:

runRegressions(y, xList, key_vars, data, specification_names=None)
    y - dependent variable (string)
    xList - list of explanatory variable strings (list of Patsy compatible formulas)
    key_vars - list of independent variables you wish to see results for
    df - a Pandas DataFrame that contains the data (list)
    specification_names - list of regression names (list of strings)
    
    Description: runRegressions estimates multiple regression specifications
    and returns coefficent estimate for key_vars, p-value, and other stats for each
    specification in a single table.

Full Table Yikes

Here is what a full statsmodels regression summary table looks like. Imagine asking yourself or your readers to read through many of these - especially if you are only interested in a small subset of the variables reported below. (scream face)

OLS Regression Results
Dep. Variable:	RENT	R-squared:	0.158
Model:	OLS	Adj. R-squared:	0.149
Method:	Least Squares	F-statistic:	12.01
Date:	Wed, 10 Jul 2019	Prob (F-statistic):	1.61e-14
Time:	09:31:52	Log-Likelihood:	-5442.8
No. Observations:	711	AIC:	1.090e+04
Df Residuals:	703	BIC:	1.094e+04
Df Model:	7
Covariance Type:	HC0

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-1.6921	129.803	-0.013	0.990	-256.102	252.718
GARAGE[T.Yes]	94.5023	53.770	1.758	0.079	-10.884	199.889
HHAGE	4.7165	2.014	2.342	0.019	0.770	8.663
NUMADULTS	91.5411	31.927	2.867	0.004	28.964	154.118
lengthTenure	-13.3735	6.220	-2.150	0.032	-25.564	-1.183
BEDROOMS	134.4728	32.704	4.112	0.000	70.375	198.571
STORIES	139.3267	26.671	5.224	0.000	87.052	191.601
severelyInadequate	-95.0526	182.032	-0.522	0.602	-451.830	261.724

Omnibus:	414.521	Durbin-Watson:	1.930
Prob(Omnibus):	0.000	Jarque-Bera (JB):	4054.431
Skew:	2.477	Prob(JB):	0.00
Kurtosis:	13.598	Cond. No.	366.

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC0)

The Code

Loading the data

#  load sample data (2015 AHS subsample for Raleigh, NC renters) from github
data_url = 'https://raw.githubusercontent.com/kiwiPhrases/multiple-regressions/master/AHS_2015_National_Metropolitan_renters_raleigh.csv'
df = pd.read_csv(data_url)

# drop columns that are all null
allNulls = df.columns[df.isnull().all()]
df.drop(allNulls, axis=1, inplace=True)

Functions

This isn’t a package yet so we can’t just import it like a regular module (yet)

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def runRegression(y,x,data, cov_type='HC0'):
    print("Covariance type: %s" %cov_type )
    form = '{0} ~ {1}'.format(y,x)
    mod = smf.ols(formula=form, data=data)
    res = mod.fit(cov_type=cov_type)
    return(res)

def runQuantRegression(y,x,data,quant=.5):
    form = '{0} ~ {1}'.format(y,x)
    mod = smf.quantreg(formula=form, data=data)
    res = mod.fit(quant=.5)
    return(res)

def runRegressions(y, xList, key_vars, data, specification_names=None, cov_type='HC0', pvalues=True):
    """
    y - dependent variable (string)
    xList - list of explanatory variable strings (list of Patsy compatible formulas)
    key_vars - list of independent variables you wish to see results for
    df - a Pandas DataFrame that contains the data (list)
    specification_names - list of regression names (list of strings)
    
    Description: runRegressions estimates multiple regression specifications
    and returns coefficent and p-values estimates for key_vars and regressions
    stats for each specification in a single table. 
    """
    # if no custom names are given, use numbers
    if specification_names is None:
        specification_names = [str(i) for i in range(len(xList))]
    if len(specification_names) != len(xList):
        print('specification names not the same length as independent variables list')
        return(None)
        
    # run regressions    
    regDict = {}
    for x, name in zip(xList, specification_names):
        print("Estimating specification: %s" %name)
        regDict[name] = runRegression(y,x,data, cov_type='HC0')
    print("Estimation done")   
    
    # extract estimates:
    print("Extracting results")
    outDict = {}    
    for key in regDict.keys():
        # get regression statistics
        outDict[key] = {'nObs':regDict[key].nobs,'R^2 adj':regDict[key].rsquared_adj.round(3), 'cond. num':regDict[key].condition_number}
        
        # loop over variables of interest
        if pvalues:
            for var in key_vars:
                outDict[key][var] = "%.04f (%.03f)" %(regDict[key].params[var], regDict[key].pvalues[var])
        if pvalues == False:
            outDict[key] = regDict[key].params.round(3).astype(str) + pd.cut(regDict[key].pvalues,bins=[0,.01,.05,.1,1], labels=['***','**','*','']).astype(str)
            
        
    outdf =pd.DataFrame(outDict).transpose()
    outdf.index.name='specification'
    return(outdf)