Tools for Estimating and Reporting Multiple Specifications
Introduction
You just estimated a regression, got results, and now either your model or your colleague suggests you change or elaborate on the specification. Repeat that a few times and reporting results quickly becomes a headache. It simply is not practical or palatable to report every single full regression table. You begin to sweat at the brow in yearning for some relief.
Smoke, mirrors, and a caffeinated and bespectacled mouse drops into your room. The mouse exclaims, “I have the aid you seek, you sweaty beast!”
Rarely will I offer kind words about STATA (or SAS) but there is one rather nifty functionality it offers that R, Python, and a few other open source tools still seem to lack: ability to run, organize, and report estimates from multiple regressions in a relatively high level yet flexible framework. By no means do I try to recreate STATA’s framework but I do exert effort to lighten your burden in case you have similar needs.
In this notebook, I mostly present the use of the functions but if you are curious about the source code, it’s available on GitHub
A non-confidential example
We’ll use the 2015 American Housing Survey subset of Raleigh, NC observations ( sample was uploaded to GitHub ) Suppose we are curious whether housing units deemed severely inadequate by the US Housing Authority fetch for lower rent. I estimate two specifications:
-
with structural characteristics
-
with structural and occupant characteristics
Specifically, for structural variables I include STORIES, BEDROOMS, and GARAGE while for occupant characteristics I include lengthTenure, NUMADULTS, HHAGE, and HHSEX. The variable of interest is severelyInaquate.
Suppose you have already loaded the functions and modules below then, we simply do the below
# independent variables
structural = ['BEDROOMS','GARAGE','STORIES','severelyInadequate']
occupant = ['HHAGE','NUMADULTS','lengthTenure']
# make two specifications in Patsy format
xList = ['+'.join(occupant)+"+"+"+".join(structural),
'+'.join(structural)]
# estimate the regressions and get output
runRegressions('RENT',xList,key_vars=['severelyInadequate'], data = df,
specification_names = ['demographics and structure','structure'])
R^2 adj | cond. num | nObs | severelyInadequate | |
---|---|---|---|---|
specification | ||||
demographics and structure | 0.149 | 365.514 | 711 | -95.0526 (0.602) |
structure | 0.123 | 27.6683 | 711 | -86.4810 (0.656) |
Note that in the above case, I specified a name for each model specification but I don’t have to:
# estimate the regressions and get output
runRegressions('RENT',xList,key_vars=['severelyInadequate'], data = df)
R^2 adj | cond. num | nObs | severelyInadequate | |
---|---|---|---|---|
specification | ||||
0 | 0.149 | 365.514 | 711 | -95.0526 (0.602) |
1 | 0.123 | 27.6683 | 711 | -86.4810 (0.656) |
help(runRegressions)
Help on function runRegressions in module __main__:
runRegressions(y, xList, key_vars, data, specification_names=None)
y - dependent variable (string)
xList - list of explanatory variable strings (list of Patsy compatible formulas)
key_vars - list of independent variables you wish to see results for
df - a Pandas DataFrame that contains the data (list)
specification_names - list of regression names (list of strings)
Description: runRegressions estimates multiple regression specifications
and returns coefficent estimate for key_vars, p-value, and other stats for each
specification in a single table.
Full Table Yikes
Here is what a full statsmodels regression summary table looks like. Imagine asking yourself or your readers to read through many of these - especially if you are only interested in a small subset of the variables reported below. (scream face)
Dep. Variable: | RENT | R-squared: | 0.158 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.149 |
Method: | Least Squares | F-statistic: | 12.01 |
Date: | Wed, 10 Jul 2019 | Prob (F-statistic): | 1.61e-14 |
Time: | 09:31:52 | Log-Likelihood: | -5442.8 |
No. Observations: | 711 | AIC: | 1.090e+04 |
Df Residuals: | 703 | BIC: | 1.094e+04 |
Df Model: | 7 | ||
Covariance Type: | HC0 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -1.6921 | 129.803 | -0.013 | 0.990 | -256.102 | 252.718 |
GARAGE[T.Yes] | 94.5023 | 53.770 | 1.758 | 0.079 | -10.884 | 199.889 |
HHAGE | 4.7165 | 2.014 | 2.342 | 0.019 | 0.770 | 8.663 |
NUMADULTS | 91.5411 | 31.927 | 2.867 | 0.004 | 28.964 | 154.118 |
lengthTenure | -13.3735 | 6.220 | -2.150 | 0.032 | -25.564 | -1.183 |
BEDROOMS | 134.4728 | 32.704 | 4.112 | 0.000 | 70.375 | 198.571 |
STORIES | 139.3267 | 26.671 | 5.224 | 0.000 | 87.052 | 191.601 |
severelyInadequate | -95.0526 | 182.032 | -0.522 | 0.602 | -451.830 | 261.724 |
Omnibus: | 414.521 | Durbin-Watson: | 1.930 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 4054.431 |
Skew: | 2.477 | Prob(JB): | 0.00 |
Kurtosis: | 13.598 | Cond. No. | 366. |
Warnings:
[1] Standard Errors are heteroscedasticity robust (HC0)
The Code
Loading the data
# load sample data (2015 AHS subsample for Raleigh, NC renters) from github
data_url = 'https://raw.githubusercontent.com/kiwiPhrases/multiple-regressions/master/AHS_2015_National_Metropolitan_renters_raleigh.csv'
df = pd.read_csv(data_url)
# drop columns that are all null
allNulls = df.columns[df.isnull().all()]
df.drop(allNulls, axis=1, inplace=True)
Functions
This isn’t a package yet so we can’t just import it like a regular module (yet)
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
def runRegression(y,x,data, cov_type='HC0'):
print("Covariance type: %s" %cov_type )
form = '{0} ~ {1}'.format(y,x)
mod = smf.ols(formula=form, data=data)
res = mod.fit(cov_type=cov_type)
return(res)
def runQuantRegression(y,x,data,quant=.5):
form = '{0} ~ {1}'.format(y,x)
mod = smf.quantreg(formula=form, data=data)
res = mod.fit(quant=.5)
return(res)
def runRegressions(y, xList, key_vars, data, specification_names=None, cov_type='HC0', pvalues=True):
"""
y - dependent variable (string)
xList - list of explanatory variable strings (list of Patsy compatible formulas)
key_vars - list of independent variables you wish to see results for
df - a Pandas DataFrame that contains the data (list)
specification_names - list of regression names (list of strings)
Description: runRegressions estimates multiple regression specifications
and returns coefficent and p-values estimates for key_vars and regressions
stats for each specification in a single table.
"""
# if no custom names are given, use numbers
if specification_names is None:
specification_names = [str(i) for i in range(len(xList))]
if len(specification_names) != len(xList):
print('specification names not the same length as independent variables list')
return(None)
# run regressions
regDict = {}
for x, name in zip(xList, specification_names):
print("Estimating specification: %s" %name)
regDict[name] = runRegression(y,x,data, cov_type='HC0')
print("Estimation done")
# extract estimates:
print("Extracting results")
outDict = {}
for key in regDict.keys():
# get regression statistics
outDict[key] = {'nObs':regDict[key].nobs,'R^2 adj':regDict[key].rsquared_adj.round(3), 'cond. num':regDict[key].condition_number}
# loop over variables of interest
if pvalues:
for var in key_vars:
outDict[key][var] = "%.04f (%.03f)" %(regDict[key].params[var], regDict[key].pvalues[var])
if pvalues == False:
outDict[key] = regDict[key].params.round(3).astype(str) + pd.cut(regDict[key].pvalues,bins=[0,.01,.05,.1,1], labels=['***','**','*','']).astype(str)
outdf =pd.DataFrame(outDict).transpose()
outdf.index.name='specification'
return(outdf)