Regression Tables for Lazy People

In the spirit of writing as note-taking, I wanted to share a neat little trick in R for running regressions. By far the most common table in a political science paper is a regression table. Often, researchers run multiple regression specifications and then present them in a single table. Each regression may have a different set of variables, or one specification will include an interaction effect. This can mean a lot of typing, which violates the DRY (Don't Repeat Yourself) principle of coding.

I came across a nice use of the accumulate() function from the purrr package that both speeds up this task and makes it programmatic for easy replication. Lazy win!

The basic concept behind accumulate() is to apply a two-argument function repeatedly over a vector from left to right, keeping every intermediate result. If you want to work from the right instead, use accumulate_right(). Given the name, the most obvious use of this function is for cumulative sums, but we can take advantage of R's ability to build model formulas from character strings to run different specifications.
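As a quick illustration (a minimal sketch, not from the original post), here is accumulate() doing its canonical cumulative-sum job, and the same idea applied to strings:

library(purrr)

# Each element is the running total so far
accumulate(1:5, `+`)
## [1]  1  3  6 10 15

# The same idea with strings: each step glues on one more term
accumulate(c("a", "b", "c"), paste, sep = " + ")
## [1] "a"         "a + b"     "a + b + c"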

In order to replicate this post on your own machine, you will need the following packages.

install.packages(c("AER", "tidyverse", "estimatr", "texreg"))

library(tidyverse)
library(estimatr)
library(texreg)
library(AER)

A fake data example

In the first demonstration, I create a simple dataset with three predictors, two potential outcomes, and the observed outcome based on a treatment variable.

By construction, the treatment effect is 1 and the data generating process includes multiple pre-treatment covariates and an interaction. The dataset also includes two useless variables. In experimental or administrative datasets, there are often a large number of covariates that are unused in regression specifications.

set.seed(42)
N <- 1000
dat <- tibble(
  x1 = rnorm(N),
  x2 = rnorm(N, 1, 10),
  x3 = rnorm(N, 10, 3),
  z = sample(c(rep(1, N / 2), rep(0, N / 2)), N, replace = FALSE),
  y0 = x1 + x2 + x3 + x2 * x3 + runif(N),
  y1 = y0 + 1,
  yobs = ifelse(z, y1, y0),
  useless1 = runif(N),
  useless2 = runif(N)
)

Now, we can pass our columns of interest and let accumulate do the work.

cols_to_use <- c("z", "x1", "x2", "x3", "x2*x3")
predictors <- accumulate(cols_to_use, function(a, b) paste(a, b, sep = " + "))
formulas <- paste("yobs~", predictors)

print(formulas)
## [1] "yobs~ z"                        "yobs~ z + x1"                  
## [3] "yobs~ z + x1 + x2"              "yobs~ z + x1 + x2 + x3"        
## [5] "yobs~ z + x1 + x2 + x3 + x2*x3"
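If you only wanted the final, full specification rather than the whole sequence, purrr's reduce() applies the same pairwise function but keeps only the last value:

library(purrr)

# reduce() returns just the final accumulated result
reduce(c("z", "x1", "x2", "x3", "x2*x3"), paste, sep = " + ")
## [1] "z + x1 + x2 + x3 + x2*x3"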

Neat! We now get our formula specifications in the right order for our table. Using lm_robust() from the estimatr package, we can estimate each regression. Unsurprisingly, the correctly specified model with the interaction perfectly recovers the treatment effect of 1.

# Functional programming with purrr using map()
# lm_robust() requires that we coerce our character strings to formula objects
formulas %>%
  map(~ lm_robust(as.formula(.x), data = dat, se_type = "stata")) %>%
  htmlreg(include.ci = FALSE)
Statistical models

              Model 1    Model 2    Model 3    Model 4    Model 5
(Intercept)    17.34***   17.39***    9.54***   -5.90       0.46***
               (4.81)     (4.82)     (1.21)     (6.12)     (0.03)
z               6.94       6.91       1.62       0.91       0.97***
               (7.18)     (7.19)     (1.99)     (1.99)     (0.02)
x1                         1.63       0.57       0.67       0.99***
                          (3.66)     (0.97)     (0.97)     (0.01)
x2                                   11.06***   11.07***    1.00***
                                     (0.17)     (0.17)     (0.00)
x3                                               1.58**     1.00***
                                                (0.60)     (0.00)
x2:x3                                                       1.00***
                                                           (0.00)
R2              0.00       0.00       0.92       0.93       1.00
Adj. R2        -0.00      -0.00       0.92       0.93       1.00
Num. obs.    1000       1000       1000       1000       1000
RMSE          113.49     113.53      31.38      31.02       0.29
***p < 0.001; **p < 0.01; *p < 0.05

An Application with Real Data

While none of the mechanics change from simulated data to real data (that's the point!), I find it sometimes helpful to show a technique against actual data. To do so, let's use data on California schools available in the AER package. For this example, we are going to do some basic cleaning of the dataset to get variables for the student-teacher ratio (STR) and the average test score (score). In addition, we will create two binary variables for a high student-teacher ratio (HiSTR) and a high percentage of English learners in the school (HiEL).

data("CASchools")

# Make use of dplyr's helpful data munging operations 
CASchools <- CASchools %>%
  mutate(STR = students / teachers,
         score = (read + math) / 2,
         HiSTR = as.numeric(STR >= 20),
         HiEL = as.numeric(english >= 10))

# Exactly as before, but using the variables in the dataset
cols_to_use <- c("HiSTR", "HiEL", "HiSTR*HiEL")
predictors <- accumulate(cols_to_use, function(a, b) paste(a, b, sep = " + "))

formulas <- paste("score~", predictors)
formulas %>%
  map(~ lm_robust(as.formula(.x), data = CASchools, se_type = "stata")) %>%
  htmlreg(include.ci = FALSE)
Statistical models

              Model 1     Model 2     Model 3
(Intercept)  657.25***   664.69***   664.14***
              (1.25)      (1.25)      (1.39)
HiSTR         -7.17***    -3.48*      -1.91
              (1.83)      (1.55)      (1.93)
HiEL                     -19.76***   -18.32***
                          (1.59)      (2.33)
HiSTR:HiEL                            -3.26
                                      (3.12)
R2             0.03        0.29        0.29
Adj. R2        0.03        0.29        0.29
Num. obs.    420         420         420
RMSE          18.74       16.06       16.06
***p < 0.001; **p < 0.01; *p < 0.05

As we can see, functional programming principles can make our code easier to read and reduce the amount of typing. The less we type, the less likely we are to accidentally introduce a mistake into our work. It also lets us be lazy, and being lazy is wonderful.

Alex Stephenson
PhD Student
