knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
theme_set(theme_minimal(base_size = 14))
set.seed(1)

Regression models with simulated data

In this notebook we run experiments to understand the limitations of using regression models to draw conclusions about causality. We make an important assumption here: the way we generate the data determines all causal relationships. Remember that this is an assumption we choose to adopt for now; in general our simulations need not assume anything about causality.

For our purposes here the assumption works like this: if we change one variable, then we must also propagate this change to any variables that functionally depend on it. The two examples below will make this clearer.

Direct causation X -> Y

Causal model

This describes how the simulated world works.

# Direct causation: X -> Y
n <- 200
X <- rnorm(n, mean = 1)
Y_CEF <- function(X) 2 * X
Y <- Y_CEF(X) + rnorm(n)

What would we conclude from regression on data like this?

# or GGally::ggcoef() for a plot
lm(Y ~ X) |> broom::tidy() |> knitr::kable()
term         estimate   std.error  statistic   p.value
(Intercept)  0.0645357  0.1074118   0.6008253  0.5486435
X            1.9769222  0.0772917  25.5774173  0.0000000
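As a quick sanity check (a sketch, not part of the original output), with a much larger sample the fitted slope should land very close to the true coefficient of 2 baked into Y_CEF:

```r
# Sketch: regenerate the X -> Y world with a large n and confirm
# the OLS slope recovers the true coefficient of 2
# (n_big is an illustrative choice, not from the original notebook)
n_big <- 1e5
X_big <- rnorm(n_big, mean = 1)
Y_big <- 2 * X_big + rnorm(n_big)
coef(lm(Y_big ~ X_big))[["X_big"]]  # close to 2
```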

Counterfactual values for X

Below are several examples of different kinds of interventions. We can use these to ask various questions like, “What if the value of X had been different in this particular way? Would that have changed other variables as well?”

#X_new <- rep(0, n) # "atomic" intervention, set all X to a constant, e.g. 0
X_new <- X + 1 # constant additive intervention
#X_new <- rnorm(n, mean = 2) # shift in mean

Propagating the intervention

After intervening on a variable we have to re-generate all of the variables that depend on it, using its new values.

Y_new <- Y_CEF(X_new) + rnorm(n)

Measuring the causal effect

# Average treatment effect (ATE)
mean(Y_new - Y)
## [1] 1.917536

There is another treatment effect we may be interested in. Sometimes the magnitude or direction of the treatment effect depends in an important way on another variable, such as one of the predictors or covariates. In cases like this we may want to know how the treatment effect varies across values of that covariate, as in the plot below.

# Conditional average treatment effect (CATE)
qplot(X, Y_new - Y, geom=c("smooth", "point")) +
  geom_hline(yintercept = mean(Y_new - Y))

Exercise: Can you change Y_CEF so that the regression estimate is highly misleading?

Exercise: Can you make the ATE and the CATE provide very different conclusions about the causal effect of X?
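One possible direction for the first exercise (a sketch; many answers work): make the CEF nonlinear so that the fitted line badly misrepresents the dependence.

```r
# Sketch: one possible answer (of many). With X centered at 0 and an
# even CEF, Y depends strongly on X but the fitted slope is near 0.
set.seed(1)
n <- 200
X <- rnorm(n)             # note: mean 0 here, unlike the chunk above
Y_CEF <- function(X) X^2  # even function: no linear component
Y <- Y_CEF(X) + rnorm(n)
coef(lm(Y ~ X))[["X"]]    # near 0 despite a strong relationship
```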

Unmeasured confounder X <- U -> Y

Causal model

In this world there is no causal relationship between X and Y:

n <- 1000
U <- rnorm(n)
X <- -2 * U + U^2 + rnorm(n)
Y_CEF <- function(U) 4 - U
Y <- Y_CEF(U) + rnorm(n)

But if we fit regression models we might believe there is a relationship:

lm(Y ~ X) |> broom::tidy() |> knitr::kable()
term         estimate   std.error  statistic  p.value
(Intercept)  3.8080350  0.0423095   90.00426  0
X            0.2626668  0.0139554   18.82192  0

And someone looking at a plot might conclude there is a clear, strong relationship:

qplot(X, Y)
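If the confounder U were observed, including it in the regression would remove the spurious association. A sketch (regenerating the same world so the chunk is self-contained):

```r
# Sketch: adjusting for the confounder U. Once U is in the model,
# X's coefficient is near 0, matching the true (absent) causal effect.
n <- 1000
U <- rnorm(n)
X <- -2 * U + U^2 + rnorm(n)
Y <- (4 - U) + rnorm(n)
coef(lm(Y ~ X + U))  # X near 0, U near -1
```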

Intervening on X

# Try changing this
X_new <- X + 1

Propagating the intervention

In this model X <- U -> Y there are no other variables on directed paths out of X, so nothing else changes.

U_new <- U
Y_new <- Y_CEF(U_new) + rnorm(n) 

Measuring the causal effect

All causal effects of X on Y are 0.

# ATE
mean(Y_new - Y)
## [1] -0.1262453
# CATE
qplot(X, Y_new - Y, geom=c("point")) 
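The ATE printed above is not exactly 0 only because Y and Y_new use independent noise draws; the difference of means has standard error sqrt(2/n). A sketch with a much larger n (an illustrative value) shows the estimate tightening around the true value of 0:

```r
# Sketch: the nonzero 'ATE' above is Monte Carlo noise with
# sd = sqrt(2/n); a larger sample shrinks it toward the true value of 0
n_big <- 1e6
U_big <- rnorm(n_big)
Y_big     <- (4 - U_big) + rnorm(n_big)
Y_new_big <- (4 - U_big) + rnorm(n_big)  # U unchanged by intervening on X
mean(Y_new_big - Y_big)  # close to 0
```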

Confounded relationship X -> Y <- Z and Z -> X

Causal model

Now there is another measured variable Z which affects both X and Y. We might only be interested (for science or profit) in the causal relationship between X and Y, or in the “direct effect” of Z on Y, or in the “total effect” of Z on Y. We’ll consider each of these below.

n <- 500
Z <- rnorm(n)
X <- 5 - 2 * Z + rnorm(n)
Y_CEF <- function(X, Z) 4 + 2 * Z - 3 * X
Y <- Y_CEF(X, Z) + rnorm(n)

Regression model outputs: which of the coefficients below (if any) correctly answers which of these causal questions?

lm(Y ~ Z) |> broom::tidy() |> knitr::kable()
term           estimate  std.error  statistic  p.value
(Intercept)  -10.955208  0.1385322  -79.08060  0
Z              7.792691  0.1404004   55.50335  0
lm(Y ~ X) |> broom::tidy() |> knitr::kable()
term          estimate  std.error   statistic  p.value
(Intercept)   7.901970  0.1441691    54.81045  0
X            -3.784157  0.0263265  -143.73930  0
lm(Y ~ X + Z) |> broom::tidy() |> knitr::kable()
term          estimate  std.error  statistic  p.value
(Intercept)   4.192522  0.2327328   18.01432  0
X            -3.038416  0.0458332  -66.29296  0
Z             1.827038  0.1005229   18.17535  0
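To read these tables against the generating model (a sketch of the arithmetic, using the coefficients from the chunks above): the causal effect of X on Y is -3, the direct effect of Z on Y is 2, and the total effect of Z is the direct effect plus the effect flowing through X, 2 + (-2)(-3) = 8.

```r
# Sketch: true effects implied by Y_CEF and the X equation
effect_X <- -3                          # X's coefficient in Y_CEF
direct_Z <- 2                           # Z's coefficient in Y_CEF
total_Z  <- direct_Z + (-2) * effect_X  # the Z -> X slope is -2
c(effect_X = effect_X, direct_Z = direct_Z, total_Z = total_Z)
# effect_X = -3, direct_Z = 2, total_Z = 8
```

Matching these to the tables: lm(Y ~ X + Z) estimates both the causal effect of X (about -3) and the direct effect of Z (about 2); lm(Y ~ Z) estimates the total effect of Z (about 8); and the X coefficient in lm(Y ~ X) answers none of the three questions, because Z confounds it.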

Changing X only

# Original data, copied for re-running
Z <- rnorm(n)
X <- 5 - 2 * Z + rnorm(n)
Y <- Y_CEF(X, Z) + rnorm(n)
# Counterfactuals
Z_new <- Z
X_new <- X + 1
Y_new <- Y_CEF(X_new, Z_new) + rnorm(n)

Measuring the causal effect

mean(Y_new - Y)
## [1] -3.029189

CATE at different values of X

# Conditional average treatment effect (CATE)
qplot(X, Y_new - Y, geom=c("smooth", "point")) +
  geom_hline(yintercept = mean(Y_new - Y))

CATE at different values of Z

# Conditional average treatment effect (CATE)
qplot(Z, Y_new - Y, geom=c("smooth", "point")) +
  geom_hline(yintercept = mean(Y_new - Y))

Changing Z but not X (direct effect)

Note that we are making an exception to our usual rule about propagating changes through the model. In this case we hold X fixed because we are interested only in the “direct effect” of Z on Y, and not in any change that would normally also flow through X.

# Original data, copied for re-running
Z <- rnorm(n)
X <- 5 - 2 * Z + rnorm(n)
Y <- Y_CEF(X, Z) + rnorm(n)
# Counterfactuals
Z_new <- Z + 1
X_new <- X
Y_new <- Y_CEF(X_new, Z_new) + rnorm(n)

Measuring the causal effect

mean(Y_new - Y)
## [1] 1.968833

CATE at different values of X

# Conditional average treatment effect (CATE)
qplot(X, Y_new - Y, geom=c("smooth", "point")) +
  geom_hline(yintercept = mean(Y_new - Y))

CATE at different values of Z

# Conditional average treatment effect (CATE)
qplot(Z, Y_new - Y, geom=c("smooth", "point")) +
  geom_hline(yintercept = mean(Y_new - Y))

Changing Z (total effect)

# Original data, copied for re-running
Z <- rnorm(n)
X <- 5 - 2 * Z + rnorm(n)
Y <- Y_CEF(X, Z) + rnorm(n)
# Counterfactuals
Z_new <- Z + 1
X_new <- 5 - 2 * Z_new + rnorm(n)
Y_new <- Y_CEF(X_new, Z_new) + rnorm(n)

Measuring the causal effect

mean(Y_new - Y)
## [1] 8.042156

CATE at different values of X

# Conditional average treatment effect (CATE)
qplot(X, Y_new - Y, geom=c("smooth", "point")) +
  geom_hline(yintercept = mean(Y_new - Y))

CATE at different values of Z

# Conditional average treatment effect (CATE)
qplot(Z, Y_new - Y, geom=c("smooth", "point")) +
  geom_hline(yintercept = mean(Y_new - Y))

Experiment suggestions

Note: In this example the generating code is copy/pasted in three places, so any change you make must be repeated in all three code chunks. You may want to copy/paste the entire section and edit only the new copy.

  • Try changing the functions/distributions generating each variable (but without altering the structure of the underlying DAG).

  • Try to understand when regression coefficients give accurate answers to causal questions, which causal effects (if any) they measure, and when they are misleading.