knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
theme_set(theme_minimal(base_size = 14))
set.seed(1)
In this notebook we will do experiments to understand the limitations of using regression models to make conclusions about causality. We make an important assumption here that the way we generate data determines all causal relationships. Remember that this is an assumption we choose to adopt now, but that in general we do not always assume anything about causality in our simulations.
For our purposes here the assumption works like this: if we change one variable then we must also propagate this change to any of the variables that functionally depend on it. Two examples below will make this clearer.
X -> Y
This describes how the simulated world works.
# Direct causation: X -> Y
n <- 200
X <- rnorm(n, mean = 1)
Y_CEF <- function(X) 2 * X
Y <- Y_CEF(X) + rnorm(n)
What would we conclude from regression on data like this?
# or GGally::ggcoef() for a plot
lm(Y ~ X) |> broom::tidy() |> knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0645357 | 0.1074118 | 0.6008253 | 0.5486435 |
X | 1.9769222 | 0.0772917 | 25.5774173 | 0.0000000 |
X
Below are several examples of different kinds of interventions. We
can use these to ask various questions like, “What if the value of
X
had been different in this particular way? Would that
have changed other variables as well?”
#X_new <- rep(0, n) # "atomic" intervention, set all X to a constant, e.g. 0
X_new <- X + 1 # constant additive intervention
#X_new <- rnorm(n, mean = 2) # shift in mean
After intervening on a variable we have to re-generate all the other variables that depend on it using its new values.
Y_new <- Y_CEF(X_new) + rnorm(n)
# Average causal effect (ATE)
mean(Y_new - Y)
## [1] 1.917536
There is another possible treatment effect we may be interested in. Sometimes the magnitude/direction of the treatment effect depends in an important way on another variable, like one of the predictors or covariates. In cases like this we may want to know how the treatment effect varies as we look at different values of the covariate, as in the plot below.
# Conditional average treatment effect (CATE)
qplot(X, Y_new - Y, geom=c("smooth", "point")) +
geom_hline(yintercept = mean(Y_new - Y))
Exercise: Can you change Y_CEF
so that
the regression estimate is highly misleading?
Exercise: Can you make the ATE
and the
CATE
provide very different conclusions about the causal
effect of X
?
X <- U -> Y
In this world there is no causal relationship between X
and Y
n <- 1000
U <- rnorm(n)
X <- -2 * U + U^2 + rnorm(n)
Y_CEF <- function(U) 4 - U
Y <- Y_CEF(U) + rnorm(n)
But if we fit regression models we might believe there is a relationship:
lm(Y ~ X) |> broom::tidy() |> knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 3.8080350 | 0.0423095 | 90.00426 | 0 |
X | 0.2626668 | 0.0139554 | 18.82192 | 0 |
And someone looking at a plot might conclude there is a clear, strong relationship:
qplot(X, Y)
X
# Try changing this
X_new <- X + 1
In this model X <- U -> Y
there are no other
variables on directed pathways out of X
so nothing else is
changed.
U_new <- U
Y_new <- Y_CEF(U_new) + rnorm(n)
All causal effects on Y
are 0
# ATE
mean(Y_new - Y)
## [1] -0.1262453
# CATE
qplot(X, Y_new - Y, geom=c("point"))
X -> Y <- Z
and
Z -> X
Now there is another measured variable Z
which effects
both X
and Y
. We might be only interested (for
science or profit) in the causal relationship between X
and
Y
, or the “direct effect” between Z
and
Y
, or the “total effect” between Z
and
Y
. We’ll consider each of these below.
n <- 500
Z <- rnorm(n)
X <- 5 -2 * Z + rnorm(n)
Y_CEF <- function(X, Z) 4 + 2 * Z - 3 * X
Y <- Y_CEF(X, Z) + rnorm(n)
Regression model outputs. Which of the coefficients below (if any) give the right answer to which causal questions?
lm(Y ~ Z) |> broom::tidy() |> knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -10.955208 | 0.1385322 | -79.08060 | 0 |
Z | 7.792691 | 0.1404004 | 55.50335 | 0 |
lm(Y ~ X) |> broom::tidy() |> knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 7.901970 | 0.1441691 | 54.81045 | 0 |
X | -3.784157 | 0.0263265 | -143.73930 | 0 |
lm(Y ~ X + Z) |> broom::tidy() |> knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.192522 | 0.2327328 | 18.01432 | 0 |
X | -3.038416 | 0.0458332 | -66.29296 | 0 |
Z | 1.827038 | 0.1005229 | 18.17535 | 0 |
X
only# Original data, copied for re-running
Z <- rnorm(n)
X <- 5 -2 * Z + rnorm(n)
Y <- Y_CEF(X, Z) + rnorm(n)
# Counterfactuals
Z_new <- Z
X_new <- X + 1
Y_new <- Y_CEF(X_new, Z_new) + rnorm(n)
mean(Y_new - Y)
## [1] -3.029189
CATE at different values of X
# Conditional average treatment effect (CATE)
qplot(X, Y_new - Y, geom=c("smooth", "point")) +
geom_hline(yintercept = mean(Y_new - Y))
CATE at different values of Z
# Conditional average treatment effect (CATE)
qplot(Z, Y_new - Y, geom=c("smooth", "point")) +
geom_hline(yintercept = mean(Y_new - Y))
Z
but not X
(direct effect)Note that we are making an exception to our usual rule about
propagating changes in the model. In this case we hold X
fixed because we are interested in understanding only the “direct
effect” of Z
on Y
, and not any change that
would normally also go through X
.
# Original data, copied for re-running
Z <- rnorm(n)
X <- 5 -2 * Z + rnorm(n)
Y <- Y_CEF(X, Z) + rnorm(n)
# Counterfactuals
Z_new <- Z + 1
X_new <- X
Y_new <- Y_CEF(X_new, Z_new) + rnorm(n)
mean(Y_new - Y)
## [1] 1.968833
CATE at different values of X
# Conditional average treatment effect (CATE)
qplot(X, Y_new - Y, geom=c("smooth", "point")) +
geom_hline(yintercept = mean(Y_new - Y))
CATE at different values of Z
# Conditional average treatment effect (CATE)
qplot(Z, Y_new - Y, geom=c("smooth", "point")) +
geom_hline(yintercept = mean(Y_new - Y))
Z
(total effect)# Original data, copied for re-running
Z <- rnorm(n)
X <- 5 -2 * Z + rnorm(n)
Y <- Y_CEF(X, Z) + rnorm(n)
# Counterfactuals
Z_new <- Z + 1
X_new <- 5 -2 * Z_new + rnorm(n)
Y_new <- Y_CEF(X_new, Z_new) + rnorm(n)
mean(Y_new - Y)
## [1] 8.042156
CATE at different values of X
# Conditional average treatment effect (CATE)
qplot(X, Y_new - Y, geom=c("smooth", "point")) +
geom_hline(yintercept = mean(Y_new - Y))
CATE at different values of Z
# Conditional average treatment effect (CATE)
qplot(Z, Y_new - Y, geom=c("smooth", "point")) +
geom_hline(yintercept = mean(Y_new - Y))
Note: In this example the generating code is copy/pasted in 3 places, so any change you make you need to make the same change in all 3 code chunks. You might want to copy/paste the entire section and only edit the new copy of it.
Try changing the functions/distributions generating each variable (but without altering the structure of the underlying DAG).
Try to understand when regression coefficients give accurate answers, which causal effects they may be measuring, and when they do not give accurate answers.