knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
theme_set(theme_minimal(base_size = 14))
set.seed(1)
In this analysis you are encouraged to use a dataset that you found
interesting. Your dataset should have a numeric outcome variable and
multiple predictor variables, and at least one of the predictors should
also be numeric. Your dataset should not have any special structure that
would clearly violate independence assumptions, e.g. observations with
possible spatial or temporal correlation. We previously used the gapminder dataset, but with a filter to take a subset of only one year of data for this reason (so the same country would not be repeated multiple times in the plot or model).

Note: we also cannot deal with missing values right now, so you may use drop_na().
# Load dataset and pre-process if necessary
library(palmerpenguins)
penguins |>
drop_na() -> my_data # right assignment: valid but unconventional style, but fun!
Choose one predictor variable and fit a simple linear model.
model_simple <- lm(body_mass_g ~ bill_length_mm, data = my_data)
Use each of the functions below on your model object and try to understand the output:

- summary
- predict
- residuals
- coef
- confint
- plot

Note that many of these are generic functions that work on many different types of objects, e.g. models fit by other methods. To learn more about each function f above, type ?f in the console; to see more specifically how the function works on linear model objects, type ?f.lm, e.g. ?plot.lm.
summary(model_simple)
##
## Call:
## lm(formula = body_mass_g ~ bill_length_mm, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1759.38 -468.82 27.79 464.20 1641.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 388.845 289.817 1.342 0.181
## bill_length_mm 86.792 6.538 13.276 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 651.4 on 331 degrees of freedom
## Multiple R-squared: 0.3475, Adjusted R-squared: 0.3455
## F-statistic: 176.2 on 1 and 331 DF, p-value: < 2.2e-16
# predict(model_simple)
# residuals(model_simple)
# coef(model_simple)
# confint(model_simple)
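# Hedged aside (the bill length value below is made up for illustration):
# predict() also accepts new data and can return a confidence interval.
predict(model_simple,
        newdata = data.frame(bill_length_mm = 45),
        interval = "confidence")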
plot(model_simple)
Question: How do we interpret diagnostic plots?
Answer: Generally we look for patterns in the residuals, approximate normality of their distribution (points near the diagonal line in the QQ plot), constant variance (homoscedasticity), and influential points (high-leverage outliers).
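A hedged sketch of what these plots are built from: the first panel of plot(model_simple) can be approximately recreated by hand from residuals() and fitted().

my_data |>
  mutate(.fitted = fitted(model_simple),   # fitted values from the model
         .resid = residuals(model_simple)) |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(se = FALSE)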
my_data |>
ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
geom_point() +
geom_smooth(method = "lm")
Now repeat the above steps but with a model using more than one predictor.
model_multiple <- lm(
body_mass_g ~ bill_length_mm + bill_depth_mm + sex,
data = my_data)
Try using each of the previous functions on your model object. Notice what’s different about the output.
summary(model_multiple)
##
## Call:
## lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + sex,
## data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1342.91 -296.54 -5.81 313.79 1228.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6551.44 394.41 16.611 < 2e-16 ***
## bill_length_mm 36.52 5.30 6.891 2.84e-11 ***
## bill_depth_mm -257.31 14.89 -17.278 < 2e-16 ***
## sexmale 923.30 60.72 15.205 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 452.4 on 329 degrees of freedom
## Multiple R-squared: 0.6872, Adjusted R-squared: 0.6844
## F-statistic: 241 on 3 and 329 DF, p-value: < 2.2e-16
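As before, the other helper functions (predict, residuals, coef, confint, plot) also work on this model; a brief sketch, left commented to keep the output short:

# coef(model_multiple)
# confint(model_multiple)
# plot(model_multiple)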
The broom and GGally packages have some functions that are useful when used on lm objects (and also on some other types of models).
library(broom)
library(GGally)
Use the tidy and glance functions from the broom package. Compare to the output from summary. Check out the documentation ?tidy.lm.

For nice document formatting you can also try piping (|>) the output from tidy() into the knitr::kable() function.
tidy(model_multiple)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6551. 394. 16.6 2.00e-45
## 2 bill_length_mm 36.5 5.30 6.89 2.84e-11
## 3 bill_depth_mm -257. 14.9 -17.3 4.67e-48
## 4 sexmale 923. 60.7 15.2 6.46e-40
# options(digits = 2)
tidy(model_multiple, conf.int = TRUE) |>
knitr::kable(digits = 2)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 6551.44 | 394.41 | 16.61 | 0 | 5775.56 | 7327.32 |
| bill_length_mm | 36.52 | 5.30 | 6.89 | 0 | 26.09 | 46.95 |
| bill_depth_mm | -257.31 | 14.89 | -17.28 | 0 | -286.61 | -228.02 |
| sexmale | 923.31 | 60.72 | 15.21 | 0 | 803.85 | 1042.76 |
glance(model_multiple)
## # A tibble: 1 × 12
## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.687 0.684 452. 241. 1.10e-82 3 -2507. 5023. 5042. 6.73e7
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
Now use the ggcoef and ggnostic functions from GGally.
ggcoef(model_multiple)
ggnostic(model_multiple)
Note: there is also a function ggpairs that is useful for exploratory data analysis, but beware that it can be slow because it computes many pairwise plots, so you may not want to rerun that code many times. The knitr chunk option cache=TRUE saves the result after a chunk runs once and does not run it again unless you change the code in that chunk. You can also leave out variables you don't want to include using select(-var1, -var2, ...), or choose only some to include with select(var1, var2, ...). Also see useful helper functions like ?tidyselect::starts_with.
my_data |>
select(-year, -island) |>
ggpairs(progress = FALSE)
my_data |>
select(contains("_")) |>
ggpairs(progress = FALSE)
Now pick at least one of your numeric predictor variables and include at least one non-linear transformation. For example, you could include a polynomial in x of degree d with y ~ poly(x, d), or include interactions between x1 and x2 with y ~ x1 * x2. Use the glance, tidy, and plot functions again on this new model and note any differences.
model_transformed <- lm(
body_mass_g ~ poly(bill_length_mm, 2) + bill_depth_mm + sex,
data = my_data)
model_transformed |>
tidy() |>
knitr::kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 8164.45678 | 251.69363 | 32.4380753 | 0.0000000 |
| poly(bill_length_mm, 2)1 | 3634.35772 | 530.78322 | 6.8471601 | 0.0000000 |
| poly(bill_length_mm, 2)2 | 47.67023 | 468.53368 | 0.1017434 | 0.9190225 |
| bill_depth_mm | -257.70260 | 15.39458 | -16.7398241 | 0.0000000 |
| sexmale | 923.73929 | 60.96442 | 15.1521037 | 0.0000000 |
model_transformed |>
glance() |>
knitr::kable(digits = 3)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.687 | 0.683 | 453.044 | 180.194 | 0 | 4 | -2506.612 | 5025.224 | 5048.073 | 67321598 | 328 | 333 |
plot(model_transformed)
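The poly() term above is one kind of transformation; the interaction mentioned earlier (y ~ x1 * x2) can be specified the same way. A minimal sketch, choosing bill length and sex for illustration:

# hypothetical alternative model with a bill_length_mm:sex interaction
model_interaction <- lm(
  body_mass_g ~ bill_length_mm * sex + bill_depth_mm,
  data = my_data)
tidy(model_interaction) |>
  knitr::kable(digits = 2)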
The pair plots from e.g. ggnostic or ggpairs show many 2-dimensional projections of the data, but there is no guarantee that these projections together help us understand higher-dimensional relationships.

Question: What does this mean for diagnostic plots when our regression model is high dimensional (e.g. p > 3 predictors)?

Answer: It is possible there is some higher-dimensional pattern in the residuals that the pairs plot does not reveal. This would mean the model failed to capture some (non-linear) signal and our diagnostic method failed to show us that flaw.
Now we want to simulate data in various ways, fit regression models to the data while pretending we don’t know how it was generated, and then see how those models perform. We want to understand the limitations of regression models. If something was wrong with the model, would we be able to tell?
Choose a functional form for the true outcome data generation process
f <- function(x1, x2, x3) x1 + 2*x2 - x2^2 # change this
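# note: as written, f ignores x3, so x3 has no true effect on y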
Generate several predictor variables, an error term, and an outcome
simulate_data <- function(n) {
x1 <- rnorm(n, mean = 3, sd = 3) # change
x2 <- runif(n, min = 0, max = 2) # change
x3 <- rexp(n, rate = 2) # change
errors <- rnorm(n, sd = 1) # change
y <- f(x1, x2, x3) + errors
data.frame(y = y,
x1 = x1,
x2 = x2,
x3 = x3)
}
Generate one dataset and use ggpairs to show pair plots.
n <- 100
sim_data <- simulate_data(n)
ggpairs(sim_data, progress = FALSE)
Repeat the following, making various choices: fit a model to the simulated data, then examine it with tidy, glance, ggcoef, or ggnostic plots.
my_model <- lm(y ~ ., data = sim_data)
my_model |> ggcoef()
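One concrete version of this (a sketch using the f and simulate_data defined above): fit the linear model alongside one that matches the true quadratic form in x2, and compare their glance() summaries.

# my_model above is linear in x2; compare it to a fit that matches
# the quadratic form of the true f
model_quadratic <- lm(y ~ x1 + poly(x2, 2) + x3, data = sim_data)
bind_rows(
  linear    = glance(my_model),
  quadratic = glance(model_quadratic),
  .id = "model")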
There are many things you can play with above: the true functional form f, the distributions of the predictors, and the error term (e.g. changing its sd or using some other centered distribution).

Challenge: can you find some potential dangers of using regression models, where the model is wrong in some important way and our diagnostic methods fail to show us the problem?
Challenge: what are some strengths or robustness properties of regression models that we should appreciate and trust in the long run? i.e. are there ways the model can be wrong but our conclusions from it will still (usually?) be good enough to use anyway?
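A hedged sketch of one such danger (all numbers below are made up): an unobserved variable that is correlated with an included predictor biases the estimated coefficient, yet the usual residual diagnostics look clean, because least-squares residuals are uncorrelated with the included predictors by construction.

z <- rnorm(n)                     # unobserved confounder
x1_c <- z + rnorm(n)              # observed predictor, correlated with z
y_c <- x1_c + 2 * z + rnorm(n)    # true coefficient on x1_c is 1
confounded_fit <- lm(y_c ~ x1_c)  # z is left out of the model
coef(confounded_fit)              # slope is biased (roughly 2, not 1)
plot(confounded_fit, which = 1)   # residuals vs fitted: no visible warning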