knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
theme_set(theme_minimal(base_size = 14))
set.seed(1)

Regression models with real data

In this analysis you are encouraged to use a dataset that you found interesting. Your dataset should have a numeric outcome variable and multiple predictor variables, and at least one of the predictors should also be numeric. Your dataset should not have any special structure that would clearly violate independence assumptions, e.g. observations with possible spatial or temporal correlation. For this reason we previously used the gapminder dataset filtered to a single year of data, so the same country would not appear multiple times in the plot or model.

Note: we also cannot handle missing values right now, so you may use drop_na() to remove incomplete rows.

# Load dataset and pre-process if necessary
library(palmerpenguins)
penguins |> 
  drop_na() -> my_data # right-assignment: unconventional style, but fun!

Simple linear regression

Choose one predictor variable and fit a simple linear model.

model_simple <- lm(body_mass_g ~ bill_length_mm, data = my_data)

Use each of the functions below on your model object and try to understand the output:

  • summary
  • predict
  • residuals
  • coef
  • confint
  • plot

Note that many of these are generic functions that work on many different types of objects, e.g. models fit by other methods. To learn more about each function f above, type ?f in the console, and to see more specifically how the function works on linear model objects, type ?f.lm, e.g. ?plot.lm

summary(model_simple)
## 
## Call:
## lm(formula = body_mass_g ~ bill_length_mm, data = my_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1759.38  -468.82    27.79   464.20  1641.00 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     388.845    289.817   1.342    0.181    
## bill_length_mm   86.792      6.538  13.276   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 651.4 on 331 degrees of freedom
## Multiple R-squared:  0.3475, Adjusted R-squared:  0.3455 
## F-statistic: 176.2 on 1 and 331 DF,  p-value: < 2.2e-16
# predict(model_simple)   # fitted values for the original data (or predictions for new data)
# residuals(model_simple) # observed minus fitted values
# coef(model_simple)      # estimated intercept and slope
# confint(model_simple)   # 95% confidence intervals for the coefficients
plot(model_simple)

Question: How do we interpret diagnostic plots?

Answer: We are generally looking for patterns in the residuals, approximate normality of their distribution (points close to the diagonal line in the QQ plot), constant variance (homoscedasticity), and influential points (high-leverage outliers).
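
One convenient way to see all four base-R diagnostic plots at once (a minimal sketch; the 2x2 layout is just a display choice):

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(model_simple)
par(mfrow = c(1, 1))  # reset the plotting layout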

Exercise: Create a scatterplot showing your simple linear model

my_data |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")

Multiple linear regression

Now repeat the above steps but with a model using more than one predictor.

model_multiple <- lm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + sex,
  data = my_data)

Try using each of the previous functions on your model object. Notice what’s different about the output.

summary(model_multiple)
## 
## Call:
## lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + sex, 
##     data = my_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1342.91  -296.54    -5.81   313.79  1228.82 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6551.44     394.41  16.611  < 2e-16 ***
## bill_length_mm    36.52       5.30   6.891 2.84e-11 ***
## bill_depth_mm   -257.31      14.89 -17.278  < 2e-16 ***
## sexmale          923.30      60.72  15.205  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 452.4 on 329 degrees of freedom
## Multiple R-squared:  0.6872, Adjusted R-squared:  0.6844 
## F-statistic:   241 on 3 and 329 DF,  p-value: < 2.2e-16
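
The other functions from the list work the same way on the multiple regression; for example (a minimal sketch):

confint(model_multiple)  # one interval per coefficient, including the sexmale dummy
plot(model_multiple)     # same four diagnostic plots, now for the multiple regression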

Some helpful packages

The broom and GGally packages have some functions that are useful when used on lm objects (and also on some other types of models).

library(broom)
library(GGally)

Use the tidy and glance functions from the broom package. Compare to the output from summary. Check out the documentation ?tidy.lm. For nice document formatting you can also try piping |> the output from tidy() into the knitr::kable() function.

tidy(model_multiple)
## # A tibble: 4 × 5
##   term           estimate std.error statistic  p.value
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      6551.     394.       16.6  2.00e-45
## 2 bill_length_mm     36.5      5.30      6.89 2.84e-11
## 3 bill_depth_mm    -257.      14.9     -17.3  4.67e-48
## 4 sexmale           923.      60.7      15.2  6.46e-40
# options(digits = 2)
tidy(model_multiple, conf.int = TRUE) |>
  knitr::kable(digits = 2)
term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)      6551.44     394.41      16.61        0   5775.56    7327.32
bill_length_mm     36.52       5.30       6.89        0     26.09      46.95
bill_depth_mm    -257.31      14.89     -17.28        0   -286.61    -228.02
sexmale           923.31      60.72      15.21        0    803.85    1042.76
glance(model_multiple)
## # A tibble: 1 × 12
##   r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
##       <dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1     0.687        0.684  452.    241. 1.10e-82     3 -2507. 5023. 5042.  6.73e7
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

Now use the ggcoef and ggnostic functions from GGally.

ggcoef(model_multiple)

ggnostic(model_multiple)

Note: there is also a function ggpairs useful for exploratory data analysis, but beware that it can be slow because it computes many pairwise plots. You may not want to rerun that code many times. There is a chunk option cache=TRUE that saves the result after running it once and reuses the cached result on later runs unless you change the code chunk. You can also leave out variables you don’t want to include using select(-var1, -var2, ...) or choose only some to include with select(var1, var2, ...). Also see useful helper functions like ?tidyselect::starts_with.

my_data |>
  select(-year, -island) |>
  ggpairs(progress = FALSE)

my_data |>
  select(contains("_")) |>
  ggpairs(progress = FALSE)
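
As one example of the tidyselect helpers mentioned above (a small sketch; selecting species together with the bill measurements is just an illustration):

my_data |>
  select(species, starts_with("bill")) |>  # species plus all columns whose names start with "bill"
  ggpairs(progress = FALSE)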

Including non-linear transformed predictors

Now pick at least one of your numeric predictor variables and include at least one non-linear transformation. For example, you could include a polynomial in x of degree d with y ~ poly(x, d), or include an interaction between x1 and x2 with y ~ x1 * x2. Use the glance, tidy, and plot functions again on this new model and note any differences.

model_transformed <- lm(
  body_mass_g ~ poly(bill_length_mm, 2) + bill_depth_mm + sex,
  data = my_data)
model_transformed |>
  tidy() |>
  knitr::kable()
term                        estimate  std.error    statistic    p.value
(Intercept)               8164.45678  251.69363   32.4380753  0.0000000
poly(bill_length_mm, 2)1  3634.35772  530.78322    6.8471601  0.0000000
poly(bill_length_mm, 2)2    47.67023  468.53368    0.1017434  0.9190225
bill_depth_mm             -257.70260   15.39458  -16.7398241  0.0000000
sexmale                    923.73929   60.96442   15.1521037  0.0000000
model_transformed |>
  glance() |>
  knitr::kable(digits = 3)
r.squared  adj.r.squared    sigma  statistic  p.value  df     logLik       AIC       BIC  deviance  df.residual  nobs
    0.687          0.683  453.044    180.194        0   4  -2506.612  5025.224  5048.073  67321598          328   333
plot(model_transformed)
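
The instructions above also mention interactions. A minimal sketch of one possible interaction model (the particular choice of bill_length_mm * sex is just an illustration, not part of the assignment):

model_interaction <- lm(
  body_mass_g ~ bill_length_mm * sex + bill_depth_mm,  # main effects plus a length-by-sex interaction
  data = my_data)
model_interaction |>
  tidy() |>
  knitr::kable(digits = 2)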

Problem: high dimensional plots

The pair plots from e.g. ggnostic or ggpairs show many 2-dimensional projections of the data, but there is no guarantee that these projections together help us understand higher dimensional relationships.

Question: What does this mean for diagnostic plots when our regression model is high dimensional (e.g. p > 3 predictors)?

Answer: It is possible there is some higher dimensional pattern in the residuals that the pairs plot does not reveal to us. This would mean the model failed to capture some (non-linear) signal and our diagnostic method failed to reveal that flaw to us.
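
As a small illustration of this point (a hypothetical sketch, not part of the penguins analysis): simulate an outcome that depends on the product of two predictors, fit a purely additive model, and notice that residuals plotted against either predictor alone can look unremarkable while residuals plotted against the product show a clear pattern.

demo <- data.frame(u1 = rnorm(500), u2 = rnorm(500))
demo$y <- demo$u1 * demo$u2 + rnorm(500)      # true signal is an interaction
fit_demo <- lm(y ~ u1 + u2, data = demo)      # additive model misses it
par(mfrow = c(1, 3))
plot(demo$u1, residuals(fit_demo))            # marginally, residuals look fine
plot(demo$u2, residuals(fit_demo))            # marginally, residuals look fine
plot(demo$u1 * demo$u2, residuals(fit_demo))  # pattern appears only against the product
par(mfrow = c(1, 1))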

Regression models with simulated data

Now we want to simulate data in various ways, fit regression models to the data while pretending we don’t know how it was generated, and then see how those models perform. We want to understand the limitations of regression models. If something was wrong with the model, would we be able to tell?

Choose a functional form for the true outcome data generation process

f <- function(x1, x2, x3) x1 + 2*x2 - x2^2 # change this (x3 is deliberately unused here)

Generate several predictor variables, an error term, and an outcome

simulate_data <- function(n) {
  x1 <- rnorm(n, mean = 3, sd = 3) # change
  x2 <- runif(n, min = 0, max = 2) # change
  x3 <- rexp(n, rate = 2) # change
  errors <- rnorm(n, sd = 1) # change
  y <- f(x1, x2, x3) + errors
  data.frame(y = y,
             x1 = x1,
             x2 = x2,
             x3 = x3)
}

Generate one dataset and use ggpairs to show pair plots.

n <- 100
sim_data <- simulate_data(n)
ggpairs(sim_data, progress = FALSE)

Repeat the following, making various choices:

  1. Fit a linear regression model using some choice of predictors, possibly with non-linear transformations as well
  2. Look at the output from functions like tidy, glance, ggcoef, or ggnostic plots
  3. Consider whether the model is right/wrong and whether the output reveals the truth about the model
my_model <- lm(y ~ ., data = sim_data)
my_model |> ggcoef()
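
For comparison (a hypothetical follow-up, assuming the true f above still includes the quadratic term in x2), you could also fit a model whose form matches the truth and see how the fit statistics differ:

my_model_quad <- lm(y ~ x1 + poly(x2, 2), data = sim_data)
glance(my_model)$adj.r.squared       # misspecified: linear in x2, includes irrelevant x3
glance(my_model_quad)$adj.r.squared  # form matches the quadratic truth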

Experiment!

There are many things you can play with above:

  • number and distributions of predictor variables
  • form of the true outcome generating function
  • distribution of the noise (either just changing sd or using some other centered distribution)
  • sample size
  • choice of fitted model
  • which aspect of the model to interpret or check against the truth (see the sketch just below this list for one example)
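
One concrete way to check a fitted model against the truth, sketched below under the assumption that f and simulate_data are as defined above (the 200 repetitions and the focus on the x1 coefficient are just illustrative choices): repeat the simulation many times and see how often the 95% confidence interval for the x1 coefficient covers its true value of 1.

n_reps <- 200
covered <- replicate(n_reps, {
  sim <- simulate_data(100)
  fit <- lm(y ~ ., data = sim)
  ci  <- confint(fit)["x1", ]
  ci[1] <= 1 & 1 <= ci[2]   # true coefficient of x1 in f is 1
})
mean(covered)               # proportion of intervals covering the truth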

Challenge: can you find some potential dangers of using regression models, where the model is wrong in some important way and our diagnostic methods fail to show us the problem?

Challenge: what are some strengths or robustness properties of regression models that we should appreciate and trust in the long run? i.e. are there ways the model can be wrong but our conclusions from it will still (usually?) be good enough to use anyway?