After running library(gapminder) you can access the dataset. Try typing View(gapminder) in the Console window below.
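The chunks below also use functions from a few other packages; assuming they are already installed, a minimal setup chunk might look like this (broom is what provides the augment() function used later).
library(gapminder)  # the gapminder dataset
library(dplyr)      # filter(), summarize(), and the %>% pipe
library(ggplot2)    # ggplot(), geom_point(), geom_line()
library(broom)      # augment(), for extracting fitted values and residuals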
Create a gapminder scatterplot (geom_point) using data from the year 2002. First create an object with the right subset of data (hint: remember the filter function).
gm2002 <- gapminder %>%
  filter(year == 2002)
Now use this data to create the scatterplot.
gm_scatterplot <- ggplot(gm2002, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
gm_scatterplot
Fit an lm model to predict lifeExp from gdpPercap.
model_lm <- lm(lifeExp ~ gdpPercap, data = gm2002)
predictions_lm <- augment(model_lm)  # augment() from broom attaches fitted values and residuals
Fit a loess model to do the same.
model_loess <- loess(lifeExp ~ gdpPercap, span = .75, data = gm2002)
predictions_loess <- augment(model_loess)
gm_scatterplot +
  geom_line(data = predictions_lm, size = 1,
            color = "blue",
            linetype = "dashed",
            aes(x = gdpPercap, y = .fitted)) +
  geom_line(data = predictions_loess, size = 1,
            color = "green",
            aes(x = gdpPercap, y = .fitted))
Now try changing span in the loess() function to some other number (strictly) between 0 and 1, and re-run the code chunks. First re-run the code chunk that fits the model with loess(), then re-run the chunk that plots the predictions. Observe how the fitted curve is different.
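For example (a sketch with an arbitrary span value, and hypothetical object names so the chunks above are not overwritten), a smaller span uses less of the data for each local fit and produces a wigglier curve:
model_loess_smallspan <- loess(lifeExp ~ gdpPercap, span = .3, data = gm2002)  # smaller span: more local, wigglier
predictions_loess_smallspan <- augment(model_loess_smallspan)
# swap these names into the plotting chunk above to see the difference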
Compare the mean of the squared residuals() of each model.
mean(residuals(model_lm)^2)
## [1] 80.11716
mean(residuals(model_loess)^2)
## [1] 45.21583
Which model is better? Why?
Models are supposed to capture structure in the data that corresponds to structure in the real world. And if the real world isn't misbehaving, that structure should be somewhat stable.
For example, suppose the relationship changed dramatically from one time period to another. Then a model fit on data from one period would be less useful or interesting, because it might fit poorly on data from a different period.
Let's explore this with our gapminder models.
Create a dataset for the year 1997.
gm1997 <- gapminder %>% filter(year == 1997)
Get predictions for 1997 from both models by predicting with the newdata argument.
lm_predict1997 <- augment(model_lm, newdata = gm1997)
loess_predict1997 <- augment(model_loess, newdata = gm1997)
lm_predict1997 |>
  summarize(MSE = mean(.resid^2))
## # A tibble: 1 × 1
## MSE
## <dbl>
## 1 67.2
loess_predict1997 |>
  summarize(MSE = mean(.resid^2))
## # A tibble: 1 × 1
## MSE
## <dbl>
## 1 31.6
Question: Is it surprising which model does better? Why or why not?
The more complex loess model performs better than the linear model even when tested on different data. Apparently the association between gdpPercap and lifeExp is fairly stable over time.
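One way to probe that stability further (an optional sketch, not part of the original exercise) is to score the 2002 linear model against every year in gapminder; the loess model could be scored the same way, though predict() for loess returns NA for gdpPercap values outside the range it was fit on.
years <- sort(unique(gapminder$year))
mse_by_year <- sapply(years, function(yr) {
  gm_yr <- filter(gapminder, year == yr)
  mean((gm_yr$lifeExp - predict(model_lm, newdata = gm_yr))^2)  # lm MSE for that year
})
setNames(round(mse_by_year, 1), years)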
Choose a function. Be creative! Type ?sqrt in the Console. Also try ?log and ?sin.
f <- function(x) x^2 # change this
Come back here, change things, and re-run the code chunks below to experiment.
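For instance (these are arbitrary examples with hypothetical names, not required choices), a non-monotone or strongly curved function makes the difference between the linear and loess fits easier to see:
f_wavy  <- function(x) sin(2 * x)            # oscillating
f_curvy <- function(x) sqrt(x) + log(x + 1)  # increasing but strongly curved
# paste one of these definitions in place of f above, then re-run the chunks below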
Type ?Distributions in the Console to see other random variables you can choose from. Functions whose names start with r generate random samples, e.g. rexp generates a sample from an exponential random variable.
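For example (purely illustrative), each of these draws five values from a different distribution:
rnorm(5, mean = 0, sd = 1)  # normal
rexp(5, rate = 1)           # exponential
runif(5, min = 0, max = 3)  # uniform on [0, 3]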
simulate_data <- function(n) {
  x <- runif(n, min = 0, max = 3) # change
  errors <- rnorm(n, sd = 1) # change
  y <- f(x) + errors
  data.frame(x = x, y = y)
}
Generate and plot data
n <- 100
sim_data <- simulate_data(n)
sim_plot <- ggplot(sim_data, aes(x = x, y = y)) +
  geom_point()
sim_plot
Fit lm and loess models to the simulated data and plot them (hint: copy and paste your earlier code, change gm2002 and other variable names).
model_sim_lm <- lm(y ~ x, data = sim_data)
predictions_sim_lm <- augment(model_sim_lm)
model_sim_loess <- loess(y ~ x, span = .75, data = sim_data)
predictions_sim_loess <- augment(model_sim_loess)
sim_plot +
  geom_line(data = predictions_sim_lm, size = 1,
            color = "blue",
            linetype = "dashed",
            aes(x = x, y = .fitted)) +
  geom_line(data = predictions_sim_loess, size = 1,
            color = "green",
            linetype = "dotdash",
            aes(x = x, y = .fitted)) +
  geom_function(fun = f, color = "purple", size = 1)
coef(model_sim_lm)
## (Intercept) x
## -1.924262 3.309123
coef(model_sim_lm)[2]
## x
## 3.309123
Question: How is this slope related to the true function f?
Answer: If the true model is linear then we could estimate the true slope. But if the true model is not linear, we may be estimating something only weakly related to the true relationship, like a weighted average of the slopes over the range of x.
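As a rough, informal check of that "weighted average of slopes" idea (a sketch under the default settings above, where f(x) = x^2 and x is uniform on [0, 3]), the average derivative f'(x) = 2x over that range is about 3, which is in the same ballpark as the fitted slope:
xs <- runif(1e5, min = 0, max = 3)  # many draws from the same x distribution
mean(2 * xs)                        # average of the derivative f'(x) = 2x, roughly 3
coef(model_sim_lm)[2]               # fitted slope, for comparison
This is only a heuristic comparison; the least-squares slope is not exactly the average derivative in general.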
n_iters <- 5
simulate_slope <- function(n) {
  new_data <- simulate_data(n)
  new_model <- lm(y ~ x, new_data)
  coef(new_model)[2]
}
replicate(n_iters, simulate_slope(n))
## x x x x x
## 2.782417 2.922220 2.929165 3.071938 2.948885
n_iters <- 200
many_slopes <- replicate(n_iters, simulate_slope(n))
qplot(many_slopes, bins = 20)
Question: What do you notice about this distribution, and why?
Answer: It seems approximately normal. This happens because of the central limit theorem: the least-squares slope is a weighted sum of many independent observations, so for large n its sampling distribution is approximately normal.
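One informal way to check this (an optional sketch, not part of the original lab) is a normal quantile-quantile plot of the simulated slopes, along with their mean and standard deviation:
qqnorm(many_slopes)  # points close to a straight line suggest approximate normality
qqline(many_slopes)
c(mean = mean(many_slopes), sd = sd(many_slopes))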
Go back and change some inputs in previous code chunks and re-run all the chunks after that change.
Can you find anything that seems like it makes an important, qualitative difference in the conclusions?
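For example (one possible experiment, not a prescribed answer), you could make the noise in simulate_data much larger, say sd = 5 instead of sd = 1, and re-run everything that follows; the fitted curves become much less stable, and the difference between the loess and linear fits gets harder to see.
simulate_data <- function(n) {
  x <- runif(n, min = 0, max = 3)
  errors <- rnorm(n, sd = 5)  # much noisier than before
  y <- f(x) + errors
  data.frame(x = x, y = y)
}
# then re-run the chunks that generate, fit, and plot the simulated data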