library(selectiveInference)  # lar(), larInf(), estimateSigma()
library(broom)               # tidy()
library(magrittr)            # %>%

set.seed(1)
n <- 100  # observations
p <- 200  # predictors (high-dimensional: p > n)
X <- matrix(rnorm(n*p), nrow = n)
Y <- rnorm(n)  # pure noise: every true coefficient is zero
Note: choosing the model size with an information criterion, as done below, is an alternative to cross-validation.
# Compute lasso path
# while storing information required for
# conditional (truncated) inference
lasso_path <- lar(X, Y)
# Use AIC/BIC to choose the model size
# and compute adjusted p-values.
# In higher dimensions, change mult
# to log(n) for BIC or log(p) for RIC
sigma_hat <- estimateSigma(X, Y)$sigmahat  # estimate the noise level
lasso_inference <- larInf(lasso_path,
                          sigma = sigma_hat,
                          mult = 2,
                          type = "aic",
                          ntimes = 1)
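As the comments note, the other information criteria only change the multiplier; for example, the same call with the BIC or RIC penalties would be:
larInf(lasso_path, sigma = sigma_hat, mult = log(n), type = "aic", ntimes = 1)  # BIC
larInf(lasso_path, sigma = sigma_hat, mult = log(p), type = "aic", ntimes = 1)  # RIC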
The selected predictors:
active_set <- lasso_inference$vars
active_set
## [1] 56 174 92 114
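For comparison, the cross-validation alternative mentioned above would typically use the glmnet package; a minimal sketch (cv.glmnet chooses the penalty by 10-fold cross-validation, but provides no selection-adjusted p-values):
library(glmnet)
cv_fit <- cv.glmnet(X, Y)
beta_cv <- as.numeric(coef(cv_fit, s = "lambda.min"))[-1]  # drop the intercept
which(beta_cv != 0)  # predictors selected by CV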
(Use summary() or tidy() from the broom package to see the classical, unadjusted p-values for the selected model.)
OLS_fit <- lm(Y ~ X[, active_set] - 1)
tidy(OLS_fit)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 X[, active_set]1 -0.337 0.0768 -4.39 0.0000294
## 2 X[, active_set]2 -0.284 0.0831 -3.42 0.000917
## 3 X[, active_set]3 -0.242 0.0893 -2.71 0.00793
## 4 X[, active_set]4 0.202 0.0826 2.44 0.0164
These p-values (from the larInf function in the selectiveInference package used above) are calculated from a conditional distribution, conditioning on the model selection event:
lasso_inference
##
## Call:
## larInf(obj = lasso_path, sigma = sigma_hat, type = "aic", mult = 2,
## ntimes = 1)
##
## Standard deviation of noise (specified or estimated) sigma = 0.923
##
## Testing results at step = 4, with alpha = 0.100
##
## Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
## 56 -0.336 -3.818 0.495 -1.440 1.864 0.05 0.05
## 174 -0.284 -3.003 0.737 -5.998 Inf 0.05 0.00
## 92 -0.242 -2.385 0.485 -Inf Inf 0.00 0.00
## 114 0.202 2.147 0.680 -Inf 5.084 0.00 0.05
##
## Estimated stopping point from AIC rule = 4
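To build intuition for why these adjusted p-values are so much larger than the classical ones, here is a toy sketch (not the selectiveInference internals): the conditional p-value for a single N(0, 1) statistic, given that selection required it to cross a threshold c.
# P(|Z| >= |z| given |Z| > c): selection truncates the null distribution,
# which inflates the p-value relative to the unconditional 2 * pnorm(-abs(z))
truncated_p <- function(z, c) pnorm(-abs(z)) / pnorm(-c)
2 * pnorm(-2.5)             # naive two-sided p-value, about 0.012
truncated_p(2.5, c = 1.96)  # conditional on selection, about 0.25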
Classical inference is biased toward significance: even though the true coefficients are all zero, we get several significant p-values. This is model selection bias. Classical inference methods assume the model was chosen in advance of seeing the data; they become invalid when the model is chosen using the same data.
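A quick simulation (a sketch, with hypothetical dimensions) makes the bias concrete: under the global null, naively testing the single best-correlated predictor rejects far more often than the nominal 5%.
naive_p <- replicate(500, {
  X_sim <- matrix(rnorm(50 * 20), nrow = 50)
  Y_sim <- rnorm(50)  # global null: no true signal
  best <- which.max(abs(cor(X_sim, Y_sim)))  # select using the data...
  summary(lm(Y_sim ~ X_sim[, best] - 1))$coefficients[1, 4]  # ...then test naively
})
mean(naive_p < 0.05)  # far above the nominal 0.05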
X_test <- matrix(rnorm(n*p), nrow = n)  # fresh data, independent of the selection
Y_test <- rnorm(n)                      # still pure noise
lm(Y_test ~ X_test[, active_set] - 1) %>% tidy()
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 X_test[, active_set]1 0.0295 0.101 0.293 0.771
## 2 X_test[, active_set]2 -0.0325 0.103 -0.317 0.752
## 3 X_test[, active_set]3 -0.0197 0.110 -0.179 0.858
## 4 X_test[, active_set]4 -0.0171 0.0983 -0.174 0.863
Inference calculated on test data does not suffer from model selection bias, because this data wasn't used to select the model.
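In practice we rarely have a fresh test set, but the same idea can be imitated by sample splitting; a minimal sketch under the setup above (select on one half, do classical inference on the other):
split_idx <- sample(n, n / 2)  # half the data for selection
path_half <- lar(X[split_idx, ], Y[split_idx])
sigma_half <- estimateSigma(X[split_idx, ], Y[split_idx])$sigmahat
vars_half <- larInf(path_half, sigma = sigma_half,
                    mult = 2, type = "aic", ntimes = 1)$vars
# the held-out half was not used for selection, so classical inference is valid
lm(Y[-split_idx] ~ X[-split_idx, vars_half] - 1) %>% tidy()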
The model changes each time, and so do the p-values, but the qualitative conclusion remains the same. As an exercise, change Y to be generated from a coefficient vector where only the first 5 entries are nonzero, and then run the code again a few more times:
X <- matrix(rnorm(n*p), nrow = n)
beta <- c(rep(1, 5), rep(0, p - 5))  # only the first 5 coefficients are nonzero
Y <- X %*% beta + rnorm(n)
X_test <- matrix(rnorm(n*p), nrow = n)
Y_test <- X_test %*% beta + rnorm(n)
lm(Y_test ~ X_test[, active_set] - 1) %>% tidy()
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 X_test[, active_set]1 0.0464 0.245 0.189 0.850
## 2 X_test[, active_set]2 0.215 0.225 0.954 0.342
## 3 X_test[, active_set]3 0.347 0.255 1.36 0.176
## 4 X_test[, active_set]4 0.186 0.260 0.717 0.475
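Note that active_set here is still the one selected from the original pure-noise Y. To complete the exercise, re-run the selection steps on the new training data before testing; a sketch repeating the pipeline above:
lasso_path <- lar(X, Y)  # reselect on the new Y, which now contains signal
sigma_hat <- estimateSigma(X, Y)$sigmahat
active_set <- larInf(lasso_path, sigma = sigma_hat,
                     mult = 2, type = "aic", ntimes = 1)$vars
# predictors among the truly active first 5 should now test as significant
lm(Y_test ~ X_test[, active_set] - 1) %>% tidy()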