Inference for the lasso

Generate data for a high-dimensional regression

n <- 100
p <- 200
X <- matrix(rnorm(n*p), nrow = n)
Y <- rnorm(n)

Fit the lasso solution path and choose a sparsity level automatically

Note: this is an alternative to cross-validation for choosing the amount of regularization
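
For reference, the cross-validation route would look roughly like this (a sketch only, not run here; it assumes the glmnet package is installed):

library(glmnet)
# choose the penalty by 10-fold cross-validation
cv_fit <- cv.glmnet(X, Y)
# coefficients at the CV-chosen penalty (most are exactly zero)
coef(cv_fit, s = "lambda.min")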

library(selectiveInference)
# Compute lasso path
# while storing information required for
# conditional (truncated) inference
lasso_path <- lar(X, Y)
# Use an AIC-type rule (mult = 2) to choose the model size
# and compute adjusted p-values;
# in higher dimensions, consider changing mult
# to log(n) for BIC or log(p) for RIC
lasso_inference <- larInf(lasso_path, mult = 2, type = "aic", ntimes = 1)
## Warning in larInf(lasso_path, mult = 2, type = "aic", ntimes = 1): p > n/
## 2, and sd(y) = 0.969 used as an estimate of sigma; you may want to use the
## estimateSigma function
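
As the warning suggests, the noise level can be estimated explicitly and passed to larInf (a sketch; estimateSigma is in the selectiveInference package and larInf accepts a sigma argument):

# estimate sigma rather than defaulting to sd(y)
sigma_hat <- estimateSigma(X, Y)$sigmahat
larInf(lasso_path, sigma = sigma_hat, mult = 2, type = "aic", ntimes = 1)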

Selected predictors

active_set <- lasso_inference$vars
active_set
## [1]  87  53   7 120 139 158   6 196

Fit an OLS model using only selected variables, compute p-values the classical way

(Use summary(), or tidy() from the broom package)

library(broom)
OLS_fit <- lm(Y ~ X[, active_set] - 1)
tidy(OLS_fit)
## # A tibble: 8 x 5
##   term             estimate std.error statistic p.value
##   <chr>               <dbl>     <dbl>     <dbl>   <dbl>
## 1 X[, active_set]1    0.196    0.0857      2.28 0.0247 
## 2 X[, active_set]2   -0.239    0.0819     -2.92 0.00439
## 3 X[, active_set]3   -0.235    0.0788     -2.99 0.00363
## 4 X[, active_set]4    0.196    0.0814      2.40 0.0182 
## 5 X[, active_set]5   -0.191    0.0836     -2.29 0.0244 
## 6 X[, active_set]6   -0.158    0.0827     -1.91 0.0595 
## 7 X[, active_set]7   -0.155    0.0794     -1.96 0.0531 
## 8 X[, active_set]8   -0.151    0.0730     -2.07 0.0417

Adjusted inference

These p-values (from the larInf function in the selectiveInference package above) are calculated from a conditional distribution: they condition on the event that this particular model was selected

lasso_inference
## 
## Call:
## larInf(obj = lasso_path, type = "aic", mult = 2, ntimes = 1)
## 
## Standard deviation of noise (specified or estimated) sigma = 0.969
## 
## Testing results at step = 8, with alpha = 0.100
## 
##  Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
##   87  0.193   1.815   0.372    -5.087    8.839        0.05       0.05
##   53 -0.230  -2.251   0.232    -9.297    2.413        0.05       0.05
##    7 -0.233  -2.382   0.243      -Inf    4.446        0.00       0.05
##  120  0.207   2.025   0.282    -5.442      Inf        0.05       0.00
##  139 -0.206  -1.949   0.718   -10.145      Inf        0.05       0.00
##  158 -0.161  -1.570   0.889    -3.539      Inf        0.05       0.00
##    6 -0.142  -1.414   0.659      -Inf      Inf        0.00       0.00
##  196 -0.146  -1.603   0.446      -Inf      Inf        0.00       0.00
## 
## Estimated stopping point from AIC rule = 8
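
To line the two sets of p-values up directly, something like the following works (a sketch; it assumes the larInf object stores its p-values in a component named pv, in the same order as vars):

# classical (same-data OLS) vs. selection-adjusted p-values
data.frame(
  variable  = active_set,
  classical = tidy(OLS_fit)$p.value,
  selective = lasso_inference$pv
)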

How do the two sets of p-values compare? Which are “better”, and why?

Classical inference is biased toward significance: even though the true coefficients are all zero, we get many significant p-values. This is model selection bias. (Classical inference methods assume the model was chosen in advance; they become invalid if the model is chosen using the same data that is then used for testing.)
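
A small simulation makes the bias concrete: repeat the select-then-test procedure on pure-noise data and record how many of the classical p-values fall below 0.05 (a sketch; it reuses n and p from above and may take a little while to run):

n_reps <- 20
sig_fraction <- replicate(n_reps, {
  X_sim <- matrix(rnorm(n * p), nrow = n)
  Y_sim <- rnorm(n)
  # select a model with the lasso, then test it classically on the same data
  vars <- suppressWarnings(larInf(lar(X_sim, Y_sim), mult = 2, type = "aic", ntimes = 1))$vars
  fit <- lm(Y_sim ~ X_sim[, vars] - 1)
  # proportion of selected coefficients "significant" at 5% despite no signal
  mean(summary(fit)$coefficients[, 4] < 0.05)
})
mean(sig_fraction)  # typically well above the nominal 0.05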

Compare to p-values calculated on a new set of test data

X_test <- matrix(rnorm(n*p), nrow = n)
Y_test <- rnorm(n)
lm(Y_test ~ X_test[, active_set] - 1) %>% tidy()
## # A tibble: 8 x 5
##   term                  estimate std.error statistic p.value
##   <chr>                    <dbl>     <dbl>     <dbl>   <dbl>
## 1 X_test[, active_set]1 -0.115      0.102    -1.12     0.265
## 2 X_test[, active_set]2  0.0896     0.0939    0.954    0.342
## 3 X_test[, active_set]3 -0.138      0.0930   -1.48     0.142
## 4 X_test[, active_set]4  0.0214     0.104     0.206    0.837
## 5 X_test[, active_set]5 -0.0865     0.0946   -0.915    0.363
## 6 X_test[, active_set]6 -0.0944     0.0892   -1.06     0.293
## 7 X_test[, active_set]7  0.0113     0.0986    0.114    0.909
## 8 X_test[, active_set]8  0.00529    0.104     0.0508   0.960

Inference calculated on test data does not suffer from model selection bias because this data was not used to select the model

Rerun the code several times and observe the variability

The model changes each time, and so do the p-values, but the qualitative conclusion remains the same:

  • Classical inference methods are biased if they are computed on the same data that was used to select the model
  • Inference on test data can avoid this problem
  • Specialized inference methods can correct the bias without needing more test data (advanced topic)

Change Y so that it is generated from a coefficient vector whose first 5 entries are nonzero, and then run the code above again a few more times

X <- matrix(rnorm(n*p), nrow = n)
beta <- c(rep(1, 5), rep(0, p - 5))
Y <- X %*% beta + rnorm(n)
X_test <- matrix(rnorm(n*p), nrow = n)
Y_test <- X_test %*% beta + rnorm(n)
# Note: active_set here is still the model selected on the earlier noise-only data;
# rerun the lar()/larInf() steps above on the new X and Y to update it
lm(Y_test ~ X_test[, active_set] - 1) %>% tidy()
## # A tibble: 8 x 5
##   term                  estimate std.error statistic p.value
##   <chr>                    <dbl>     <dbl>     <dbl>   <dbl>
## 1 X_test[, active_set]1   0.0132     0.252    0.0524  0.958 
## 2 X_test[, active_set]2   0.262      0.263    0.995   0.322 
## 3 X_test[, active_set]3   0.279      0.249    1.12    0.265 
## 4 X_test[, active_set]4  -0.448      0.259   -1.73    0.0876
## 5 X_test[, active_set]5   0.261      0.250    1.04    0.300 
## 6 X_test[, active_set]6   0.107      0.254    0.422   0.674 
## 7 X_test[, active_set]7   0.241      0.260    0.925   0.357 
## 8 X_test[, active_set]8  -0.359      0.273   -1.32    0.192
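
To make the rerunning exercises less tedious, the whole pipeline can be wrapped in one helper (a sketch; the name run_once is made up here, and it assumes the larInf object stores its p-values in pv):

run_once <- function(beta = rep(0, p)) {
  # fresh training data, lasso selection, and selection-adjusted inference
  X <- matrix(rnorm(n * p), nrow = n)
  Y <- drop(X %*% beta) + rnorm(n)
  inf <- suppressWarnings(larInf(lar(X, Y), mult = 2, type = "aic", ntimes = 1))
  vars <- inf$vars
  # independent test data for the held-out comparison
  X_test <- matrix(rnorm(n * p), nrow = n)
  Y_test <- drop(X_test %*% beta) + rnorm(n)
  data.frame(
    variable  = vars,
    classical = tidy(lm(Y ~ X[, vars] - 1))$p.value,
    selective = inf$pv,
    test_data = tidy(lm(Y_test ~ X_test[, vars] - 1))$p.value
  )
}
run_once()                                    # pure noise
run_once(beta = c(rep(1, 5), rep(0, p - 5)))  # first 5 coefficients nonzero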