08. Penalized Regression and High-Dimensional Data
Motivation
Classical regression breaks down when the number of predictors approaches or exceeds the number of observations.
This is not a software limitation.
It is a statistical one.
High-dimensional settings require explicit regularization, careful resampling, and disciplined interpretation. This tutorial explains why penalized regression exists, how it behaves under guarded resampling, and what fastml does to prevent common failure modes.
The high-dimensional regime
A dataset is effectively high-dimensional when:
the number of predictors is large relative to sample size
predictors are correlated
signal is weak relative to noise
In this regime, unpenalized regression produces:
unstable coefficients
inflated apparent performance
extreme sensitivity to data splits
A perfect, or nearly perfect, in-sample fit is expected in this regime, and it is meaningless as evidence of predictive performance.
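The point is easy to demonstrate in a few lines (a standalone sketch, separate from the dataset built later in this tutorial): with at least as many parameters as observations, ordinary least squares reproduces pure noise exactly.

set.seed(1)
n_demo <- 30
p_demo <- 40                                   # more predictors than observations
x_demo <- as.data.frame(matrix(rnorm(n_demo * p_demo), n_demo, p_demo))
y_demo <- rnorm(n_demo)                        # pure noise: no signal at all
fit_demo <- lm(y_demo ~ ., data = cbind(y_demo = y_demo, x_demo))
summary(fit_demo)$r.squared                    # effectively 1: a "perfect" fit of noise
# (R may warn that the fit is essentially perfect; that is exactly the problem.)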
Why penalization works
Penalized regression modifies the loss function:
Ridge (L2) shrinks coefficients toward zero
Lasso (L1) shrinks and performs variable selection
Elastic net interpolates between ridge and lasso
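Written out in the standard elastic net form (the notation below is introduced here for reference and is not defined elsewhere in this tutorial), the penalized least-squares objective is

\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^{\top} \beta \right)^2 + \lambda \left[ \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right]

where \lambda controls the overall penalty strength and \alpha moves between ridge (\alpha = 0) and the lasso (\alpha = 1).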
Penalization reduces variance at the cost of bias.
This trade-off is not optional in high dimensions.
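As a quick illustration of that trade-off, here is a minimal sketch using glmnet directly (the toy data, penalty values, and the choice of glmnet rather than the fastml interface are assumptions made purely for illustration): a growing penalty shrinks ridge coefficients smoothly, while the lasso drives many of them exactly to zero.

library(glmnet)

set.seed(1)
x_toy <- matrix(rnorm(100 * 20), 100, 20)
y_toy <- drop(x_toy %*% c(2, -2, rep(0, 18)) + rnorm(100))   # only 2 true signals

ridge_fit <- glmnet(x_toy, y_toy, alpha = 0)   # L2 penalty: shrinks, never zeroes
lasso_fit <- glmnet(x_toy, y_toy, alpha = 1)   # L1 penalty: shrinks and selects

# Ridge coefficients shrink toward zero as lambda grows, but stay nonzero
cbind(weak = coef(ridge_fit, s = 0.01)[2:5], strong = coef(ridge_fit, s = 10)[2:5])

# The lasso sets most coefficients exactly to zero at a moderate penalty
sum(coef(lasso_fit, s = 0.5)[-1] == 0)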
Recreating a high-dimensional example
We simulate a setting with many predictors and limited observations.
library(fastml)
library(dplyr)

set.seed(123)

n <- 120
p <- 110
k <- 12
rho <- 0.85
noise_sd <- 2

# Correlated predictors:
Z <- rnorm(n)
X <- sapply(seq_len(p), function(j) rho * Z + sqrt(1 - rho^2) * rnorm(n))
X <- as.data.frame(X)
colnames(X) <- paste0("x", seq_len(p))

# Sparse true effects on first k predictors
beta <- c(rep(3, k), rep(0, p - k))

# Outcome
y <- as.numeric(as.matrix(X) %*% beta + rnorm(n, sd = noise_sd))

hd_data <- bind_cols(tibble(y = y), X)
head(hd_data)
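To see the failure mode described above on this simulated data (a sketch for illustration only; the unpenalized and glmnet fits below are not part of the tutorial's fastml workflow), compare an ordinary least-squares fit with a cross-validated lasso:

# Unpenalized fit: 111 parameters for 120 observations
ols_fit <- lm(y ~ ., data = hd_data)
summary(ols_fit)$r.squared                       # close to 1: inflated apparent fit

# Cross-validated lasso on the same data
library(glmnet)
cv_fit <- cv.glmnet(as.matrix(hd_data[, -1]), hd_data$y, alpha = 1)
sum(coef(cv_fit, s = "lambda.min")[-1] != 0)     # number of predictors the lasso keeps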