08. Penalized Regression and High-Dimensional Data

Motivation

Classical regression breaks down when the number of predictors approaches or exceeds the number of observations.

This is not a software limitation.
It is a statistical one.

High-dimensional settings require explicit regularization, careful resampling, and disciplined interpretation. This tutorial explains why penalized regression exists, how it behaves under guarded resampling, and what fastml does to prevent common failure modes.

The high-dimensional regime

A dataset is effectively high-dimensional when:

  • the number of predictors is large relative to sample size
  • predictors are correlated
  • signal is weak relative to noise

In this regime, unpenalized regression produces:

  • unstable coefficients
  • inflated apparent performance
  • extreme sensitivity to data splits

Perfect in-sample fit is expected — and meaningless.
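
To see this concretely, here is a minimal sketch using plain lm on simulated pure noise (a separate toy dataset, not the tutorial dataset built below): with p close to n, ordinary least squares reaches a near-perfect in-sample fit even though no predictor carries any signal.

set.seed(1)
n_demo <- 60
p_demo <- 55

noise <- as.data.frame(matrix(rnorm(n_demo * p_demo), n_demo, p_demo))
noise$y <- rnorm(n_demo)        # outcome is pure noise

fit <- lm(y ~ ., data = noise)
summary(fit)$r.squared          # typically above 0.9 despite zero true signal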

Why penalization works

Penalized regression modifies the loss function:

  • Ridge (L2) shrinks coefficients toward zero
  • Lasso (L1) shrinks and performs variable selection
  • Elastic net interpolates between ridge and lasso

Penalization reduces variance at the cost of bias.
This trade-off is not optional in high dimensions.
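
As a standalone illustration (assuming the glmnet package, which fastml uses as its elastic net engine, is installed): all three penalties minimize the residual sum of squares plus a penalty term, and glmnet's alpha argument mixes the L1 and L2 components.

library(glmnet)

x_demo <- matrix(rnorm(100 * 20), 100, 20)
y_demo <- rnorm(100)

ridge <- glmnet(x_demo, y_demo, alpha = 0)    # L2 only: shrinks all coefficients
lasso <- glmnet(x_demo, y_demo, alpha = 1)    # L1 only: shrinks and zeroes some out
enet  <- glmnet(x_demo, y_demo, alpha = 0.5)  # elastic net: a mixture of both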

Recreating a high-dimensional example

We simulate a setting with many predictors and limited observations.

library(fastml)
library(dplyr)

set.seed(123)

n  <- 120      # observations
p  <- 110      # predictors: p close to n
k  <- 12       # predictors carrying true signal
rho <- 0.85    # each predictor's correlation with the shared factor Z
noise_sd <- 2  # residual noise standard deviation

# Correlated predictors: every column loads on the shared latent factor Z
Z <- rnorm(n)
X <- sapply(seq_len(p), function(j) rho * Z + sqrt(1 - rho^2) * rnorm(n))
X <- as.data.frame(X)
colnames(X) <- paste0("x", seq_len(p))

# Sparse true effects on first k predictors
beta <- c(rep(3, k), rep(0, p - k))

# Outcome
y <- as.numeric(as.matrix(X) %*% beta + rnorm(n, sd = noise_sd))

hd_data <- bind_cols(tibble(y = y), X)

head(hd_data)
# A tibble: 6 × 111
       y      x1      x2     x3      x4      x5     x6      x7      x8      x9
   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
1 -14.4  -0.414  -0.892  -0.585 -0.466   0.0894 -1.03  -0.876  -0.0633  0.527 
2  -5.28 -0.695  -0.460   0.147 -0.0302 -0.210  -1.11   1.22    1.02   -0.0204
3  55.8   1.07    2.11    1.47   2.02    1.31    1.66   1.08    1.41    1.45  
4  -1.67 -0.0750 -0.539   0.600  0.124  -0.739  -0.746  0.0938  0.0846 -0.831 
5  11.5   1.08    0.0156  0.541  0.485   0.526   0.111  0.452   0.161   0.457 
6  54.0   1.11    2.46    1.35   1.87    1.35    1.59   1.44    1.49    0.919 
# ℹ 101 more variables: x10 <dbl>, x11 <dbl>, x12 <dbl>, x13 <dbl>, x14 <dbl>,
#   x15 <dbl>, x16 <dbl>, x17 <dbl>, x18 <dbl>, x19 <dbl>, x20 <dbl>,
#   x21 <dbl>, x22 <dbl>, x23 <dbl>, x24 <dbl>, x25 <dbl>, x26 <dbl>,
#   x27 <dbl>, x28 <dbl>, x29 <dbl>, x30 <dbl>, x31 <dbl>, x32 <dbl>,
#   x33 <dbl>, x34 <dbl>, x35 <dbl>, x36 <dbl>, x37 <dbl>, x38 <dbl>,
#   x39 <dbl>, x40 <dbl>, x41 <dbl>, x42 <dbl>, x43 <dbl>, x44 <dbl>,
#   x45 <dbl>, x46 <dbl>, x47 <dbl>, x48 <dbl>, x49 <dbl>, x50 <dbl>, …

Only the first k = 12 predictors carry signal. The rest are pure noise.

What unpenalized regression does

lm_fit <- fastml(
  data       = hd_data,
  label      = "y",
  algorithms = "linear_reg"
)

summary(lm_fit)

===== fastml Model Summary =====
Task: regression 
Number of Models Trained: 1 
Best Model(s): linear_reg (lm) (rmse: 12.8543828) 

Performance Metrics (Sorted by rmse):

---------------------------------------------- 
Model        Engine  RMSE    R-squared  MAE    
---------------------------------------------- 
linear_reg*  lm      12.854  0.778      10.516 
---------------------------------------------- 
(*Best model)

Best Model hyperparameters:

Model: linear_reg (lm) 
  penalty: 
  mixture: 

This model fits all 110 predictors at once.

Its apparent performance may look acceptable, but with p nearly equal to n and highly correlated predictors, the design matrix is near-singular: coefficient estimates are unstable and uninterpretable.
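
The instability is easy to demonstrate (a hedged sketch with plain lm on hd_data, not a fastml feature): refit the model on two random halves of the data and compare the estimates for the same predictor.

idx <- sample(seq_len(n), n / 2)
fit_a <- lm(y ~ ., data = hd_data[idx, ])
fit_b <- lm(y ~ ., data = hd_data[-idx, ])

# With 60 rows and 110 predictors, lm cannot even estimate every
# coefficient; those it does estimate swing wildly between halves
cbind(half_a = coef(fit_a)["x1"], half_b = coef(fit_b)["x1"])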

Penalized regression with elastic net

pen_fit <- fastml(
  data       = hd_data,
  label      = "y",
  algorithms = "elastic_net"
)

summary(pen_fit)

===== fastml Model Summary =====
Task: regression 
Number of Models Trained: 1 
Best Model(s): elastic_net (glmnet) (rmse: 3.3240705) 

Performance Metrics (Sorted by rmse):

--------------------------------------------- 
Model         Engine  RMSE   R-squared  MAE   
--------------------------------------------- 
elastic_net*  glmnet  3.324  0.987      2.720 
--------------------------------------------- 
(*Best model)

Best Model hyperparameters:

Model: elastic_net (glmnet) 
  penalty: 0.01
  mixture: 0.5

Here penalty is the regularization strength (glmnet's lambda) and mixture = 0.5 is an equal blend of the L1 and L2 penalties (glmnet's alpha).

Elastic net provides:

  • shrinkage (the ridge component),
  • implicit feature selection (the lasso component),
  • and, within fastml, regularization-aware resampling.

This is not optional tuning.
It is required for validity.

Why Tuning Must Be Guarded

The penalty strength determines how aggressively coefficients are shrunk.

If tuning is performed outside resampling:

  • test data influence penalty selection,
  • performance is inflated,
  • variable selection becomes meaningless.

fastml enforces tuning within resampling folds, preventing this failure mode.
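
To make "tuning within resampling" concrete, here is a minimal hand-rolled version of the guarded pattern using glmnet directly (an illustration of the principle, not fastml's internal code): on every fold, the penalty is chosen with cv.glmnet on the training portion only, and the held-out rows are used solely for evaluation.

library(glmnet)

x <- as.matrix(hd_data[, -1])   # predictors
y <- hd_data$y

folds <- sample(rep(1:5, length.out = nrow(hd_data)))

fold_rmse <- sapply(1:5, function(f) {
  train <- folds != f
  # penalty selected using the training rows only
  cv   <- cv.glmnet(x[train, ], y[train], alpha = 0.5)
  pred <- predict(cv, x[!train, ], s = "lambda.min")
  sqrt(mean((y[!train] - pred)^2))
})
fold_rmse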

Coefficient Interpretation Is Conditional

In penalized models:

  • coefficient magnitude depends on penalty strength,
  • selection is resampling-dependent,
  • zero coefficients do not imply irrelevance.

Penalized regression identifies predictive structure, not causal effects.

This distinction is routinely violated in applied research.
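
For instance, reusing x and y from the sketch above: the same fitted glmnet path yields different coefficient magnitudes, and a different set of nonzero coefficients, depending on the penalty at which it is read.

path <- glmnet(x, y, alpha = 0.5)

# Same model object, two penalty strengths: both the magnitudes and
# the selected set change with the penalty
sum(coef(path, s = 0.01) != 0)   # many nonzero coefficients
sum(coef(path, s = 2.00) != 0)   # typically far fewer survive a heavier penalty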

Stability Matters More Than Sparsity

Variable selection that changes dramatically across folds is unreliable.

fastml exposes fold-level results so instability can be detected rather than hidden.

Sparse models are not inherently superior.
Stable models are.
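
One simple stability check, sketched here with glmnet and bootstrap resamples (fastml's fold-level output supports the same kind of assessment): count how often each predictor is selected across refits.

sel <- rowSums(sapply(1:20, function(i) {
  idx <- sample(nrow(x), replace = TRUE)               # bootstrap resample
  cv  <- cv.glmnet(x[idx, ], y[idx], alpha = 0.5)
  as.numeric(coef(cv, s = "lambda.min")[-1, 1] != 0)   # 1 if selected, intercept dropped
}))
names(sel) <- colnames(x)
head(sort(sel, decreasing = TRUE), 15)   # stable predictors are selected in most refits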

High-Dimensional Leakage Risks

High-dimensional workflows amplify leakage through:

  • feature screening on full data,
  • global scaling,
  • outcome-informed transformations,
  • post-hoc variable selection.

Each of these produces dramatic but invalid performance gains.

Guarded resampling is essential, not optional.
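
The scaling case illustrates the pattern (a schematic sketch of the guarded behavior fastml is designed to enforce, reusing x and folds from the earlier sketch): preprocessing statistics must be estimated on the training rows and only applied to the held-out rows.

# Leaky: centering/scaling statistics see the held-out rows
x_leaky <- scale(x)

# Guarded: statistics come from the training rows only
train   <- folds != 1
mu      <- colMeans(x[train, ])
sigma   <- apply(x[train, ], 2, sd)
x_train <- scale(x[train, ],  center = mu, scale = sigma)
x_test  <- scale(x[!train, ], center = mu, scale = sigma)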

What fastml Deliberately Does Not Allow

fastml does not allow:

  • global feature selection before resampling,
  • leakage-prone tuning shortcuts,
  • silent reuse of preprocessing statistics,
  • single-split performance reporting.

These are design constraints, not missing features.

Responsible Reporting in High Dimensions

A defensible report should include:

  • sample size and number of predictors,
  • penalty type and tuning strategy,
  • resampling scheme,
  • stability assessment,
  • explicit limits on interpretation.

Claims of “biomarkers” or “important variables” require evidence beyond penalized regression output.

Summary

  • High-dimensional regression is fundamentally different from classical regression.
  • Penalization is mandatory, not optional.
  • Penalization through elastic net, ridge, or lasso balances sparsity and stability.
  • Guarded tuning prevents catastrophic leakage.

Interpretation must be predictive, not causal.