04. Missing Data and Guarded Imputation

Motivation

In the earlier tutorials, we restricted the analysis to complete cases using na.omit().

This choice was intentional and explicitly justified as a pedagogical simplification.

However, complete-case analysis is rarely appropriate in applied biomedical settings.
Missing data are the rule, not the exception.

This tutorial shows how missing data can be handled under guarded resampling, without re-introducing leakage.


Why missing data are dangerous for evaluation

Imputation is a data-dependent operation.

If imputation parameters (means, medians, model-based estimates) are learned using the full dataset before resampling, then information from assessment folds leaks into training folds.

This is a textbook example of the timing problem described in C1. What Is Data Leakage.

The solution is not “better imputation”, but correct placement of imputation inside resampling.
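
To make the timing problem concrete, here is a minimal base R sketch that does not use fastml. The data are simulated and the names (x, train_idx, leaky_median, guarded_median) are illustrative assumptions. It contrasts a median learned from all rows with a median learned from the training rows only.

```r
set.seed(1)

# Simulated numeric predictor with some missing values (illustrative data only)
x <- rnorm(100, mean = 50, sd = 10)
x[sample(100, 15)] <- NA

# A single hold-out split standing in for one resampling fold
train_idx <- sample(100, 80)
x_train <- x[train_idx]
x_test  <- x[-train_idx]

# Leaky: the median is computed on all rows, including assessment rows
leaky_median <- median(x, na.rm = TRUE)

# Guarded: the median is computed on training rows only
guarded_median <- median(x_train, na.rm = TRUE)

# The same training-derived value fills the gaps in both partitions
x_train_imputed <- ifelse(is.na(x_train), guarded_median, x_train)
x_test_imputed  <- ifelse(is.na(x_test),  guarded_median, x_test)

c(leaky = leaky_median, guarded = guarded_median)
```

The leaky value depends on the assessment rows, so any model trained on data imputed with it has indirectly seen those rows. The guarded value does not.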


The key constraint

Under guarded resampling (C3):

  • imputation must be learned within each resampling split
  • imputation parameters must never be shared across folds
  • assessment data must not influence imputation

fastml enforces this automatically.


Data with missing values

We return to the original Breast Cancer dataset without removing incomplete rows.

library(fastml)
library(mlbench)
library(dplyr)

data(BreastCancer)

breastCancer <- BreastCancer %>%
  select(-"Id") %>%
  mutate(Class = factor(Class, levels = c("benign", "malignant")))

head(breastCancer)
  Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
1            5         1          1             1            2           1
2            5         4          4             5            7          10
3            3         1          1             1            2           2
4            6         8          8             1            3           4
5            4         1          1             3            2           1
6            8        10         10             8            7          10
  Bl.cromatin Normal.nucleoli Mitoses     Class
1           3               1       1    benign
2           3               2       1    benign
3           3               1       1    benign
4           3               7       1    benign
5           3               1       1    benign
6           9               7       1 malignant

At this stage, the data still contain missing values. In mlbench's BreastCancer they are confined to a single predictor, Bare.nuclei, which has 16 missing entries.
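
Before choosing an imputation strategy, it helps to check where the missing values are. The following sketch uses only base R functions on the raw mlbench data; the object name na_per_column is an illustrative assumption.

```r
library(mlbench)
data(BreastCancer)

# Number of missing values in each column
na_per_column <- colSums(is.na(BreastCancer))

# Show only the columns that contain missing values
na_per_column[na_per_column > 0]

# Number of rows that na.omit() would have dropped
sum(!complete.cases(BreastCancer))
```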

Specifying guarded imputation

Imputation is specified declaratively.

Here we use median imputation for numeric predictors.

fit <- fastml(
  data          = breastCancer,
  label         = "Class",
  algorithms    = c("rand_forest", "xgboost"),
  impute_method = "medianImpute"
)

No imputation is performed on the full dataset ahead of resampling.

The imputation strategy is recorded and applied inside each resampling split.

What happens under the hood (conceptual)

For each resampling fold:

  • training data are isolated,
  • imputation parameters are estimated using the training data only,
  • the same parameters are applied to the corresponding assessment data,
  • the model is trained and evaluated within that fold.

Under the guarded resampling execution path, assessment observations are not used when estimating imputation parameters.

This behavior does not depend on the user following a convention.
It is implemented structurally within the guarded resampling workflow.
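
The per-fold steps listed above can be sketched in base R as follows. This is a conceptual illustration under stated assumptions, not fastml's internal code: the data are simulated, and the names (dat, fold_id, fold_medians, imputed_assessment) are hypothetical.

```r
set.seed(42)

# Simulated data: one numeric predictor with missing values (illustrative only)
n <- 200
dat <- data.frame(x = rnorm(n), y = rbinom(n, 1, 0.5))
dat$x[sample(n, 30)] <- NA

# Assign each row to one of k folds
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_medians <- numeric(k)
imputed_assessment <- vector("list", k)

for (i in seq_len(k)) {
  # Isolate the training and assessment rows for this fold
  train  <- dat[fold_id != i, ]
  assess <- dat[fold_id == i, ]

  # Estimate the imputation parameter from the training rows only
  fold_medians[i] <- median(train$x, na.rm = TRUE)

  # Apply that same parameter to both partitions
  train$x[is.na(train$x)]   <- fold_medians[i]
  assess$x[is.na(assess$x)] <- fold_medians[i]

  # A model would be fitted on `train` and scored on `assess` here
  imputed_assessment[[i]] <- assess
}

fold_medians
```

Each fold ends up with its own median, and no assessment row contributes to the value used to impute it.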

Examining performance

fit$resampling_results$`rand_forest (ranger)`$aggregated
# A tibble: 7 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.959
2 f_meas    binary         0.968
3 kap       binary         0.911
4 precision binary         0.981
5 roc_auc   binary         0.989
6 sens      binary         0.956
7 spec      binary         0.963

These metrics are now based on:

  • incomplete data
  • fold-specific imputation
  • leakage-safe evaluation

Performance differences relative to complete-case analysis are expected and interpretable.

Fold-level behavior

fit$resampling_results$`rand_forest (ranger)`$folds
# A tibble: 70 × 4
   fold  .metric   .estimator .estimate
   <chr> <chr>     <chr>          <dbl>
 1 1     accuracy  binary         0.965
 2 1     kap       binary         0.925
 3 1     sens      binary         0.946
 4 1     spec      binary         1    
 5 1     precision binary         1    
 6 1     f_meas    binary         0.972
 7 1     roc_auc   binary         0.989
 8 2     accuracy  binary         0.912
 9 2     kap       binary         0.818
10 2     sens      binary         0.865
# ℹ 60 more rows

Missingness can increase variability across resampling folds.

This variability is expected under resampling-based evaluation and reflects uncertainty induced by incomplete data and finite sample size, rather than a failure of the modeling procedure.
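
One way to quantify this variability is to summarise each metric across folds. The helper below, summarise_folds, is a hypothetical function written for this illustration and is not part of fastml. It assumes a data frame with the columns fold, .metric, and .estimate shown above, and the toy input uses made-up values only to demonstrate the shape.

```r
# Hypothetical helper: mean and standard deviation of each metric across folds.
# `folds` is assumed to have the columns `fold`, `.metric`, and `.estimate`.
summarise_folds <- function(folds) {
  m <- tapply(folds$.estimate, folds$.metric, mean)
  s <- tapply(folds$.estimate, folds$.metric, sd)
  data.frame(.metric = names(m), mean = as.numeric(m), sd = as.numeric(s))
}

# Toy input with made-up values, only to show the expected structure
toy <- data.frame(
  fold      = rep(c("1", "2", "3"), each = 2),
  .metric   = rep(c("accuracy", "roc_auc"), times = 3),
  .estimate = c(0.90, 0.95, 0.80, 0.85, 1.00, 0.90)
)

res <- summarise_folds(toy)
res
```

Assuming the fold-level results have the structure printed above, the same helper could be applied to fit$resampling_results$`rand_forest (ranger)`$folds.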

What fastml does not allow here

Consistent with C4. What fastml Deliberately Does Not Allow, users cannot:

  • impute the full dataset prior to resampling,
  • reuse imputation parameters across folds,
  • mix imputation strategies across models,
  • detach imputation from evaluation.

Any such workflow would invalidate the evaluation.

What this tutorial does not cover

This tutorial intentionally does not cover:

  • multiple imputation (MICE),
  • missing-not-at-random (MNAR) modeling,
  • sensitivity analyses,
  • causal assumptions about missingness.

These require additional assumptions and belong to advanced workflows.

Responsible interpretation

When missing data are present:

  • performance estimates typically decrease,
  • variability typically increases,
  • small model differences become less meaningful.

These are not failures of the model or the framework.
They are consequences of reduced information.

Summary

  • Missing data handling is a common source of leakage.
  • Imputation must be learned within resampling splits.
  • fastml enforces guarded imputation by design.
  • Increased uncertainty is expected and informative.
  • Complete-case analysis is a convenience, not a default.

What comes next

05. Survival Analysis and Time-to-Event Outcomes
Survival outcomes and time-to-event evaluation under guarded resampling.