Motivation
In the earlier tutorials, we restricted the analysis to complete cases using na.omit().
This choice was intentional and explicitly justified as a pedagogical simplification.
However, complete-case analysis is rarely appropriate in applied biomedical settings.
Missing data are the rule, not the exception.
This tutorial shows how missing data can be handled under guarded resampling, without re-introducing leakage.
Why missing data are dangerous for evaluation
Imputation is a data-dependent operation.
If imputation parameters (means, medians, model-based estimates) are learned using the full dataset before resampling, then information from assessment folds leaks into training folds.
This is a textbook example of the timing problem described in C1. What Is Data Leakage.
The solution is not “better imputation”, but correct placement of imputation inside resampling.
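The timing problem can be illustrated with a minimal base R sketch. It does not use fastml; the toy values, the split indices, and the helper `impute()` are all made up for illustration.

```r
# Toy numeric predictor with missing values (made-up values).
x <- c(1, 2, NA, 4, 100, NA, 3, NA, 5, 7)

# Hypothetical split: rows 1-6 are training rows, rows 7-10 are assessment rows.
train_idx  <- 1:6
assess_idx <- 7:10

# Leaky: the median is learned from all rows, including assessment rows.
median_full <- median(x, na.rm = TRUE)

# Guarded: the median is learned from training rows only.
median_train <- median(x[train_idx], na.rm = TRUE)

# Hypothetical helper: replace missing entries with a fixed value.
impute <- function(v, value) {
  v[is.na(v)] <- value
  v
}

# The training-only median is applied to both partitions.
x_train  <- impute(x[train_idx], median_train)
x_assess <- impute(x[assess_idx], median_train)

c(full = median_full, train_only = median_train)
```

The two medians differ (4 versus 3 here), which is exactly the information that would flow from assessment rows into training if the parameter were estimated before splitting.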
The key constraint
Under guarded resampling (C3):
imputation must be learned within each resampling split
imputation parameters must never be shared across folds
assessment data must not influence imputation
fastml enforces this automatically.
Data with missing values
We return to the original Breast Cancer dataset without removing incomplete rows.
library(fastml)
library(mlbench)
library(dplyr)
data(BreastCancer)
breastCancer <- BreastCancer %>%
  select(-"Id") %>%
  mutate(Class = factor(Class, levels = c("benign", "malignant")))
head (breastCancer)
Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
1 5 1 1 1 2 1
2 5 4 4 5 7 10
3 3 1 1 1 2 2
4 6 8 8 1 3 4
5 4 1 1 3 2 1
6 8 10 10 8 7 10
Bl.cromatin Normal.nucleoli Mitoses Class
1 3 1 1 benign
2 3 2 1 benign
3 3 1 1 benign
4 3 7 1 benign
5 3 1 1 benign
6 9 7 1 malignant
At this stage, the data still contain missing values. In the mlbench BreastCancer data they are confined to the Bare.nuclei predictor (16 missing entries).
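Before choosing an imputation strategy, it can help to check where the missing values sit. The sketch below uses base R only; `count_missing()` is a hypothetical helper name and `toy` is a made-up stand-in for the real data.

```r
# Hypothetical helper: number of missing values per column.
count_missing <- function(df) colSums(is.na(df))

# Toy data frame standing in for the real data (made-up values).
toy <- data.frame(
  a = c(1, NA, 3),
  b = factor(c("x", "y", NA)),
  c = c(TRUE, FALSE, TRUE)
)

count_missing(toy)
```

Calling `count_missing(breastCancer)` on the data prepared above would give the same kind of per-column summary.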
Specifying guarded imputation
Imputation is specified declaratively.
Here we use median imputation for numeric predictors.
fit <- fastml(
  data = breastCancer,
  label = "Class",
  algorithms = c("rand_forest", "xgboost"),
  impute_method = "medianImpute"
)
No imputation is performed at this point.
The imputation strategy is recorded and applied inside each resampling split.
What Happens Under the Hood (Conceptual)
For each resampling fold:
training data are isolated,
imputation parameters are estimated using the training data only,
the same parameters are applied to the corresponding assessment data,
the model is trained and evaluated within that fold.
Under the guarded resampling execution path, assessment observations are not used when estimating imputation parameters.
This behavior is not a convention assumed of the user.
It is implemented structurally within the guarded resampling workflow.
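The per-fold steps listed above can be sketched in base R. This is a conceptual illustration under simplifying assumptions (one numeric predictor, median imputation, random fold assignment), not fastml's internal code; all object names are hypothetical.

```r
set.seed(1)  # arbitrary seed so the fold assignment is reproducible

# Toy data: one numeric predictor with missing values (made-up values).
df <- data.frame(x = c(2, NA, 5, 7, NA, 1, 9, 4, NA, 6, 3, 8))

# Assign each row to one of k folds at random.
k <- 3
fold_id <- sample(rep(seq_len(k), length.out = nrow(df)))

fold_medians <- numeric(k)
imputed_assessment <- vector("list", k)

for (i in seq_len(k)) {
  # 1. Isolate the training and assessment rows for this fold.
  train  <- df[fold_id != i, , drop = FALSE]
  assess <- df[fold_id == i, , drop = FALSE]

  # 2. Estimate the imputation parameter from training rows only.
  med <- median(train$x, na.rm = TRUE)
  fold_medians[i] <- med

  # 3. Apply the same parameter to both partitions.
  train$x[is.na(train$x)]   <- med
  assess$x[is.na(assess$x)] <- med
  imputed_assessment[[i]] <- assess

  # 4. A model would be fit on `train` and evaluated on `assess` here.
}

fold_medians
```

Each fold carries its own median, so no imputation parameter is shared across folds and no assessment row contributes to the value used to fill it.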
Fold-level behavior
fit$resampling_results$`rand_forest (ranger)`$folds
# A tibble: 70 × 4
fold .metric .estimator .estimate
<chr> <chr> <chr> <dbl>
1 1 accuracy binary 0.965
2 1 kap binary 0.925
3 1 sens binary 0.946
4 1 spec binary 1
5 1 precision binary 1
6 1 f_meas binary 0.972
7 1 roc_auc binary 0.989
8 2 accuracy binary 0.912
9 2 kap binary 0.818
10 2 sens binary 0.865
# ℹ 60 more rows
Missingness can increase variability across resampling folds.
This variability is expected under resampling-based evaluation and reflects uncertainty induced by incomplete data and finite sample size, rather than a failure of the modeling procedure.
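One way to inspect this variability is to summarise each metric across folds. The sketch below rebuilds only the accuracy, kap, and sens rows for folds 1 and 2 from the output shown above, so the numbers illustrate the computation rather than the full 10-fold result; `summary_tbl` is a hypothetical name.

```r
# Accuracy, kap, and sens for folds 1 and 2, copied from the output above.
folds <- data.frame(
  fold      = c("1", "1", "1", "2", "2", "2"),
  .metric   = c("accuracy", "kap", "sens", "accuracy", "kap", "sens"),
  .estimate = c(0.965, 0.925, 0.946, 0.912, 0.818, 0.865)
)

# Mean and standard deviation of each metric across folds.
fold_mean <- tapply(folds$.estimate, folds$.metric, mean)
fold_sd   <- tapply(folds$.estimate, folds$.metric, sd)

summary_tbl <- data.frame(
  metric = names(fold_mean),
  mean   = as.numeric(fold_mean),
  sd     = as.numeric(fold_sd)
)
summary_tbl
```

The same two `tapply()` calls should work on the complete folds table, assuming it has the `.metric` and `.estimate` columns shown in the printed tibble.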
What fastml Does Not Allow Here
Consistent with C4. What fastml Deliberately Does Not Allow, users cannot:
impute the full dataset prior to resampling,
reuse imputation parameters across folds,
mix imputation strategies across models,
detach imputation from evaluation.
Any such workflow would invalidate the evaluation.
What This Tutorial Does Not Cover
This tutorial intentionally does not cover:
multiple imputation (for example, MICE),
missing-not-at-random (MNAR) modeling,
sensitivity analyses,
causal assumptions about missingness.
These require additional assumptions and belong to advanced workflows.
Responsible Interpretation
When missing data are present:
performance estimates typically decrease,
variability typically increases,
small model differences become less meaningful.
These are not failures of the model or the framework.
They are consequences of reduced information.
Summary
Missing data handling is a common source of leakage.
Imputation must be learned within resampling splits.
fastml enforces guarded imputation by design.
Increased uncertainty is expected and informative.
Complete-case analysis is a convenience, not a default.