03. Metrics, Variability, and Uncertainty
Motivation
Machine learning results are often summarized by a single number.
This practice is convenient, but it is rarely sufficient.
Performance metrics are random variables, not fixed properties of a model.
They depend on how data are split, how models are trained, and how evaluation is conducted.
As shown in Tutorials 01–02, fastml enforces valid evaluation.
This tutorial focuses on how to interpret the resulting metrics responsibly.
Recreating the analysis
Each tutorial in this series is designed to be fully self-contained.
We therefore recreate the analysis from Tutorials 01–02 before examining metric behavior.

library(fastml)
library(mlbench)
library(dplyr)

data(BreastCancer)

breastCancer <- BreastCancer %>%
  select(-Id) %>%
  na.omit() %>%
  mutate(Class = factor(Class, levels = c("benign", "malignant")))

fit <- fastml(
  data = breastCancer,
  label = "Class",
  algorithms = c("rand_forest", "xgboost")
)
Metrics Are Estimates, Not Truths
A performance metric such as accuracy or ROC AUC is an estimate of out-of-sample performance rather than a fixed quantity.
Its value depends on:
- the resampling scheme,
- the specific data splits,
- random elements in model fitting.
Reporting only a single point estimate obscures this variability.
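To make this concrete, the sketch below (an illustration only, not a fastml feature) fits one simple logistic regression to ten different random 80/20 splits of the same data. The model and the split proportion are arbitrary choices made for the example.

num <- breastCancer
pred_cols <- setdiff(names(num), "Class")
# BreastCancer predictors are factors with numeric labels; convert them so a
# plain glm can be used for this illustration.
num[pred_cols] <- lapply(num[pred_cols], function(x) as.numeric(as.character(x)))

set.seed(123)
acc <- vapply(1:10, function(i) {
  idx  <- sample(nrow(num), size = floor(0.8 * nrow(num)))
  mod  <- glm(Class ~ ., data = num[idx, ], family = binomial)
  prob <- predict(mod, newdata = num[-idx, ], type = "response")
  mean(ifelse(prob > 0.5, "malignant", "benign") == num$Class[-idx])
}, numeric(1))
round(range(acc), 3)

The spread of these ten values is exactly the variability that a single reported number hides.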
Aggregated Metrics
fastml reports metrics aggregated across resampling folds.
fit$resampling_results$`rand_forest (ranger)`$aggregated
# A tibble: 7 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.962
2 f_meas binary 0.970
3 kap binary 0.916
4 precision binary 0.975
5 roc_auc binary 0.991
6 sens binary 0.966
7 spec binary 0.953
These values summarize performance under guarded cross-validation.
They answer the question:
What is the average performance of this model under this evaluation design?
They do not answer:
How stable is this performance?
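For reference, individual rows of the aggregated tibble can be extracted with ordinary dplyr verbs, for example the cross-validated ROC AUC:

fit$resampling_results$`rand_forest (ranger)`$aggregated %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)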
Fold-Level Variability
To assess performance stability, fold-level results must be examined.
fit$resampling_results$`rand_forest (ranger)`$folds
# A tibble: 70 × 4
fold .metric .estimator .estimate
<chr> <chr> <chr> <dbl>
1 1 accuracy binary 0.929
2 1 kap binary 0.844
3 1 sens binary 0.944
4 1 spec binary 0.9
5 1 precision binary 0.944
6 1 f_meas binary 0.944
7 1 roc_auc binary 0.968
8 2 accuracy binary 0.964
9 2 kap binary 0.922
10 2 sens binary 0.944
# ℹ 60 more rows
Fold-level metrics reveal:
- sensitivity to data partitioning,
- variability across assessment sets,
- overlap between competing models.
Large variability indicates that small differences in average performance are unlikely to be meaningful.
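A compact way to quantify this is to summarise the fold-level estimates per metric. If aggregation is a simple mean across folds (an assumption about fastml's internals, not stated above), the mean column will track the aggregated values shown earlier:

fit$resampling_results$`rand_forest (ranger)`$folds %>%
  group_by(.metric) %>%
  summarise(
    mean = mean(.estimate),
    sd   = sd(.estimate),
    min  = min(.estimate),
    max  = max(.estimate),
    .groups = "drop"
  )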
Variability and Model Comparison
In Tutorial 02, models were compared using shared resampling.
Even under fair comparison, it is common to observe:
- overlapping fold-level performance,
- different rankings across folds,
- instability in “best model” selection.
This is not a flaw.
It reflects finite-sample uncertainty.
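One way to see this is to compare the two models fold by fold on a single metric. The sketch below assumes the xgboost results are stored under the entry name `xgboost (xgboost)`, mirroring the `rand_forest (ranger)` entry shown above; check names(fit$resampling_results) for the actual name.

rf_acc  <- fit$resampling_results$`rand_forest (ranger)`$folds %>%
  filter(.metric == "accuracy") %>%
  select(fold, rf = .estimate)

xgb_acc <- fit$resampling_results$`xgboost (xgboost)`$folds %>%  # assumed entry name
  filter(.metric == "accuracy") %>%
  select(fold, xgb = .estimate)

inner_join(rf_acc, xgb_acc, by = "fold") %>%
  summarise(
    rf_wins  = sum(rf > xgb),
    xgb_wins = sum(xgb > rf),
    ties     = sum(rf == xgb)
  )

Counting wins per fold is descriptive only; it is not a hypothesis test.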
Beyond Point Estimates
Point estimates encourage overinterpretation.
Two models with accuracies of 0.82 and 0.84 may appear different, but without information on variability, this difference is not interpretable.
For this reason, fastml supports uncertainty-aware performance summaries.
Bootstrap-Based Metric Uncertainty
When enabled, fastml computes bootstrap distributions of evaluation metrics.
fit$performance$rand_forest$ranger
# A tibble: 7 × 6
.metric .estimator .estimate .lower .upper .n_boot
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 accuracy binary 0.978 0.949 1 500
2 kap binary 0.953 0.893 1 500
3 sens binary 0.966 0.920 1 500
4 spec binary 1 1 1 500
5 precision binary 1 1 1 500
6 f_meas binary 0.983 0.958 1 500
7 roc_auc binary 0.996 0.990 1 500
fit$performance$xgboost$xgboost
# A tibble: 7 × 6
.metric .estimator .estimate .lower .upper .n_boot
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 accuracy binary 0.978 0.949 1 500
2 kap binary 0.952 0.888 1 500
3 sens binary 0.978 0.942 1 500
4 spec binary 0.979 0.929 1 500
5 precision binary 0.989 0.963 1 500
6 f_meas binary 0.983 0.959 1 500
7 roc_auc binary 0.997 0.992 1 500
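Because both summaries share the same tibble layout, the two models' intervals for a primary metric can be placed side by side:

bind_rows(
  rand_forest = fit$performance$rand_forest$ranger,
  xgboost     = fit$performance$xgboost$xgboost,
  .id = "model"
) %>%
  filter(.metric == "accuracy") %>%
  select(model, .estimate, .lower, .upper)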
Bootstrapping provides:
- empirical uncertainty estimates,
- insight into metric dispersion,
- a basis for interval-based reporting.
These intervals describe evaluation uncertainty rather than population-level inference.
What fastml Does Not Claim
fastml does not:
- perform hypothesis testing between models,
- assign statistical significance to metric differences,
- provide p-values for model superiority.
Such procedures require additional assumptions and lie outside the scope of guarded resampling.
Responsible Reporting
A defensible performance report should include:
- the primary evaluation metric,
- the resampling scheme,
- aggregated performance estimates,
- fold-level variability or uncertainty intervals.
Single-number summaries are insufficient.
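As a sketch, the pieces shown in this tutorial can be assembled into a single reporting table for one model, combining the aggregated estimate, the fold-level standard deviation, and the bootstrap interval. Column names follow the tibbles shown above.

agg   <- fit$resampling_results$`rand_forest (ranger)`$aggregated
folds <- fit$resampling_results$`rand_forest (ranger)`$folds
boot  <- fit$performance$rand_forest$ranger

agg %>%
  select(.metric, cv_estimate = .estimate) %>%
  left_join(
    folds %>%
      group_by(.metric) %>%
      summarise(cv_sd = sd(.estimate), .groups = "drop"),
    by = ".metric"
  ) %>%
  left_join(
    boot %>% select(.metric, boot_lower = .lower, boot_upper = .upper),
    by = ".metric"
  )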
Summary
- Metrics are random variables.
- Point estimates hide variability.
- Fold-level results reveal stability and overlap.
- Bootstrap methods provide uncertainty estimates.
- fastml emphasizes defensible interpretation over ranking.
What comes next
04. Missing Data and Guarded Imputation
Handling missing data under guarded resampling.