02. Multiple Models and Fair Comparison

Motivation

Comparing machine learning models is a central goal of applied analysis.

However, many model comparisons are invalid—not because the models are wrong, but because they are evaluated under different data partitions, different preprocessing, or different sources of randomness.

As discussed in C3 — Guarded Resampling, fair comparison requires more than using the same metric.


What “fair comparison” means

A comparison is fair only if:

  • all models see the same resampling splits
  • preprocessing is learned independently within each split
  • metrics are computed on identical assessment sets
  • differences reflect models, not data partitioning

fastml enforces these conditions through its guarded resampling execution path, so models evaluated within a single fastml call are compared fairly.
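The second and third conditions are easiest to see in a manual loop. The sketch below is illustrative only (it is not fastml's internal code) and assumes a placeholder data frame dat with an outcome column Class; the point is that anything learned from the data is learned inside the fold, never on the full dataset.

# Illustrative sketch of within-fold preprocessing (not fastml internals).
# Assumes a placeholder data frame `dat` with an outcome column `Class`.
library(rsample)

folds <- vfold_cv(dat, v = 10, strata = Class)

for (i in seq_len(nrow(folds))) {
  split_i    <- folds$splits[[i]]
  analysis_d <- analysis(split_i)    # preprocessing and model fitting see only this part
  assess_d   <- assessment(split_i)  # reserved for evaluation
  # any scaling, encoding, or imputation would be estimated on analysis_d alone
  # and then applied, unchanged, to assess_d before computing metrics
}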


Data

We use a complete-case version of the BreastCancer dataset after removing the identifier column and excluding observations with missing values. This complete-case strategy is used for simplicity in this tutorial; in real applications, missingness should be handled explicitly (e.g., via guarded imputation) when it is nontrivial or informative.

library(fastml)
library(mlbench)
library(dplyr)

data(BreastCancer)
dim(BreastCancer)
[1] 699  11
breastCancer <- BreastCancer %>%
  select(-Id) %>%
  na.omit() %>%
  mutate(Class = factor(Class, levels = c("benign", "malignant")))
  
dim(breastCancer)
[1] 683  10
head(breastCancer)
  Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
1            5         1          1             1            2           1
2            5         4          4             5            7          10
3            3         1          1             1            2           2
4            6         8          8             1            3           4
5            4         1          1             3            2           1
6            8        10         10             8            7          10
  Bl.cromatin Normal.nucleoli Mitoses     Class
1           3               1       1    benign
2           3               2       1    benign
3           3               1       1    benign
4           3               7       1    benign
5           3               1       1    benign
6           9               7       1 malignant
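Because the resampling plan shown later stratifies on Class, it helps to check the class balance of the complete-case data first.

# class balance of the complete-case data; the resampling plan stratifies on Class
table(breastCancer$Class)
prop.table(table(breastCancer$Class))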

Defining multiple models

We compare three structurally different models:

  • Logistic regression (linear)
  • Random forest (tree ensemble)
  • Gradient boosted trees

No hyperparameter tuning is performed; default engine settings are used intentionally. This isolates differences due to model structure rather than tuning effort, which is essential for a fair baseline comparison.

fit <- fastml(
  data       = breastCancer,
  label      = "Class",
  algorithms = c("logistic_reg", "rand_forest", "xgboost")
)
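The call above draws its resampling splits using R's random number generator, so wrapping it with a seed (an optional variant, assuming fastml draws its resamples from the global RNG stream) makes the shared splits, and therefore the comparison, reproducible across runs.

# optional: fix the RNG so the shared resampling splits are reproducible
set.seed(2024)
fit <- fastml(
  data       = breastCancer,
  label      = "Class",
  algorithms = c("logistic_reg", "rand_forest", "xgboost")
)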

Shared resampling plan

All models are evaluated under a single, shared resampling plan.

fit$resampling_plan
$splits
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits           id    
   <list>           <chr> 
 1 <split [490/56]> Fold01
 2 <split [491/55]> Fold02
 3 <split [491/55]> Fold03
 4 <split [491/55]> Fold04
 5 <split [491/55]> Fold05
 6 <split [492/54]> Fold06
 7 <split [492/54]> Fold07
 8 <split [492/54]> Fold08
 9 <split [492/54]> Fold09
10 <split [492/54]> Fold10

$metadata
$metadata$method
[1] "cv"

$metadata$params
$metadata$params$v
[1] 10

$metadata$params$repeats
[1] 1

$metadata$params$strata
[1] "Class"



attr(,"class")
[1] "fastml_resample_plan" "list"                

This guarantees that:

  • every model is trained and evaluated on the same splits
  • differences in performance are attributable to the model, not the data
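Assuming the entries in fit$resampling_plan$splits are standard rsample split objects, as the printout suggests, the exact assessment rows of any fold can be pulled out and inspected; every model is scored on these same rows.

# inspect the assessment set of the first fold (assuming rsample split objects)
library(rsample)

first_split <- fit$resampling_plan$splits$splits[[1]]
nrow(assessment(first_split))   # 56 rows, matching <split [490/56]> for Fold01
table(assessment(first_split)$Class)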

Aggregated performance

We first examine performance aggregated across folds.

fit$resampling_results$`logistic_reg (glm)`$aggregated
# A tibble: 7 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.914
2 f_meas    binary         0.936
3 kap       binary         0.804
4 precision binary         0.907
5 roc_auc   binary         0.923
6 sens      binary         0.969
7 spec      binary         0.812
fit$resampling_results$`rand_forest (ranger)`$aggregated
# A tibble: 7 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.962
2 f_meas    binary         0.970
3 kap       binary         0.916
4 precision binary         0.975
5 roc_auc   binary         0.991
6 sens      binary         0.966
7 spec      binary         0.953
fit$resampling_results$`xgboost (xgboost)`$aggregated
# A tibble: 7 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.949
2 f_meas    binary         0.961
3 kap       binary         0.887
4 precision binary         0.957
5 roc_auc   binary         0.987
6 sens      binary         0.966
7 spec      binary         0.917

These summaries provide point estimates under guarded cross-validation.
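The three aggregated tibbles are easier to read side by side. The sketch below relies only on the list structure printed above (each element of fit$resampling_results containing an aggregated tibble) and binds them into one wide comparison table.

library(dplyr)
library(tidyr)
library(purrr)

# one row per metric, one column per model
aggregated_all <- fit$resampling_results %>%
  imap(~ mutate(.x$aggregated, model = .y)) %>%
  bind_rows()

aggregated_all %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = model, values_from = .estimate)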

Fold-level variability

Fair comparison also requires inspecting variability across folds.

fit$resampling_results$`rand_forest (ranger)`$folds
# A tibble: 70 × 4
   fold  .metric   .estimator .estimate
   <chr> <chr>     <chr>          <dbl>
 1 1     accuracy  binary         0.929
 2 1     kap       binary         0.844
 3 1     sens      binary         0.944
 4 1     spec      binary         0.9  
 5 1     precision binary         0.944
 6 1     f_meas    binary         0.944
 7 1     roc_auc   binary         0.968
 8 2     accuracy  binary         0.964
 9 2     kap       binary         0.922
10 2     sens      binary         0.944
# ℹ 60 more rows

Fold-level results reveal:

  • instability
  • sensitivity to data partitioning
  • overlap between models’ performance distributions

Single numbers are rarely sufficient.
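One way to quantify that variability is a per-model summary of the fold-level estimates. The sketch below, assuming each model's folds tibble has the columns shown above, computes the mean and standard deviation of accuracy across the ten folds.

library(dplyr)
library(purrr)

# spread of accuracy across folds, per model
fit$resampling_results %>%
  imap(~ mutate(.x$folds, model = .y)) %>%
  bind_rows() %>%
  filter(.metric == "accuracy") %>%
  group_by(model) %>%
  summarise(mean_acc = mean(.estimate),
            sd_acc   = sd(.estimate),
            .groups  = "drop")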

Best model selection

fastml identifies a “best” model based on the primary metric.

fit$best_model_name
rand_forest     xgboost 
   "ranger"   "xgboost" 

This selection is descriptive, not inferential.

It indicates which model performed best under the chosen metric, resampling scheme, and random seed, not which model is universally superior.
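The aggregated tables above make this concrete: under ROC AUC the random forest ranks first, while under sensitivity logistic regression is narrowly ahead (0.969 vs 0.966). Using aggregated_all from the earlier sketch, a small helper (hypothetical, for illustration only) re-ranks the models under any chosen metric.

# re-rank models under a chosen metric, using aggregated_all from the earlier sketch
rank_by <- function(metric) {
  aggregated_all %>%
    filter(.metric == metric) %>%
    arrange(desc(.estimate)) %>%
    select(model, .estimate)
}

rank_by("roc_auc")   # random forest first
rank_by("sens")      # logistic regression first, by a small margin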

Why this comparison is valid

This comparison is valid because:

  • preprocessing is guarded
  • resampling is shared
  • evaluation is inseparable from fitting
  • no model receives information unavailable to others

These properties hold regardless of model complexity.

What fastml does not allow here

Consistent with C4 — What fastml Deliberately Does Not Allow, within a single fastml benchmarking call users cannot:

  • evaluate models on different folds
  • reuse preprocessing across models
  • tune one model more aggressively than another
  • detach evaluation from resampling

These restrictions are necessary for fairness.

Summary

  • Fair model comparison requires shared resampling.
  • fastml enforces this structurally.
  • Differences in performance reflect models, not evaluation artifacts.
  • “Best model” selection is contextual and metric-dependent.

What comes next

03. Metrics, Variability, and Uncertainty
Fold variability, bootstrap metrics, and the over-interpretation of small differences.