01. Basic Classification

Before you start

This tutorial builds on the conceptual material covered in C1–C4.

Readers are expected to be familiar with these concepts before proceeding.

The workflow demonstrated here uses a deliberately constrained interface. These constraints are intentional and reflect the design philosophy of fastml: to reduce common sources of evaluation error by limiting user-facing degrees of freedom along the default execution path.

The problem we are solving

The goal of this tutorial is to estimate the out-of-sample performance of a binary classifier.

The focus is not on maximizing predictive accuracy, tuning hyperparameters, or examining model internals. Instead, the objective is narrowly defined:

What level of predictive performance can reasonably be expected on new, unseen data, given a fixed modeling specification?

In this setting, model fitting itself is straightforward. The primary challenge lies in performance evaluation.

Specifically, the difficulty lies in obtaining an estimate that is not biased by information leakage or other forms of contamination arising from reusing the same data for both training and evaluation.

Why this is harder than it sounds

From C1 and C2, recall the following points:

  • Leakage does not require obvious mistakes.
  • Pipelines that appear valid can still yield biased performance estimates.
  • Most machine learning frameworks allow such pipelines to run without warnings.

From C3, recall:

  • Correct evaluation requires resampling to enclose preprocessing and model fitting.
  • This enclosure must be structural, not procedural.

From C4, recall:

  • fastml enforces correctness by removing degrees of freedom rather than relying on user discipline.

This tutorial demonstrates what that enforcement looks like in practice.

The data

This tutorial uses two_class_dat, a simple binary classification dataset from the modeldata package.

The dataset is characterized by:

  • a binary outcome variable (Class),
  • two continuous predictors (A and B),
  • the absence of missing values.

This choice is intentional. The dataset is deliberately low-dimensional and clean, so that attention can remain on the evaluation procedure rather than on feature engineering, preprocessing decisions, or data-quality issues.

library(modeldata)
data(two_class_dat)
head(two_class_dat)
         A        B  Class
1 2.069730 1.631647 Class1
2 2.016415 1.036629 Class1
3 1.688555 1.366610 Class2
4 3.434538 1.979776 Class2
5 2.884596 1.975891 Class1
6 3.313589 2.405875 Class2
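
Two quick base R checks confirm these properties before any modeling; nothing here is specific to fastml:

# Class balance of the binary outcome
table(two_class_dat$Class)

# Confirm there are no missing values in any column
colSums(is.na(two_class_dat))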

The purpose of this example is to illustrate evaluation mechanics rather than data preprocessing challenges. Although the dataset itself is simple, the evaluation principles demonstrated here extend to more complex datasets and other supported tasks in fastml, subject to their respective assumptions and constraints.

What you do not need to do in fastml

Before showing the workflow, it is worth being explicit about what the user is not responsible for.

In fastml, users are not required to:

  • manually assemble train–test splits,
  • explicitly construct preprocessing recipes,
  • apply scaling or imputation outside the resampling loop,
  • compose workflows from loosely coupled components,
  • control when preprocessing is trained relative to resampling,
  • directly manipulate resampling objects during model execution.

These steps are common entry points for data leakage.

By default, fastml executes preprocessing, model fitting, and evaluation within a single, resampling-aware structure. While advanced users may override specific components, the standard execution path is designed to preserve training–assessment isolation without relying on user discipline.
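
For contrast, the snippet below shows one of these entry points written out in plain R: estimating scaling parameters on the full dataset before any resampling takes place. This is an illustration of the failure mode recalled from C1 and C2, not fastml code.

# Leakage-prone pattern (illustration only): preprocessing statistics are
# estimated on the FULL dataset before resampling, so every assessment fold
# has already influenced the centering and scaling applied to its training data.
leaky <- two_class_dat
leaky[, c("A", "B")] <- scale(leaky[, c("A", "B")])
# Any cross-validation run on `leaky` now yields optimistically biased estimates.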

Declaring intent

In fastml, the user specifies what is to be evaluated rather than manually assembling the individual components of a modeling pipeline.

At a minimum, this specification includes:

  • the dataset,
  • the outcome variable,
  • the set of algorithms to be evaluated,
  • the intended resampling strategy.

library(fastml)

fit <- fastml(
  data       = two_class_dat,
  label      = "Class",
  algorithms = c("rand_forest", "xgboost"),
  resampling = "cv",
  folds      = 5
)

This call defines the full evaluation setup under the default execution path. Model fitting, preprocessing, and performance estimation are carried out internally according to the declared intent and the constraints imposed by fastml.

What happens internally

The behavior described below reflects the default execution path in fastml and is not exposed for routine user configuration.

For each resampling split:

  • training data are defined according to the resampling specification,
  • any preprocessing steps are estimated using the training data only,
  • models are fitted on the resulting training set,
  • predictions are generated for the corresponding assessment set,
  • performance metrics are computed exclusively on assessment data.
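
The following base R sketch makes this enclosure concrete for a single 5-fold split. It is a conceptual illustration using a logistic regression stand-in, not fastml's internal implementation:

# Conceptual sketch: preprocessing and fitting enclosed within each fold.
set.seed(1)
fold_id <- sample(rep(1:5, length.out = nrow(two_class_dat)))

fold_accuracy <- vapply(1:5, function(k) {
  analysis   <- two_class_dat[fold_id != k, ]   # training portion of the split
  assessment <- two_class_dat[fold_id == k, ]   # held-out portion of the split

  # Preprocessing statistics come from the analysis data only ...
  centers <- colMeans(analysis[, c("A", "B")])
  scales  <- vapply(analysis[, c("A", "B")], sd, numeric(1))

  # ... and are merely applied to the assessment data.
  train_x <- scale(analysis[, c("A", "B")],   center = centers, scale = scales)
  test_x  <- scale(assessment[, c("A", "B")], center = centers, scale = scales)

  model <- glm(Class ~ A + B, family = binomial(),
               data = data.frame(train_x, Class = analysis$Class))
  prob  <- predict(model, newdata = data.frame(test_x), type = "response")
  pred  <- ifelse(prob > 0.5, "Class2", "Class1")

  # Metric computed exclusively on assessment data
  mean(pred == assessment$Class)
}, numeric(1))

mean(fold_accuracy)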

As an additional safety measure, fastml performs internal checks on the resampling structure. If a resample is detected in which the training set coincides with the full dataset, execution is halted and the run is flagged as unsafe.

Why this differs from typical workflows

In many machine learning frameworks:

  • pipelines that violate evaluation assumptions can be constructed,
  • correct evaluation depends largely on user discipline and correct assembly of components,
  • violations such as preprocessing leakage may execute without explicit warnings.

In fastml:

  • common classes of incorrect evaluation pipelines are restricted along the default execution path,
  • methodological correctness is less dependent on user assembly decisions,
  • key evaluation invariants are checked and enforced during execution.

This reflects a deliberate design trade-off: reducing flexibility in pipeline construction in order to lower the risk of undetected evaluation errors.

Inspecting results

Once execution is complete, performance estimates can be accessed directly from the fitted object.

fit$performance$rand_forest$ranger
# A tibble: 7 × 6
  .metric   .estimator .estimate .lower .upper .n_boot
  <chr>     <chr>          <dbl>  <dbl>  <dbl>   <dbl>
1 accuracy  binary         0.818  0.761  0.878     500
2 kap       binary         0.629  0.511  0.748     500
3 sens      binary         0.852  0.777  0.922     500
4 spec      binary         0.775  0.685  0.875     500
5 precision binary         0.824  0.747  0.902     500
6 f_meas    binary         0.838  0.777  0.894     500
7 roc_auc   binary         0.874  0.822  0.928     500
fit$performance$xgboost$xgboost
# A tibble: 7 × 6
  .metric   .estimator .estimate .lower .upper .n_boot
  <chr>     <chr>          <dbl>  <dbl>  <dbl>   <dbl>
1 accuracy  binary         0.811  0.748  0.874     500
2 kap       binary         0.617  0.489  0.742     500
3 sens      binary         0.841  0.758  0.916     500
4 spec      binary         0.775  0.683  0.869     500
5 precision binary         0.822  0.746  0.898     500
6 f_meas    binary         0.831  0.766  0.891     500
7 roc_auc   binary         0.864  0.806  0.920     500

The reported metrics summarize performance estimates obtained under 5-fold cross-validation for each model.

For both the random forest and xgboost models, these estimates are computed on assessment data that are held out from model fitting within each resampling split. They therefore differ from training-set performance and are not derived from post-hoc adjustments or recalibration.

In this example, the random forest model attains slightly higher average performance across most metrics, including accuracy, Cohen’s kappa, F1 score, and ROC AUC. However, the differences relative to xgboost are modest, and both models display a similar balance between sensitivity and specificity. These results indicate comparable overall performance with small differences in error structure rather than a clear dominance of one model over the other.

All reported values arise directly from the resampling-based evaluation procedure used by fastml, in which preprocessing, model fitting, and performance estimation are executed within a single, resampling-aware structure. The metrics therefore represent cross-validated performance summaries under the declared evaluation setup, subject to the usual variability and limitations inherent to resampling-based estimates.
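
Because the two performance tables share the same columns, they can be stacked for a side-by-side view. The sketch below uses dplyr (not part of fastml); the accessor paths and column names are taken from the output above:

library(dplyr)

bind_rows(
  rand_forest = fit$performance$rand_forest$ranger,
  xgboost     = fit$performance$xgboost$xgboost,
  .id = "model"
) |>
  filter(.metric %in% c("accuracy", "roc_auc")) |>
  select(model, .metric, .estimate, .lower, .upper)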

Fold-level variability

Per-fold performance estimates for each model can be inspected directly from the fitted object.

fit$resampling_results$`rand_forest (ranger)`$folds
# A tibble: 35 × 4
   fold  .metric   .estimator .estimate
   <chr> <chr>     <chr>          <dbl>
 1 1     accuracy  binary         0.850
 2 1     kap       binary         0.699
 3 1     sens      binary         0.843
 4 1     spec      binary         0.860
 5 1     precision binary         0.881
 6 1     f_meas    binary         0.861
 7 1     roc_auc   binary         0.874
 8 2     accuracy  binary         0.890
 9 2     kap       binary         0.779
10 2     sens      binary         0.871
# ℹ 25 more rows
fit$resampling_results$`xgboost (xgboost)`$folds
# A tibble: 35 × 4
   fold  .metric   .estimator .estimate
   <chr> <chr>     <chr>          <dbl>
 1 1     accuracy  binary         0.843
 2 1     kap       binary         0.685
 3 1     sens      binary         0.814
 4 1     spec      binary         0.877
 5 1     precision binary         0.891
 6 1     f_meas    binary         0.851
 7 1     roc_auc   binary         0.875
 8 2     accuracy  binary         0.866
 9 2     kap       binary         0.730
10 2     sens      binary         0.871
# ℹ 25 more rows

Performance estimates vary across resampling folds, across metrics, and across models.

In this example, both the random forest and xgboost models exhibit noticeable but moderate fold-to-fold variability. Accuracy, sensitivity, specificity, and agreement-based measures such as Cohen’s kappa fluctuate across folds for both models, reflecting differences in class composition and difficulty across resampling splits.

Across folds, the two models show broadly comparable behavior. The random forest model attains slightly higher values in some folds, while the xgboost model matches or closely tracks its performance in others. For both models, sensitivity and specificity remain relatively balanced across folds, and no systematic metric asymmetry or degenerate behavior is observed. Variation in kappa and accuracy remains within a range consistent with expected resampling variability rather than indicating instability.

This pattern illustrates that, even when aggregated summaries suggest similar overall performance, fold-level inspection remains necessary to assess stability, class-wise trade-offs, and the influence of individual resampling splits. Such variability is an inherent feature of resampling-based evaluation and cannot be fully characterized by a single averaged estimate.
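
This spread can also be quantified directly from the fold-level tables. A minimal dplyr sketch, using the column names printed above:

library(dplyr)

fit$resampling_results$`rand_forest (ranger)`$folds |>
  group_by(.metric) |>
  summarise(mean = mean(.estimate), sd = sd(.estimate), .groups = "drop") |>
  arrange(desc(mean))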

Model comparison

summary(fit)

===== fastml Model Summary =====
Task: classification 
Number of Models Trained: 2 
Best Model(s): rand_forest (ranger) (accuracy: 0.8176101) 

Performance Metrics (Sorted by accuracy):

---------------------------------------------------------------------------------------------- 
Model         Engine   Accuracy  F1 Score  Kappa  Precision  Sensitivity  Specificity  ROC AUC 
---------------------------------------------------------------------------------------------- 
rand_forest*  ranger   0.818     0.838     0.629  0.824      0.852        0.775        0.874   
xgboost       xgboost  0.811     0.831     0.617  0.822      0.841        0.775        0.864   
---------------------------------------------------------------------------------------------- 
(*Best model)

Best Model hyperparameters:

Model: rand_forest (ranger) 
  mtry: 1
  trees: 500
  min_n: 10


===========================
Confusion Matrices by Model
===========================

Model: rand_forest (ranger) 
---------------------------
          Truth
Prediction Class1 Class2
    Class1     75     16
    Class2     13     55

The summary output compares models evaluated under an identical resampling specification.

In this example, the random forest model attains slightly higher average performance on accuracy, F1 score, Cohen’s kappa, precision, sensitivity, and ROC AUC, while specificity is identical for the two models (0.775). The xgboost model shows closely comparable performance, with modestly lower values on the remaining metrics. The differences between models are small and reflect incremental variations in error rates rather than qualitatively distinct error profiles.

Because all models are evaluated using the same resampling splits, observed performance differences can be attributed to the modeling approaches rather than to variation in data partitioning. This consistency is not merely a convenience; it is a methodological requirement for meaningful model comparison under resampling-based evaluation.

What is guaranteed here

Subject to the constraints accepted along the default execution path, fastml provides the following assurances:

  • preprocessing steps are estimated within each resampling split using analysis data only, preventing assessment data from influencing training,
  • model training and evaluation are carried out on disjoint data subsets within each split,
  • all algorithms are evaluated using an identical resampling structure,
  • reported performance metrics correspond to resampling-based estimates of out-of-sample performance.

These properties arise from the architectural design of fastml rather than from user-enforced conventions. They reflect enforced evaluation invariants under standard usage, not informal best-practice recommendations.

What fastml cannot guarantee

As emphasized in the accompanying manuscript, fastml does not and cannot guarantee:

  • that outcome variables are correctly defined or scientifically meaningful,
  • that the raw features are free from prior leakage, measurement artifacts, or target contamination,
  • that the learning task itself addresses a scientifically relevant question,
  • that the chosen performance metrics are appropriate for the scientific or clinical context.

Methodological safeguards can reduce certain classes of technical error, but they cannot substitute for domain expertise, sound study design, or scientific judgment.

Summary

This tutorial did not focus on constructing highly flexible modeling pipelines.

Instead, it demonstrated how fastml limits common pathways to invalid evaluation by constraining how models are trained, evaluated, and compared along the default execution path.

The distinguishing characteristic of fastml is therefore not automation for its own sake, but the enforcement of evaluation invariants intended to reduce methodological errors in performance estimation.

What comes next

02. Multiple Models and Fair Comparison
Why comparing models is statistically invalid without shared resampling.