This tutorial builds on conceptual material introduced earlier in the series (C1–C4). Readers are expected to be familiar with these concepts before proceeding.
The workflow demonstrated here uses a deliberately constrained interface. These constraints are intentional and reflect the design philosophy of fastml: to reduce common sources of evaluation error by limiting user-facing degrees of freedom along the default execution path.
The goal of this tutorial is to estimate the out-of-sample performance of a binary classifier.
The focus is not on maximizing predictive accuracy, tuning hyperparameters, or examining model internals. Instead, the objective is narrowly defined:
What level of predictive performance can reasonably be expected on new, unseen data, given a fixed modeling specification?
In this setting, model fitting itself is straightforward. The primary challenge lies in performance evaluation.
Specifically, the difficulty is to obtain an estimate that is not biased by information leakage or other forms of contamination arising from reuse of the data during training and evaluation.
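To make this failure mode concrete, the following base-R sketch contrasts a leaky evaluation, in which centering and scaling parameters are estimated on the full dataset before splitting, with a leak-free version in which they are estimated on the training rows only. The sketch is illustrative and not part of fastml; the data frame df and its columns are invented for the example.
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200),
                 y  = factor(sample(c("a", "b"), 200, replace = TRUE)))
test_idx <- sample(nrow(df), size = 50)

# Leaky: scaling parameters are computed on all rows, including the test rows.
x_all       <- scale(df[, c("x1", "x2")])
train_leaky <- x_all[-test_idx, ]
test_leaky  <- x_all[test_idx, ]

# Leak-free: parameters come from the training rows only and are then re-applied.
centers  <- colMeans(df[-test_idx, c("x1", "x2")])
scales   <- apply(df[-test_idx, c("x1", "x2")], 2, sd)
train_ok <- scale(df[-test_idx, c("x1", "x2")], center = centers, scale = scales)
test_ok  <- scale(df[test_idx,  c("x1", "x2")], center = centers, scale = scales)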
From C1 and C2, recall the following points:
From C3, recall:
From C4, recall:
This tutorial demonstrates what that enforcement looks like in practice.
This tutorial uses a simple binary classification dataset from the modeldata package.
The dataset is characterized by a binary outcome (Class) and two numeric predictors (A and B).
This choice is intentional. The dataset is deliberately low-dimensional and clean, so that attention can remain on the evaluation procedure rather than on feature engineering, preprocessing decisions, or data-quality issues.
library(modeldata)
data(two_class_dat)
head(two_class_dat)
         A        B  Class
1 2.069730 1.631647 Class1
2 2.016415 1.036629 Class1
3 1.688555 1.366610 Class2
4 3.434538 1.979776 Class2
5 2.884596 1.975891 Class1
6 3.313589 2.405875 Class2
The purpose of this example is to illustrate evaluation mechanics rather than data preprocessing challenges. Although the dataset itself is simple, the evaluation principles demonstrated here extend to more complex datasets and other supported tasks in fastml, subject to their respective assumptions and constraints.
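For readers who wish to verify these characteristics themselves, a quick inspection of the dimensions, class balance, and predictor ranges is sufficient. The calls below use only base R and the columns shown above; their output is not reproduced here.
dim(two_class_dat)                      # number of rows and columns
table(two_class_dat$Class)              # class balance of the outcome
summary(two_class_dat[, c("A", "B")])   # ranges of the two numeric predictors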
Before showing the workflow, it is important to be explicit about what the user is not required to do. In fastml, users do not manually assemble the preprocessing, resampling, and evaluation steps of the pipeline themselves. These steps are common entry points for data leakage.
By default, fastml executes preprocessing, model fitting, and evaluation within a single, resampling-aware structure. While advanced users may override specific components, the standard execution path is designed to preserve training–assessment isolation without relying on user discipline.
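For readers accustomed to assembling such pipelines by hand, the following sketch shows what a single, resampling-aware structure looks like when written explicitly with the tidymodels packages. It is shown only to make the concept concrete; it is not a description of fastml internals, and fastml users do not write this code.
library(rsample)
library(recipes)
library(parsnip)
library(workflows)
library(tune)
library(yardstick)

set.seed(123)
folds <- vfold_cv(two_class_dat, v = 5)

rec <- recipe(Class ~ A + B, data = two_class_dat) |>
  step_normalize(all_numeric_predictors())

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(rand_forest(mode = "classification") |> set_engine("ranger"))

# Preprocessing is re-estimated inside each fold; assessment rows never
# influence the preprocessing or the fitted model that scores them.
res <- fit_resamples(wf, resamples = folds,
                     metrics = metric_set(accuracy, roc_auc))
collect_metrics(res)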
In fastml, the user specifies what is to be evaluated rather than manually assembling the individual components of a modeling pipeline.
At a minimum, this specification includes the dataset, the outcome variable, the algorithms to evaluate, and the resampling strategy, as in the call below.
library(fastml)
fit <- fastml(
  data       = two_class_dat,
  label      = "Class",
  algorithms = c("rand_forest", "xgboost"),
  resampling = "cv",
  folds      = 5
)
This call defines the full evaluation setup under the default execution path. Model fitting, preprocessing, and performance estimation are carried out internally according to the declared intent and the constraints imposed by fastml.
The behavior described below reflects the default execution path in fastml and is not exposed for routine user configuration.
For each resampling split:
As an additional safety measure, fastml performs internal checks on the resampling structure. If a resample is detected in which the training set coincides with the full dataset, execution is halted and the run is flagged as unsafe.
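The exact implementation of these checks is internal to fastml and not shown in this tutorial. The hypothetical helper below (its name and arguments are invented for this sketch) only illustrates the kind of invariant being enforced.
# Hypothetical sketch, not fastml's actual internal code.
check_resample_is_proper <- function(analysis_idx, n_rows) {
  # A resample is unsafe if its training (analysis) rows cover the full
  # dataset, leaving nothing genuinely held out for assessment.
  if (length(unique(analysis_idx)) >= n_rows) {
    stop("Unsafe resample: the training set coincides with the full dataset.")
  }
  invisible(TRUE)
}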
In many machine learning frameworks:
In fastml:
This reflects a deliberate design trade-off: reducing flexibility in pipeline construction in order to lower the risk of undetected evaluation errors.
Once execution is complete, performance estimates can be accessed directly from the fitted object.
fit$performance$rand_forest$ranger
# A tibble: 7 × 6
.metric .estimator .estimate .lower .upper .n_boot
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 accuracy binary 0.818 0.761 0.878 500
2 kap binary 0.629 0.511 0.748 500
3 sens binary 0.852 0.777 0.922 500
4 spec binary 0.775 0.685 0.875 500
5 precision binary 0.824 0.747 0.902 500
6 f_meas binary 0.838 0.777 0.894 500
7 roc_auc binary 0.874 0.822 0.928 500
fit$performance$xgboost$xgboost
# A tibble: 7 × 6
.metric .estimator .estimate .lower .upper .n_boot
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 accuracy binary 0.811 0.748 0.874 500
2 kap binary 0.617 0.489 0.742 500
3 sens binary 0.841 0.758 0.916 500
4 spec binary 0.775 0.683 0.869 500
5 precision binary 0.822 0.746 0.898 500
6 f_meas binary 0.831 0.766 0.891 500
7 roc_auc binary 0.864 0.806 0.920 500
The reported metrics summarize performance estimates obtained under 5-fold cross-validation for each model.
For both the random forest and xgboost models, these estimates are computed on assessment data that are held out from model fitting within each resampling split. They therefore differ from training-set performance and are not derived from post-hoc adjustments or recalibration.
In this example, the random forest model attains slightly higher average performance across most metrics, including accuracy, Cohen’s kappa, F1 score, and ROC AUC. However, the differences relative to xgboost are modest, and both models display a similar balance between sensitivity and specificity. These results indicate comparable overall performance with small differences in error structure rather than a clear dominance of one model over the other.
All reported values arise directly from the resampling-based evaluation procedure used by fastml, in which preprocessing, model fitting, and performance estimation are executed within a single, resampling-aware structure. The metrics therefore represent cross-validated performance summaries under the declared evaluation setup, subject to the usual variability and limitations inherent to resampling-based estimates.
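For side-by-side inspection, the two per-model summaries can be combined into a single table. The sketch below assumes only the accessor structure shown above and uses dplyr and tidyr for reshaping; its output is not reproduced here.
library(dplyr)
library(tidyr)

perf <- bind_rows(
  `rand_forest (ranger)` = fit$performance$rand_forest$ranger,
  `xgboost (xgboost)`    = fit$performance$xgboost$xgboost,
  .id = "model"
)

perf |>
  select(model, .metric, .estimate) |>
  pivot_wider(names_from = model, values_from = .estimate)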
fit$resampling_results$`rand_forest (ranger)`$folds
# A tibble: 35 × 4
fold .metric .estimator .estimate
<chr> <chr> <chr> <dbl>
1 1 accuracy binary 0.850
2 1 kap binary 0.699
3 1 sens binary 0.843
4 1 spec binary 0.860
5 1 precision binary 0.881
6 1 f_meas binary 0.861
7 1 roc_auc binary 0.874
8 2 accuracy binary 0.890
9 2 kap binary 0.779
10 2 sens binary 0.871
# ℹ 25 more rows
fit$resampling_results$`xgboost (xgboost)`$folds
# A tibble: 35 × 4
fold .metric .estimator .estimate
<chr> <chr> <chr> <dbl>
1 1 accuracy binary 0.843
2 1 kap binary 0.685
3 1 sens binary 0.814
4 1 spec binary 0.877
5 1 precision binary 0.891
6 1 f_meas binary 0.851
7 1 roc_auc binary 0.875
8 2 accuracy binary 0.866
9 2 kap binary 0.730
10 2 sens binary 0.871
# ℹ 25 more rows
Performance estimates vary across resampling folds, across metrics, and across models.
In this example, both the random forest and xgboost models exhibit noticeable but moderate fold-to-fold variability. Accuracy, sensitivity, specificity, and agreement-based measures such as Cohen’s kappa fluctuate across folds for both models, reflecting differences in class composition and difficulty across resampling splits.
Across folds, the two models show broadly comparable behavior. The random forest model attains slightly higher values in some folds, while the xgboost model matches or closely tracks its performance in others. For both models, sensitivity and specificity remain relatively balanced across folds, and no systematic metric asymmetry or degenerate behavior is observed. Variation in kappa and accuracy remains within a range consistent with expected resampling variability rather than indicating instability.
This pattern illustrates that, even when aggregated summaries suggest similar overall performance, fold-level inspection remains necessary to assess stability, class-wise trade-offs, and the influence of individual resampling splits. Such variability is an inherent feature of resampling-based evaluation and cannot be fully characterized by a single averaged estimate.
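To quantify this fold-to-fold variability, the per-fold tibbles shown above can be summarized directly. The sketch below assumes only the fit$resampling_results structure already displayed and uses dplyr for the aggregation.
library(dplyr)

fit$resampling_results$`rand_forest (ranger)`$folds |>
  group_by(.metric) |>
  summarise(mean_estimate = mean(.estimate),
            sd_estimate   = sd(.estimate),
            .groups = "drop")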
summary(fit)
===== fastml Model Summary =====
Task: classification
Number of Models Trained: 2
Best Model(s): rand_forest (ranger) (accuracy: 0.8176101)
Performance Metrics (Sorted by accuracy):
----------------------------------------------------------------------------------------------
Model Engine Accuracy F1 Score Kappa Precision Sensitivity Specificity ROC AUC
----------------------------------------------------------------------------------------------
rand_forest* ranger 0.818 0.838 0.629 0.824 0.852 0.775 0.874
xgboost xgboost 0.811 0.831 0.617 0.822 0.841 0.775 0.864
----------------------------------------------------------------------------------------------
(*Best model)
Best Model hyperparameters:
Model: rand_forest (ranger)
mtry: 1
trees: 500
min_n: 10
===========================
Confusion Matrices by Model
===========================
Model: rand_forest (ranger)
---------------------------
Truth
Prediction Class1 Class2
Class1 75 16
Class2 13 55
The summary output compares models evaluated under an identical resampling specification.
In this example, the random forest model attains slightly higher average performance across all reported metrics, including accuracy, F1 score, Cohen’s kappa, sensitivity, specificity, and ROC AUC. The xgboost model shows closely comparable performance, with modestly lower values across the same metrics. The differences between models are small and reflect incremental variations in error rates rather than qualitatively distinct error profiles.
Because all models are evaluated using the same resampling splits, observed performance differences can be attributed to the modeling approaches rather than to variation in data partitioning. This consistency is not merely a convenience; it is a methodological requirement for meaningful model comparison under resampling-based evaluation.
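As a consistency check, the headline metrics reported for the best model can be recovered from the confusion matrix printed above (Truth in columns, Prediction in rows) using the standard binary-classification formulas; for example, the implied accuracy (75 + 55) / 159 matches the reported 0.8176.
cm <- matrix(c(75, 13, 16, 55), nrow = 2,
             dimnames = list(Prediction = c("Class1", "Class2"),
                             Truth      = c("Class1", "Class2")))

accuracy    <- sum(diag(cm)) / sum(cm)                       # 130 / 159 ≈ 0.818
sensitivity <- cm["Class1", "Class1"] / sum(cm[, "Class1"])  # 75 / 88   ≈ 0.852
specificity <- cm["Class2", "Class2"] / sum(cm[, "Class2"])  # 55 / 71   ≈ 0.775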
Subject to the constraints accepted along the default execution path, fastml provides the following assurances:
These properties arise from the architectural design of fastml rather than from user-enforced conventions. They reflect enforced evaluation invariants under standard usage, not informal best-practice recommendations.
As emphasized in the accompanying manuscript, fastml does not and cannot guarantee:
Methodological safeguards can reduce certain classes of technical error, but they cannot substitute for domain expertise, sound study design, or scientific judgment.
This tutorial did not focus on constructing highly flexible modeling pipelines.
Instead, it demonstrated how fastml limits common pathways to invalid evaluation by constraining how models are trained, evaluated, and compared along the default execution path.
The distinguishing characteristic of fastml is therefore not automation for its own sake, but the enforcement of evaluation invariants intended to reduce methodological errors in performance estimation.
02. Multiple Models and Fair Comparison
Why comparing models is statistically invalid without shared resampling.