
1 Introduction

The fastml package provides a unified and efficient framework for training, evaluating, and comparing multiple machine learning models in R. It is designed to minimize repetitive coding and automate essential steps of a typical machine learning workflow.

With a single, consistent interface, fastml enables researchers and data scientists to perform end-to-end analysis — from data preprocessing to model evaluation — with minimal manual intervention.

Key features include:

  • Comprehensive Data Preprocessing:
    Automatically handles missing values, encodes categorical variables, and applies user-specified normalization or scaling methods.

  • Multi-Algorithm Support:
    Trains and compares a broad range of models such as Random Forests, XGBoost, Support Vector Machines, Neural Networks, and Generalized Linear Models with a single function call.

  • Task Auto-Detection:
    Detects the nature of the modeling task—classification, regression, or survival analysis—based on the provided outcome variable.

  • Flexible Hyperparameter Tuning:
    Supports both default and user-defined tuning strategies, including grid search, random search, and Bayesian optimization.

  • Comprehensive Evaluation and Visualization:
    Generates detailed performance metrics, confusion matrices, and diagnostic plots such as ROC curves, residual plots, and feature importance visualizations.

This vignette introduces the main functionality of fastml through practical examples. You will learn how to set up your data, train multiple models, fine-tune them, and interpret their results efficiently.

2 Philosophy: The fastml Approach

The R ecosystem — particularly the tidymodels framework — provides a rich, modular toolkit for building sophisticated and fully customized modeling workflows.
While this flexibility is ideal for in-depth research, it can sometimes feel cumbersome when your goal is to obtain a fast, reliable baseline model.

In many applied settings, the objective is not to engineer the most complex pipeline, but to rapidly compare several well-established algorithms under consistent preprocessing and evaluation procedures.
This is where fastml excels.

The philosophy behind fastml is simple: automate the most common 80% of the modeling workflow so you can focus on interpretation, insight, and decision-making rather than boilerplate code.

The core function, fastml(), provides an opinionated but extensible one-stop interface that automatically performs:

  • Task Auto-Detection:
    Automatically detects whether you are performing classification, regression, or survival analysis based on your target variable (label).

  • Data Splitting and Stratification:
    Creates reproducible training and testing partitions, optionally stratified by class labels.

  • Preprocessing Pipeline:
    Builds a robust recipe that handles missing values, encodes categorical variables, and applies standardization or scaling as needed.

  • Multi-Algorithm Training:
    Trains multiple algorithms in parallel — from classical GLMs to ensemble and tree-based models — with consistent cross-validation.

  • Hyperparameter Tuning and Evaluation:
    Supports efficient tuning strategies and delivers comprehensive performance summaries.

In essence, fastml brings together best practices from the tidymodels ecosystem into a concise and reproducible interface — allowing you to move seamlessly from a raw data frame to a transparent model comparison in a single step.

3 Installation

You can install the stable version of fastml from CRAN:

# Install just the core package
install.packages("fastml")

# Install with all model dependencies (recommended)
install.packages("fastml", dependencies = TRUE)

If you prefer to use the most recent development version with the latest updates and features, you can install it from GitHub:

install.packages("devtools")
devtools::install_github("selcukorkmaz/fastml")

After installation, load the package:

library(fastml)

You are now ready to begin building and evaluating machine learning models with fastml.

4 Your First Workflow: Classification

Let’s begin with a simple binary classification example using the classic iris dataset.
Here, we will predict whether a flower is versicolor or virginica based on its sepal and petal measurements.

We first prepare and explore the data using tidyverse tools before passing it to fastml().

library(dplyr)
library(ggplot2)

data(iris)

iris_binary <- iris %>%
  filter(Species != "setosa") %>%
  mutate(Species = factor(Species)) # Ensure label is a factor

# Optional: quick exploratory plot
iris_binary %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  labs(title = "Exploring Our Data")

Now, we pass this clean iris_binary data frame to fastml(). Notice we only provide the label. fastml inspects the Species column and, seeing it’s a factor, automatically detects a classification task. In this example, we train two algorithms — a Random Forest and a Logistic Regression model.

model_class <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = c("rand_forest", "logistic_reg")
)

After training, we can inspect the performance results. The summary() function provides an overview of all trained models, ranked by their primary evaluation metric (typically accuracy or AUC).

summary(model_class)
#> 
#> ===== fastml Model Summary =====
#> Task: classification 
#> Number of Models Trained: 2 
#> Best Model(s): rand_forest (ranger) (accuracy: 0.9500000) 
#> 
#> Performance Metrics (Sorted by accuracy):
#> 
#> --------------------------------------------------------------------------------------------- 
#> Model         Engine  Accuracy  F1 Score  Kappa  Precision  Sensitivity  Specificity  ROC AUC 
#> --------------------------------------------------------------------------------------------- 
#> rand_forest*  ranger  0.950     0.947     0.900  1.000      0.900        1.000        1.000   
#> logistic_reg  glm     0.900     0.889     0.800  1.000      0.800        1.000        0.900   
#> --------------------------------------------------------------------------------------------- 
#> (*Best model)
#> 
#> Best Model hyperparameters:
#> 
#> Model: rand_forest (ranger) 
#>   mtry: 2
#>   trees: 500
#>   min_n: 10
#> 
#> 
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#> 
#> Model: rand_forest (ranger) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         0
#>   virginica           1        10

The model marked with * in the summary output represents the best-performing model based on the selected evaluation criterion. This quick workflow demonstrates how fastml enables rapid, consistent model benchmarking with minimal code.

5 Visualizing Results

fastml includes built-in visualization tools that allow quick inspection and comparison of model performance.
Depending on the analysis type, different plots are available for classification, regression, and survival tasks.

For classification problems, the "bar" and "roc" plot types are the most commonly used.

# Plot the performance metrics
plot(model_class, type = "bar")


# Plot ROC curves
plot(model_class, type = "roc")

The bar plot summarizes key performance metrics (e.g., Accuracy, AUC, F1) across all trained models, helping identify the top performer at a glance. The ROC plot illustrates how well each classifier separates the two classes across varying thresholds.

However, a high ROC AUC does not necessarily mean that the model’s predicted probabilities are well calibrated. For example, a model may predict “80% chance of being Virginica,” but the event occurs only 60% of the time. To assess calibration quality, you can use the “calibration” plot:

plot(model_class, type = "calibration")

The calibration plot compares predicted probabilities with observed frequencies. A perfectly calibrated model will follow the diagonal line, while deviations indicate over- or under-confidence in the model’s predictions.
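
To make the idea concrete, here is a minimal, package-agnostic sketch of how a calibration curve can be computed by hand: predicted probabilities are grouped into bins, and the mean prediction within each bin is compared with the observed event rate. This is only an illustration of the concept, not fastml's internal implementation.

set.seed(42)

# Simulated predicted probabilities and outcomes (illustration only)
pred_prob <- runif(500)
outcome   <- rbinom(500, size = 1, prob = pred_prob^1.5)  # deliberately miscalibrated

# Group predictions into 10 bins and compare mean prediction with observed rate
bins <- cut(pred_prob, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
calib <- data.frame(
  mean_predicted = tapply(pred_prob, bins, mean),
  observed_rate  = tapply(outcome, bins, mean)
)

# A well-calibrated model stays close to the diagonal
plot(calib$mean_predicted, calib$observed_rate,
     xlab = "Mean predicted probability", ylab = "Observed event rate",
     xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1, lty = 2)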

6 Discovering Available Models

fastml supports a broad set of algorithms covering classification, regression, and survival analysis tasks.
These include both traditional statistical models and modern machine learning algorithms such as Random Forests, Gradient Boosting, and Neural Networks.

To view the available algorithms for each task type, use the helper function availableMethods().

# List all supported classification algorithms
availableMethods("classification")
#>  [1] "logistic_reg"     "multinom_reg"     "decision_tree"    "C5_rules"         "rand_forest"     
#>  [6] "xgboost"          "lightgbm"         "svm_linear"       "svm_rbf"          "nearest_neighbor"
#> [11] "naive_Bayes"      "mlp"              "discrim_linear"   "discrim_quad"     "bag_tree"

# List all supported regression algorithms
availableMethods("regression")
#>  [1] "linear_reg"       "ridge_reg"        "lasso_reg"        "elastic_net"      "decision_tree"   
#>  [6] "rand_forest"      "xgboost"          "lightgbm"         "svm_linear"       "svm_rbf"         
#> [11] "nearest_neighbor" "mlp"              "pls"              "bayes_glm"

# List all supported survival analysis algorithms
availableMethods("survival")
#>  [1] "rand_forest"      "cox_ph"           "penalized_cox"    "stratified_cox"   "time_varying_cox"
#>  [6] "survreg"          "royston_parmar"   "parametric_surv"  "piecewise_exp"    "xgboost"

This function returns the algorithm names recognized by fastml, which can be passed directly to the algorithms argument of the fastml() function.
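
For example, you can filter the returned vector and feed the result straight into fastml(). This is a minimal sketch reusing the iris_binary data from Section 4; model_trees is just an illustrative name.

clf_algos <- availableMethods("classification")

# Keep only the tree-based methods from the returned vector
tree_algos <- intersect(clf_algos, c("decision_tree", "rand_forest", "xgboost"))

model_trees <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = tree_algos
)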

7 Running All Models at Once

In many cases, you may want to benchmark multiple algorithms to identify the best-performing model for your dataset.
Instead of manually specifying individual algorithms, you can simply set algorithms = "all".

When this option is used, fastml automatically runs every supported model for the selected task (classification, regression, or survival analysis).

The summary() output then ranks all models according to their key performance metric.

# This process may take several minutes depending on data size and models
model_battle <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "all"
)
#>  Setting default kernel parameters

# Display a ranked summary of all trained models
summary(model_battle)
#> 
#> ===== fastml Model Summary =====
#> Task: classification 
#> Number of Models Trained: 14 
#> Best Model(s): C5_rules (C5.0) rand_forest (ranger) xgboost (xgboost) svm_rbf (kernlab) naive_Bayes (klaR) discrim_linear (MASS) discrim_quad (MASS) (accuracy: 0.9500000) 
#> 
#> Performance Metrics (Sorted by accuracy):
#> 
#> --------------------------------------------------------------------------------------------------- 
#> Model             Engine    Accuracy  F1 Score  Kappa  Precision  Sensitivity  Specificity  ROC AUC 
#> --------------------------------------------------------------------------------------------------- 
#> C5_rules*         C5.0      0.950     0.947     0.900  1.000      0.900        1.000        1.000   
#> rand_forest*      ranger    0.950     0.947     0.900  1.000      0.900        1.000        1.000   
#> xgboost*          xgboost   0.950     0.947     0.900  1.000      0.900        1.000        1.000   
#> svm_rbf*          kernlab   0.950     0.947     0.900  1.000      0.900        1.000        1.000   
#> naive_Bayes*      klaR      0.950     0.947     0.900  1.000      0.900        1.000        1.000   
#> discrim_linear*   MASS      0.950     0.947     0.900  1.000      0.900        1.000        0.990   
#> discrim_quad*     MASS      0.950     0.947     0.900  1.000      0.900        1.000        0.990   
#> logistic_reg      glm       0.900     0.889     0.800  1.000      0.800        1.000        0.900   
#> decision_tree     rpart     0.900     0.889     0.800  1.000      0.800        1.000        0.900   
#> lightgbm          lightgbm  0.900     0.889     0.800  1.000      0.800        1.000        0.980   
#> svm_linear        kernlab   0.900     0.889     0.800  1.000      0.800        1.000        1.000   
#> nearest_neighbor  kknn      0.900     0.889     0.800  1.000      0.800        1.000        0.950   
#> mlp               nnet      0.900     0.889     0.800  1.000      0.800        1.000        0.990   
#> bag_tree          rpart     0.900     0.889     0.800  1.000      0.800        1.000        1.000   
#> --------------------------------------------------------------------------------------------------- 
#> (*Best model)
#> 
#> Best Model hyperparameters:
#> 
#> Model: C5_rules (C5.0) 
#>   mtry: 
#>   trees: 50
#>   min_n: 5
#>   tree_depth: 
#>   learn_rate: 
#>   loss_reduction: 
#>   sample_size: 0.5
#>   stop_iter: 
#> 
#> Model: rand_forest (ranger) 
#>   mtry: 2
#>   trees: 500
#>   min_n: 10
#> 
#> Model: xgboost (xgboost) 
#>   mtry: 2
#>   trees: 15
#>   min_n: 2
#>   tree_depth: 6
#>   learn_rate: 0.1
#>   loss_reduction: 0
#>   sample_size: 0.5
#>   stop_iter: 
#> 
#> Model: svm_rbf (kernlab) 
#>   cost: 1
#>   rbf_sigma: 0.1
#>   margin: 
#> 
#> Model: naive_Bayes (klaR) 
#>   smoothness: 1
#>   Laplace: 0
#> 
#> Model: discrim_linear (MASS) 
#>   penalty: 
#>   regularization_method: 
#> 
#> Model: discrim_quad (MASS) 
#>   regularization_method: 
#> 
#> 
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#> 
#> Model: C5_rules (C5.0) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         0
#>   virginica           1        10
#> 
#> Model: rand_forest (ranger) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         0
#>   virginica           1        10
#> 
#> Model: xgboost (xgboost) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         0
#>   virginica           1        10
#> 
#> Model: svm_rbf (kernlab) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         0
#>   virginica           1        10
#> 
#> Model: naive_Bayes (klaR) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         0
#>   virginica           1        10
#> 
#> Model: discrim_linear (MASS) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         0
#>   virginica           1        10
#> 
#> Model: discrim_quad (MASS) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         0
#>   virginica           1        10

This approach provides a quick, reproducible “model battle royale” that highlights which algorithms perform best on your data. It is especially useful for exploratory analysis or as a first step before focusing on fine-tuning specific models.

8 Workflow Example: Regression

fastml automatically detects a regression task when the outcome variable (label) is numeric.
In this example, we use the pbc dataset from the survival package to predict serum albumin levels from baseline clinical measurements.

The model performance will be evaluated using Root Mean Squared Error (RMSE) as the optimization metric.

# 1. Prepare data
data(pbc, package = "survival")

# Keep the first 312 rows: the participants in the randomized trial
pbc_baseline <- pbc[1:312, ]

# 2. Train regression models
# We'll compare a Random Forest and an XGBoost model
model_reg <- fastml(
    data = pbc_baseline,
    label = "albumin",
    algorithms = c("rand_forest", "xgboost"),
    metric = "rmse",            # Optimize for RMSE
    impute_method = "remove" # Remove missing values
  )

# 3. Summarize the results
summary(model_reg)
#> 
#> ===== fastml Model Summary =====
#> Task: regression 
#> Number of Models Trained: 2 
#> Best Model(s): rand_forest (ranger) (rmse: 0.3861062) 
#> 
#> Performance Metrics (Sorted by rmse):
#> 
#> ---------------------------------------------- 
#> Model         Engine   RMSE   R-squared  MAE   
#> ---------------------------------------------- 
#> rand_forest*  ranger   0.386  0.189      0.326 
#> xgboost       xgboost  0.772  0.170      0.681 
#> ---------------------------------------------- 
#> (*Best model)
#> 
#> Best Model hyperparameters:
#> 
#> Model: rand_forest (ranger) 
#>   mtry: 4
#>   trees: 500
#>   min_n: 5

The summary output lists all trained models ranked by their RMSE values, where lower scores indicate better predictive accuracy. As with classification tasks, fastml standardizes data preprocessing and cross-validation, ensuring a fair model comparison. You can further inspect regression performance with a residual plot:

# Examine residual distribution
plot(model_reg, type = "residual")
#> 
#> Residual Diagnostics for Best Model:

These plots help assess whether errors are randomly distributed and whether any model exhibits systematic bias.

9 A Deeper Dive: Survival Analysis

fastml provides native support for survival analysis, enabling the use of both classical and modern flexible models.
A survival workflow is automatically triggered when the label argument includes two variables: time and status.

The package seamlessly integrates multiple survival modeling approaches, including:

  • Cox Proportional-Hazards model (survival package)
  • Royston–Parmar flexible parametric model (rstpm2 package)
  • General parametric survival models (flexsurv package, via flexsurvreg)

In this example, we’ll fit a Cox Proportional-Hazards, a Weibull parametric, and an XGBoost survival model using the lung dataset.

# 1. Prepare data
library(survival)
data(lung, package = "survival")

# 2. Train survival models
model_surv <- fastml(
  data = lung,
  label = c("time", "status"),  # triggers survival analysis
  algorithms = c("cox_ph", "parametric_surv", "xgboost"),
  impute_method = "medianImpute",  # handle missing values

  # Specify the distribution for the parametric model
  engine_params = list(
    parametric_surv = list(
      flexsurvreg = list(dist = "weibull")
    )
  )
)

# 3. Summarize model performance
# Metrics include Harrell’s C-index, Uno’s C, and Integrated Brier Score (IBS)
summary(model_surv)
#> 
#> ===== fastml Model Summary =====
#> Task: survival 
#> Number of Models Trained: 3 
#> Best Model(s): parametric_surv (flexsurvreg) (ibs: 0.2160995) 
#> 
#> Performance Metrics (Sorted by ibs):
#> 
#> ------------------------------------------------------------------------------------------------------------------------------------- 
#> Model             Engine       Harrell C-index  Uno's C-index  Integrated Brier Score  RMST diff (t<=567)  Brier(t=292)  Brier(t=400) 
#> ------------------------------------------------------------------------------------------------------------------------------------- 
#> parametric_surv*  flexsurvreg  0.599            0.445          0.216                   -15.131             0.276         0.280        
#> cox_ph            survival     0.368            0.667          0.217                   89.855              0.280         0.284        
#> xgboost           aft          0.361            0.668          0.284                   124.897             0.362         0.344        
#> ------------------------------------------------------------------------------------------------------------------------------------- 
#> (*Best model)
#> 
#> Best Model hyperparameters:
#> 
#> Model: parametric_surv (flexsurvreg) 
#>   Distribution: weibull 
#>   Coefficients (link scale):
#>           coef    
#> shape     0.3622  
#> scale     6.005   
#> inst      0.09212 
#> age       -0.06222
#> sex       0.2219  
#> ph.ecog   -0.3645 
#> ph.karno  -0.1745 
#> pat.karno 0.1441  
#> meal.cal  0.006096
#> wt.loss   0.02948 
#>   Parameter estimates:
#>           est      L95%      U95%      se     
#> shape     1.437    1.257     1.642     0.09795
#> scale     405.6    358.9     458.3     25.28  
#> inst      0.09212  -0.04101  0.2252    0.06792
#> age       -0.06222 -0.188    0.06354   0.06416
#> sex       0.2219   0.09421   0.3496    0.06514
#> ph.ecog   -0.3645  -0.5686   -0.1603   0.1042 
#> ph.karno  -0.1745  -0.3459   -0.003112 0.08744
#> pat.karno 0.1441   -0.004938 0.2932    0.07604
#> meal.cal  0.006096 -0.1354   0.1476    0.07219
#> wt.loss   0.02948  -0.0972   0.1562    0.06463
#>   Log-likelihood: -915 
#>   AIC: 1850 
#>   BIC: 1882 
#>   Sample size: 200 (events = 100, censored = 50)

The summary output for survival models provides specialized performance metrics that evaluate both discrimination and calibration. To inspect detailed model coefficients or distribution parameters, you can use the type = "params" option:

summary(model_surv, type = "params", algorithm = "cox_ph")
#> Selected Model hyperparameters:
#> 
#> Model: cox_ph (survival) 
#>   Coefficients (coef):
#>           coef     
#> inst      -0.1243  
#> age       0.1005   
#> sex       -0.3274  
#> ph.ecog   0.5268   
#> ph.karno  0.2303   
#> pat.karno -0.2028  
#> meal.cal  -0.005052
#> wt.loss   -0.03865 
#>   exp(coef):
#>           exp(coef)
#> inst      0.8831   
#> age       1.106    
#> sex       0.7208   
#> ph.ecog   1.693    
#> ph.karno  1.259    
#> pat.karno 0.8164   
#> meal.cal  0.995    
#> wt.loss   0.9621   
#>   Hazard Ratios (95% CI):
#>           HR     Lower 95% Upper 95%
#> inst      0.8831 0.73      1.068    
#> age       1.106  0.9235    1.324    
#> sex       0.7208 0.5993    0.867    
#> ph.ecog   1.693  1.253     2.29     
#> ph.karno  1.259  0.9793    1.618    
#> pat.karno 0.8164 0.6581    1.013    
#> meal.cal  0.995  0.811     1.221    
#> wt.loss   0.9621 0.8015    1.155    
#>   Likelihood ratio test: 35.84 on 8 df (p = 0.00001875) 
#>   Concordance (Harrell C-index): 0.3676

These outputs allow detailed inspection of model components and help compare the interpretability and flexibility of different survival approaches within a unified fastml workflow.

10 Handling Imbalanced Data

In many real-world classification problems, one class is much less frequent than the other.
This class imbalance can lead to biased models that favor the majority class and underperform on minority observations.

To mitigate this, fastml provides the balance_method argument, which controls how the training data is balanced before model fitting.
You can specify one of the following options:

  • "upsample" — randomly duplicates minority-class samples to achieve balance.
  • "downsample" — randomly removes majority-class samples to achieve balance.
  • "none" — disables balancing and uses the data as is (default).

Let’s demonstrate this feature using the BreastCancer dataset from the mlbench package, where the goal is to predict tumor type (benign vs malignant).

library(dplyr)
library(mlbench)

# Load and prepare data
data(BreastCancer)
bc_data <- BreastCancer %>%
  select(-Id)  # remove non-predictor column

# Examine class distribution
table(bc_data$Class)
#> 
#>    benign malignant 
#>       458       241

# Preview the data structure
head(bc_data)
#>   Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses
#> 1            5         1          1             1            2           1           3               1       1
#> 2            5         4          4             5            7          10           3               2       1
#> 3            3         1          1             1            2           2           3               1       1
#> 4            6         8          8             1            3           4           3               7       1
#> 5            4         1          1             3            2           1           3               1       1
#> 6            8        10         10             8            7          10           9               7       1
#>       Class
#> 1    benign
#> 2    benign
#> 3    benign
#> 4    benign
#> 5    benign
#> 6 malignant

Case 1: No Balancing (The Baseline)

First, we train on the original, imbalanced data.

model_none <- fastml(
    data = bc_data,
    label = "Class",
    algorithms = "logistic_reg",
    impute_method = "medianImpute", # Handle NAs in 'Bare.nuclei'
    balance_method = "none"         # Key argument!
)
  
# Show the summary
summary(model_none)
#> 
#> ===== fastml Model Summary =====
#> Task: classification 
#> Number of Models Trained: 1 
#> Best Model(s): logistic_reg (glm) (accuracy: 0.9489051) 
#> 
#> Performance Metrics (Sorted by accuracy):
#> 
#> ---------------------------------------------------------------------------------------------- 
#> Model          Engine  Accuracy  F1 Score  Kappa  Precision  Sensitivity  Specificity  ROC AUC 
#> ---------------------------------------------------------------------------------------------- 
#> logistic_reg*  glm     0.949     0.961     0.887  0.945      0.977        0.898        0.959   
#> ---------------------------------------------------------------------------------------------- 
#> (*Best model)
#> 
#> Best Model hyperparameters:
#> 
#> Model: logistic_reg (glm) 
#>   penalty: 
#>   mixture: 
#> 
#> 
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#> 
#> Model: logistic_reg (glm) 
#> ---------------------------
#>            Truth
#> Prediction  benign malignant
#>   benign        86         5
#>   malignant      2        44

Case 2: Downsampling

Next, we train a model after downsampling the majority (benign) class to match the size of the minority (malignant) class.

model_down <- fastml(
    data = bc_data,
    label = "Class",
    algorithms = "logistic_reg",
    impute_method = "medianImpute", # Handle NAs in 'Bare.nuclei'
    balance_method = "downsample"         # Key argument!
)
  
# Show the summary
summary(model_down)
#> 
#> ===== fastml Model Summary =====
#> Task: classification 
#> Number of Models Trained: 1 
#> Best Model(s): logistic_reg (glm) (accuracy: 0.9489051) 
#> 
#> Performance Metrics (Sorted by accuracy):
#> 
#> ---------------------------------------------------------------------------------------------- 
#> Model          Engine  Accuracy  F1 Score  Kappa  Precision  Sensitivity  Specificity  ROC AUC 
#> ---------------------------------------------------------------------------------------------- 
#> logistic_reg*  glm     0.949     0.961     0.887  0.945      0.977        0.898        0.962   
#> ---------------------------------------------------------------------------------------------- 
#> (*Best model)
#> 
#> Best Model hyperparameters:
#> 
#> Model: logistic_reg (glm) 
#>   penalty: 
#>   mixture: 
#> 
#> 
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#> 
#> Model: logistic_reg (glm) 
#> ---------------------------
#>            Truth
#> Prediction  benign malignant
#>   benign        86         5
#>   malignant      2        44

Case 3: Upsampling

Finally, we train a model after upsampling the minority (malignant) class to match the size of the majority (benign) class.

model_up <- fastml(
    data = bc_data,
    label = "Class",
    algorithms = "logistic_reg",
    impute_method = "medianImpute", # Handle NAs in 'Bare.nuclei'
    balance_method = "upsample"         # Key argument!
)
  
# Show the summary
summary(model_up)
#> 
#> ===== fastml Model Summary =====
#> Task: classification 
#> Number of Models Trained: 1 
#> Best Model(s): logistic_reg (glm) (accuracy: 0.9489051) 
#> 
#> Performance Metrics (Sorted by accuracy):
#> 
#> ---------------------------------------------------------------------------------------------- 
#> Model          Engine  Accuracy  F1 Score  Kappa  Precision  Sensitivity  Specificity  ROC AUC 
#> ---------------------------------------------------------------------------------------------- 
#> logistic_reg*  glm     0.949     0.961     0.887  0.945      0.977        0.898        0.987   
#> ---------------------------------------------------------------------------------------------- 
#> (*Best model)
#> 
#> Best Model hyperparameters:
#> 
#> Model: logistic_reg (glm) 
#>   penalty: 
#>   mixture: 
#> 
#> 
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#> 
#> Model: logistic_reg (glm) 
#> ---------------------------
#>            Truth
#> Prediction  benign malignant
#>   benign        86         5
#>   malignant      2        44

Comparison

After training three models with different balancing strategies (none, downsample, and upsample), we can compare their performance metrics.
Although the confusion matrices are identical for this dataset—reflecting the stability of logistic regression's class predictions—the ROC AUC scores reveal meaningful differences:

Method      ROC AUC
----------  -------
none        0.959
downsample  0.962
upsample    0.987

The ROC AUC metric evaluates how well the model ranks positive versus negative samples based on predicted probabilities.
Even when the discrete class predictions remain the same, the differences in ROC AUC show that the three balancing strategies produced genuinely different models.
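
To make the ranking interpretation concrete, ROC AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The short base-R sketch below illustrates this Mann-Whitney view on toy scores; it is not how fastml computes the metric internally.

set.seed(1)

# Toy predicted scores for true positives and true negatives (illustration only)
p_pos <- runif(30, 0.4, 1.0)
p_neg <- runif(30, 0.0, 0.7)

# AUC as the probability that a positive outranks a negative (ties count 0.5)
pair_wins <- outer(p_pos, p_neg, FUN = function(a, b) (a > b) + 0.5 * (a == b))
mean(pair_wins)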

Among these, the model trained with upsampling achieved the highest ROC AUC, indicating the best discriminative performance.
This result illustrates how addressing class imbalance—particularly through upsampling—can improve the model’s ability to distinguish between benign and malignant cases in imbalanced medical datasets.

11 Imputation

fastml can handle missing data using several strategies via the impute_method argument.
While "medianImpute" is fast, you can use more powerful (but slower) methods:

  • "knnImpute" — Uses K-Nearest Neighbors to impute missing values based on similarity.
  • "mice" — Performs Multivariate Imputation by Chained Equations, a flexible and statistically grounded method.
  • "missForest" — Employs Random Forests to iteratively impute missing data.

The choice depends on dataset size, computational resources, and the degree of missingness.
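
Switching strategies only changes the impute_method string. For example, a kNN-based call for the BreastCancer model from Section 10 might look like this (a minimal sketch; model_knn is an illustrative name):

model_knn <- fastml(
  data = bc_data,
  label = "Class",
  algorithms = "logistic_reg",
  impute_method = "knnImpute"  # impute 'Bare.nuclei' via nearest neighbors
)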

Below, we demonstrate the use of MICE imputation with the lung dataset from the survival package.

library(survival)
data(lung, package = "survival")

# This code assumes you have the 'mice' package installed
# install.packages("mice")

model_mice <- fastml(
  data = lung,
  label = c("time", "status"),
  algorithms = "penalized_cox",
  impute_method = "mice" # Use MICE for imputation
)
#> 
#>  iter imp variable
#>   1   1  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   1   2  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   1   3  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   1   4  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   1   5  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   2   1  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   2   2  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   2   3  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   2   4  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   2   5  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   3   1  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   3   2  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   3   3  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   3   4  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   3   5  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   4   1  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   4   2  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   4   3  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   4   4  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   4   5  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   5   1  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   5   2  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   5   3  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   5   4  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#>   5   5  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
#> 
#>  iter imp variable
#>   1   1  inst  pat.karno  meal.cal  wt.loss
#>   1   2  inst  pat.karno  meal.cal  wt.loss
#>   1   3  inst  pat.karno  meal.cal  wt.loss
#>   1   4  inst  pat.karno  meal.cal  wt.loss
#>   1   5  inst  pat.karno  meal.cal  wt.loss
#>   2   1  inst  pat.karno  meal.cal  wt.loss
#>   2   2  inst  pat.karno  meal.cal  wt.loss
#>   2   3  inst  pat.karno  meal.cal  wt.loss
#>   2   4  inst  pat.karno  meal.cal  wt.loss
#>   2   5  inst  pat.karno  meal.cal  wt.loss
#>   3   1  inst  pat.karno  meal.cal  wt.loss
#>   3   2  inst  pat.karno  meal.cal  wt.loss
#>   3   3  inst  pat.karno  meal.cal  wt.loss
#>   3   4  inst  pat.karno  meal.cal  wt.loss
#>   3   5  inst  pat.karno  meal.cal  wt.loss
#>   4   1  inst  pat.karno  meal.cal  wt.loss
#>   4   2  inst  pat.karno  meal.cal  wt.loss
#>   4   3  inst  pat.karno  meal.cal  wt.loss
#>   4   4  inst  pat.karno  meal.cal  wt.loss
#>   4   5  inst  pat.karno  meal.cal  wt.loss
#>   5   1  inst  pat.karno  meal.cal  wt.loss
#>   5   2  inst  pat.karno  meal.cal  wt.loss
#>   5   3  inst  pat.karno  meal.cal  wt.loss
#>   5   4  inst  pat.karno  meal.cal  wt.loss
#>   5   5  inst  pat.karno  meal.cal  wt.loss

summary(model_mice, type = "metrics")
#> 
#> ===== fastml Model Summary =====
#> Task: survival 
#> Number of Models Trained: 1 
#> Best Model(s): penalized_cox (glmnet) (ibs: 0.2213755) 
#> 
#> Performance Metrics (Sorted by ibs):
#> 
#> ------------------------------------------------------------------------------------------------------------------------------ 
#> Model           Engine  Harrell C-index  Uno's C-index  Integrated Brier Score  RMST diff (t<=567)  Brier(t=292)  Brier(t=400) 
#> ------------------------------------------------------------------------------------------------------------------------------ 
#> penalized_cox*  glmnet  0.629            0.337          0.221                   -48.383             0.283         0.291        
#> ------------------------------------------------------------------------------------------------------------------------------ 
#> (*Best model)

12 Hyperparameter Tuning

Most machine learning algorithms include hyperparameters that control model complexity, regularization, and learning dynamics.
By default, fastml uses each model’s standard parameter settings (use_default_tuning = FALSE), which are usually adequate for quick benchmarking.
However, you can easily enable automated or fully customized hyperparameter tuning to optimize model performance.

12.1 Enabling Tuning

To activate tuning, set use_default_tuning = TRUE.
When enabled, fastml either uses built-in tuning grids for each algorithm or applies your own custom grid if supplied through the tune_params argument.
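
For example, the following call (a minimal sketch) tunes a Random Forest with fastml's built-in grid instead of the fixed defaults:

model_default_tune <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "rand_forest",
  use_default_tuning = TRUE  # tune with the built-in grid for this algorithm
)

# Inspect the selected parameters
summary(model_default_tune, type = "params")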

12.2 Custom Tuning Grids

You can define a tuning grid manually for fine-grained control over search values.
The expected structure is:

list(algorithm = list(engine = list(
  param1 = c(values), param2 = c(values)
)))

Here’s an example tuning grid for the ranger engine of a Random Forest model:

# Define a custom grid for the 'ranger' engine of 'rand_forest'
my_tune_grid <- list(
  rand_forest = list(
    ranger = list(
      mtry = c(1, 2, 3),
      min_n = c(5, 10)
    )
  )
)

# Train model with custom tuning
model_custom_tune <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "rand_forest",
  tune_params = my_tune_grid,
  use_default_tuning = TRUE # Must be TRUE to enable tuning
)

# Inspect tuned parameters
summary(model_custom_tune, type = "params")
#> Best Model hyperparameters:
#> 
#> Model: rand_forest (ranger) 
#>   mtry: 1
#>   trees: 100
#>   min_n: 5

When tuning is enabled, fastml automatically performs internal resampling and selects the parameter combination that yields the best validation performance. This allows efficient exploration of hyperparameter space with minimal manual coding, while maintaining compatibility with all supported algorithms and engines.

12.3 Bayesian Tuning

Grid search can be computationally expensive because it evaluates all parameter combinations.
To improve efficiency, fastml supports Bayesian optimization, which uses past evaluation results to guide the search toward promising parameter regions.

You can enable Bayesian tuning by setting tuning_strategy = "bayes" and use_default_tuning = TRUE.
The argument tuning_iterations controls how many optimization steps are performed.

# Bayesian hyperparameter tuning example
set.seed(123)
model_bayes <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "xgboost",
  use_default_tuning = TRUE,
  tuning_strategy = "bayes",     # enable Bayesian optimization
  tuning_iterations = 10         # number of search iterations
)

# Review the best-found parameters
summary(model_bayes, type = "params")
#> Best Model hyperparameters:
#> 
#> Model: xgboost (xgboost) 
#>   mtry: 3
#>   trees: 63
#>   min_n: 3
#>   tree_depth: 5
#>   learn_rate: 0.01331
#>   loss_reduction: 12.2
#>   sample_size: 0.5313
#>   stop_iter:

Bayesian optimization intelligently balances exploration (trying new areas of the parameter space) and exploitation (refining promising regions).

This often yields better models in fewer iterations compared to exhaustive grid or random search, making it suitable for complex models like XGBoost or neural networks where tuning spaces are large.

13 The tidymodels Bridge: Using Custom recipes

By default, fastml applies an internal preprocessing pipeline that includes data cleaning, encoding, and scaling.
However, in many research or production scenarios, you may need to define custom feature engineering steps or domain-specific transformations.

To support this flexibility, fastml seamlessly integrates with the tidymodels ecosystem.
You can pass your own untrained recipe object (created with the recipes package) through the recipe argument, and fastml will use it for all models—skipping its internal preprocessing.
This approach allows you to combine the simplicity of fastml with the full power of tidymodels’ preprocessing framework.

library(recipes)

# 1. Define a custom tidymodels recipe
# Example: Normalize all numeric features and apply PCA
my_recipe <- recipe(Species ~ ., data = iris_binary) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 2)

# 2. Pass the custom recipe to fastml
# fastml will now use your recipe instead of its internal pipeline
model_recipe <- fastml(
  data = iris_binary,
  label = "Species",
  recipe = my_recipe,
  algorithms = c("rand_forest", "svm_rbf")
)

# 3. Summarize model performance
summary(model_recipe)
#> 
#> ===== fastml Model Summary =====
#> Task: classification 
#> Number of Models Trained: 2 
#> Best Model(s): rand_forest (ranger) (accuracy: 0.9000000) 
#> 
#> Performance Metrics (Sorted by accuracy):
#> 
#> ---------------------------------------------------------------------------------------------- 
#> Model         Engine   Accuracy  F1 Score  Kappa  Precision  Sensitivity  Specificity  ROC AUC 
#> ---------------------------------------------------------------------------------------------- 
#> rand_forest*  ranger   0.900     0.900     0.800  0.900      0.900        0.900        0.990   
#> svm_rbf       kernlab  0.850     0.824     0.700  1.000      0.700        1.000        1.000   
#> ---------------------------------------------------------------------------------------------- 
#> (*Best model)
#> 
#> Best Model hyperparameters:
#> 
#> Model: rand_forest (ranger) 
#>   mtry: 2
#>   trees: 500
#>   min_n: 10
#> 
#> 
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#> 
#> Model: rand_forest (ranger) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         1
#>   virginica           1         9

When a custom recipe is supplied:

  • fastml disables its internal preprocessing steps, such as imputation, encoding, and scaling.
  • The provided recipe is applied consistently across all algorithms, ensuring reproducibility.
  • You gain full control over feature engineering, transformations, and variable selection — ideal for advanced users already working within the tidymodels framework.

This integration allows you to combine the automation and multi-model comparison strengths of fastml with the flexibility and transparency of recipes.
It bridges quick experimentation with advanced, fully customized preprocessing workflows, preserving both reproducibility and analytical control.
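
Because the recipe is a standard tidymodels object, you can also preview exactly what the models will receive by prepping and baking it yourself. This sketch uses the recipes API on the my_recipe object defined above:

# Estimate the recipe steps on the training data and inspect the result
prepped <- prep(my_recipe, training = iris_binary)
head(bake(prepped, new_data = NULL))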

13.1 Using Custom rsample Folds

By default, fastml automatically creates its own training, testing, and resampling splits.
However, in some cases — such as stratified sampling, blocked time-series splits, or custom validation designs — you may want full control over how the data is partitioned.

To do this, you can pass your own rsample object (e.g., vfold_cv, bootstraps, or mc_cv) to the resamples argument.
fastml will then use your predefined folds for training, validation, and tuning instead of generating its own.

library(rsample)

# 1. Create custom 5-fold cross-validation splits
my_folds <- vfold_cv(iris_binary, v = 5, strata = "Species")

# 2. Use the custom folds in fastml
model_custom_folds <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "svm_rbf",
  resamples = my_folds,       # pass custom resampling object
  use_default_tuning = TRUE   # tuning will use these folds
)

# 3. Inspect tuned parameters
summary(model_custom_folds, type = "params")
#> Best Model hyperparameters:
#> 
#> Model: svm_rbf (kernlab) 
#>   cost: 1
#>   rbf_sigma: 0.1
#>   margin:

This approach is particularly useful when you need to:

  • Maintain consistency across multiple modeling experiments, ensuring that identical folds and sampling strategies are reused.
  • Apply domain-specific validation designs, such as patient-level stratification or batch-wise splitting in biomedical or clinical datasets.
  • Guarantee reproducibility of resampling strategies across environments and packages, making results directly comparable between runs.

By combining rsample objects with fastml, you gain both flexibility and reproducibility while retaining a streamlined modeling workflow.
This integration bridges the convenience of automated model benchmarking with the precision of fully customized resampling strategies.
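
For instance, patient-level folds can be built with rsample and passed in the same way. The sketch below assumes a hypothetical data frame my_data with a patient_id grouping column and a numeric or factor outcome; it is not runnable on iris_binary as-is:

library(rsample)

# Keep all rows from the same patient in the same fold (hypothetical columns)
grouped_folds <- group_vfold_cv(my_data, group = patient_id, v = 5)

model_grouped <- fastml(
  data = my_data,
  label = "outcome",
  algorithms = "rand_forest",
  resamples = grouped_folds
)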

14 Advanced Engine & Parameter Control

Beyond hyperparameter tuning, fastml also provides direct control over the engine (the underlying R package used for model fitting) and any fixed engine-specific parameters.
This allows fine-grained customization, enabling side-by-side comparisons of different implementations of the same algorithm.

14.1 Comparing Multiple Engines

Many algorithms—such as Random Forests, Gradient Boosting, or SVMs—can be implemented using multiple back-end engines.
The algorithm_engines argument lets you specify which engines to use for each algorithm.
You can provide a vector of engine names for a single algorithm, and fastml will train and evaluate each engine separately under identical preprocessing and evaluation settings.

# Compare different Random Forest engines
model_engines <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "rand_forest",

  # Run 'rand_forest' twice: once with 'ranger', once with 'randomForest'
  algorithm_engines = list(
    rand_forest = c("ranger", "randomForest")
  )
)

# Summarize performance across engines
summary(model_engines)
#> 
#> ===== fastml Model Summary =====
#> Task: classification 
#> Number of Models Trained: 2 
#> Best Model(s): rand_forest (ranger) rand_forest (randomForest) (accuracy: 0.9500000) 
#> 
#> Performance Metrics (Sorted by accuracy):
#> 
#> --------------------------------------------------------------------------------------------------- 
#> Model         Engine        Accuracy  F1 Score  Kappa  Precision  Sensitivity  Specificity  ROC AUC 
#> --------------------------------------------------------------------------------------------------- 
#> rand_forest*  ranger        0.950     0.947     0.900  1.000      0.900        1.000        1.000   
#> rand_forest*  randomForest  0.950     0.947     0.900  1.000      0.900        1.000        1.000   
#> --------------------------------------------------------------------------------------------------- 
#> (*Best model)
#> 
#> Best Model hyperparameters:
#> 
#> Model: rand_forest (ranger) 
#>   mtry: 2
#>   trees: 500
#>   min_n: 10
#> 
#> Model: rand_forest (randomForest) 
#>   mtry: 2
#>   trees: 500
#>   min_n: 10
#> 
#> 
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#> 
#> Model: rand_forest (ranger) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         0
#>   virginica           1        10
#> 
#> Model: rand_forest (randomForest) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         0
#>   virginica           1        10

# Visualize engine comparison
plot(model_engines, type = "bar")

Both the summary() and plot() functions automatically recognize engine-level differences, enabling direct comparison of performance metrics, runtime, and model characteristics.
This capability is particularly valuable for reproducibility studies, benchmarking different implementations, and evaluating performance trade-offs between older and newer engines.

Beyond selecting engines, you can fine-tune their behavior using the engine_params argument, which allows you to pass fixed arguments directly to the underlying modeling functions
(for example, num.trees, max.depth, or kernel).
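
The nested-list format mirrors the one used for the parametric survival model in Section 9. The call below is a sketch that fixes ranger's num.trees at the engine level:

model_engine_args <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "rand_forest",
  algorithm_engines = list(rand_forest = "ranger"),

  # Fixed arguments passed straight to the underlying ranger() call
  engine_params = list(
    rand_forest = list(
      ranger = list(num.trees = 1000)
    )
  )
)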

This balance of control and automation makes fastml suitable for both rapid experimentation and carefully optimized research workflows.
It enables reproducible, side-by-side evaluation of modeling engines under a unified, transparent framework.

14.2 Setting Fixed Engine Parameters

In some situations, you may want to set specific model parameters manually instead of tuning them across a search grid.
To do this, include the desired fixed parameters inside the tune_params argument and enable use_default_tuning = TRUE.
This ensures that fastml uses your predefined values during training without performing a parameter search.

In the example below, we instruct the ranger engine of the Random Forest algorithm to build exactly 1000 trees and compute impurity-based variable importance.

All main model arguments are specified directly under tune_params:

# Set fixed engine parameters
model_fixed_params <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "rand_forest",
  algorithm_engines = list(rand_forest = "ranger"),
  
  # Provide fixed engine arguments inside tune_params
  tune_params = list(
    rand_forest = list(
      ranger = list(
        trees = 1000,
        importance = "impurity"
      )
    )
  ),
  use_default_tuning = TRUE  # required to apply tune_params
)

# Inspect model parameters
summary(model_fixed_params, type = "params")
#> Best Model hyperparameters:
#> 
#> Model: rand_forest (ranger) 
#>   mtry: 1
#>   trees: 1000
#>   min_n: 2

This approach ensures reproducibility and consistency across runs by keeping model hyperparameters constant. It is particularly useful for standardized benchmarking, simulation studies, or sensitivity analyses where tuning variability is undesirable.

15 Using Advanced Backends: H2O

fastml extends beyond the standard parsnip modeling engines by supporting advanced, high-performance backends such as H2O.
H2O is an open-source, in-memory, distributed machine learning platform designed for scalability and speed,
capable of efficiently handling large datasets that exceed the limits of traditional in-memory R modeling workflows.

By integrating H2O directly, fastml enables seamless access to its optimized implementations of algorithms such as:

  • H2O Random Forest (rand_forest, engine = "h2o")
  • H2O Gradient Boosting Machine (GBM) (boost_tree, engine = "h2o")
  • H2O Deep Learning (mlp, engine = "h2o")
  • H2O Generalized Linear Model (GLM) (linear_reg, engine = "h2o")

Using H2O through fastml provides parallelized training, automatic data distribution across CPU cores,
and highly efficient memory management—all while maintaining the same simple interface as standard engines.

15.1 One-Time Setup

Before using H2O as a backend, ensure that both the h2o and agua packages are installed.
The h2o package provides access to the distributed machine learning platform,
while the agua package serves as a bridge between parsnip (and thus fastml) and the H2O framework.

These packages only need to be installed once per R environment.

# 1. Install the packages
install.packages("h2o")
install.packages("agua")  # bridge between parsnip/fastml and H2O

After installation, fastml automatically detects available H2O engines and registers them for all compatible algorithms.
No additional configuration is needed — once h2o and agua are installed, the backend is ready for use.

This integration allows you to train high-performance, distributed models using the same unified fastml syntax.
Whether you are building tree ensembles, deep neural networks, or large-scale GLMs,
the interface remains consistent, while computation is offloaded to the H2O engine for efficiency and scalability.

# 2. Load and initialize the H2O cluster
library(h2o)
library(agua)  # must be loaded to enable H2O engines in parsnip/fastml

# Initialize the H2O backend
h2o.init()
#> 
#> H2O is not running yet, starting it now...
#> 
#> Note:  In case of errors look at the following log files:
#>     /var/folders/dr/pwksczrd3gg7sxbphrjs5twh0000gn/T//RtmpEXAbKV/file240f2d00b856/h2o_selcukkorkmaz_started_from_r.out
#>     /var/folders/dr/pwksczrd3gg7sxbphrjs5twh0000gn/T//RtmpEXAbKV/file240fa462d2f/h2o_selcukkorkmaz_started_from_r.err
#> 
#> 
#> Starting H2O JVM and connecting: .... Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         3 seconds 331 milliseconds 
#>     H2O cluster timezone:       Europe/Istanbul 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.46.0.8 
#>     H2O cluster version age:    21 days, 5 hours and 53 minutes 
#>     H2O cluster name:           H2O_started_from_R_selcukkorkmaz_htm225 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   4.00 GB 
#>     H2O cluster total cores:    8 
#>     H2O cluster allowed cores:  8 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     R Version:                  R version 4.3.2 (2023-10-31)

Once initialized, H2O starts a local in-memory cluster that handles data storage, model training, and parallel computation.
The console output confirms successful initialization, displaying details such as available CPU cores, total memory, cluster name, and node configuration.

After this setup, fastml can seamlessly communicate with the H2O environment,
allowing you to train and evaluate large-scale models using distributed computation.
All H2O-based algorithms can now be accessed through the same unified fastml interface —
bringing scalability and performance without changing your workflow.

15.2 Running an H2O Model

Once H2O is initialized and the agua package is loaded, you can specify "h2o" as the engine for any supported algorithm — for example, rand_forest.
fastml automatically converts your input data frame into an H2OFrame and handles model training within the H2O cluster.

Internally, fastml's helper functions (such as those in plot.fastml.R and train_models.R) manage H2O-specific behaviors, including the different naming conventions used for prediction columns.

# This code assumes that 'iris_binary' was created in Section 4.
# If your R session was restarted, re-run this setup.

# Start the H2O cluster
h2o.init()
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         3 seconds 468 milliseconds 
#>     H2O cluster timezone:       Europe/Istanbul 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.46.0.8 
#>     H2O cluster version age:    21 days, 5 hours and 53 minutes 
#>     H2O cluster name:           H2O_started_from_R_selcukkorkmaz_htm225 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   4.00 GB 
#>     H2O cluster total cores:    8 
#>     H2O cluster allowed cores:  8 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     R Version:                  R version 4.3.2 (2023-10-31)

# Train a Random Forest model using the H2O engine
model_h2o <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "rand_forest",
  algorithm_engines = list(rand_forest = "h2o")
)

# Summarize model performance
summary(model_h2o)
#> 
#> ===== fastml Model Summary =====
#> Task: classification 
#> Number of Models Trained: 1 
#> Best Model(s): rand_forest (h2o) (accuracy: 0.9500000) 
#> 
#> Performance Metrics (Sorted by accuracy):
#> 
#> --------------------------------------------------------------------------------------------- 
#> Model         Engine  Accuracy  F1 Score  Kappa  Precision  Sensitivity  Specificity  ROC AUC 
#> --------------------------------------------------------------------------------------------- 
#> rand_forest*  h2o     0.950     0.947     0.900  1.000      0.900        1.000        0.990   
#> --------------------------------------------------------------------------------------------- 
#> (*Best model)
#> 
#> Best Model hyperparameters:
#> 
#> Model: rand_forest (h2o) 
#>   mtry: 2
#>   trees: 50
#>   min_n: 2
#> 
#> 
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#> 
#> Model: rand_forest (h2o) 
#> ---------------------------
#>             Truth
#> Prediction   versicolor virginica
#>   versicolor          9         0
#>   virginica           1        10

# Visualize comparison metrics
plot(model_h2o, type = "bar")


# Shut down the H2O cluster when finished
h2o.shutdown(prompt = FALSE)

Once initialized, fastml communicates directly with the H2O backend, preserving the same simple and consistent interface used for all other engines.
Under the hood, data are automatically converted to H2OFrames, and training is distributed across available CPU cores for faster computation.

The summary() and plot() functions operate exactly as they do with standard engines,
allowing you to evaluate and visualize model performance without changing your workflow.
This seamless integration enables users to scale from small, local datasets to large, distributed environments with no additional configuration.

16 Model Interpretation with fastexplain

Training a model is only the first step — understanding why it makes certain predictions is equally important for building trust, diagnosing issues, and ensuring responsible deployment.

The fastexplain function in fastml serves as a convenient, automated interface for model-agnostic interpretability.
It leverages the DALEX package to generate multiple complementary explanations in a single call, making interpretation accessible and consistent across models.

By default, method = "dalex" is used, which combines three essential interpretability tools:

  1. Feature Importance — Quantifies each variable’s contribution to model performance.
  2. Partial Dependence Profiles — Illustrates how predictions change as a single feature varies.
  3. SHAP (Shapley) Values — Provides instance-level explanations showing how features influence individual predictions.

Let’s apply fastexplain() to the regression model (model_reg) we trained in Section 8.

Note: The DALEX package must be installed to enable this functionality.

install.packages("DALEX")

When you call fastexplain(), it automatically performs a sequence of interpretation steps for the best-performing model within the fastml object:

  1. Creates a DALEX explainer for the top-ranked model.
  2. Computes and visualizes Permutation-Based Variable Importance.
  3. Calculates and plots SHAP Value Summaries for local interpretability.
  4. Generates Partial Dependence Profiles for specified features (if provided via the features argument).

Let’s demonstrate this on our pbc regression model, visualizing how bilirubin (bili) and age influence the predicted serum albumin levels.

  model_reg <- fastml(
    data = pbc_baseline,
    label = "albumin",
    algorithms = c("xgboost"),
    metric = "rmse",
    impute_method = "medianImpute"
  )

  
explain_model <- fastexplain(
  model_reg,
  method = "dalex",
  features = c("bili", "age"),  # variables for partial dependence profiles
  shap_sample = 20,             # number of observations sampled for SHAP computation
  vi_iterations = 15            # number of permutation iterations for variable importance
)
#> Preparation of a new explainer is initiated
#>   -> model label       :  xgboost 
#>   -> data              :  249  rows  19  cols 
#>   -> target variable   :  249  values 
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package , ver. , task regression 
#>   -> predicted values  :  numerical, min =  2.56859 , mean =  2.890936 , max =  3.021393  
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  -0.8480468 , mean =  0.6404692 , max =  1.618607  
#>   A new explainer has been created!  
#> 
#> === DALEX Variable Importance (with Boxplots) ===

#> 
#> === DALEX Model Profiles (Partial Dependence) ===
#> 
#> === DALEX Shapley Values (SHAP) ===

This single call creates a complete interpretability dashboard.
When you execute fastexplain(), it runs the full interpretability workflow and reports each step in the console.

The output log confirms the process step-by-step, typically including messages such as:

  • "A new explainer has been created!" — indicating that a DALEX explainer has been successfully initialized for the selected model (e.g., xgboost).
  • "DALEX Variable Importance" — computation and plotting of permutation-based feature importance.
  • "DALEX Model Profiles" — generation of partial dependence profiles for selected features.
  • "DALEX Shapley values (SHAP)" — calculation and visualization of feature-level contribution summaries.

All plots are rendered automatically to the active graphics device for immediate inspection.
In addition, fastexplain() returns an object (stored here as explain_model) that contains the underlying data from each explanation step.

This allows for deeper exploration and customized visualization.
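
A quick way to see which components are available is to inspect the returned object directly. The exact set of elements may vary with the fastml version; the components used in the remainder of this section are variable_importance, shap_values, and model_profiles.

# Inspect the top-level structure of the explanation object
names(explain_model)
str(explain_model, max.level = 1)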

  1. Variable Importance:
    A permutation-based importance plot shows how much each feature contributes to the degradation in model performance (loss) when its values are randomly shuffled.
    Features causing a greater increase in loss are more influential, as they disrupt the model’s predictive structure when permuted.

In this example, the explain_model$variable_importance data indicates that bili, time, and ascites are the most influential predictors.

explain_model$variable_importance
#>        variable mean_dropout_loss   label
#> 1  _full_model_         0.7329619 xgboost
#> 2        status         0.7317232 xgboost
#> 3       protime         0.7324866 xgboost
#> 4          trig         0.7327706 xgboost
#> 5        copper         0.7329156 xgboost
#> 6           trt         0.7329619 xgboost
#> 7           age         0.7329619 xgboost
#> 8       spiders         0.7329619 xgboost
#> 9         sex_f         0.7329619 xgboost
#> 10        stage         0.7331607 xgboost
#> 11         chol         0.7333779 xgboost
#> 12        edema         0.7337676 xgboost
#> 13     alk.phos         0.7344536 xgboost
#> 14          ast         0.7345508 xgboost
#> 15       hepato         0.7345674 xgboost
#> 16           id         0.7350603 xgboost
#> 17     platelet         0.7360618 xgboost
#> 18      ascites         0.7364314 xgboost
#> 19         time         0.7364955 xgboost
#> 20         bili         0.7405471 xgboost
#> 21   _baseline_         0.7670726 xgboost

These variables exhibit the largest mean_dropout_loss values; for instance, bili shows a mean dropout loss of approximately 0.741, compared with the full-model baseline of about 0.733, meaning that shuffling this feature produces the largest increase in prediction error.

This confirms that the model relies heavily on these variables when estimating outcomes, making them key drivers of the prediction mechanism.
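
Because the importance scores are exposed as the tabular object printed above (with columns variable, mean_dropout_loss, and label), you can also build your own plot. The ggplot2 sketch below assumes exactly that structure and drops the baseline rows:

library(ggplot2)

# Custom permutation-importance plot from the returned table
# (assumes the columns printed above: variable, mean_dropout_loss, label)
vi <- as.data.frame(explain_model$variable_importance)
vi <- vi[!vi$variable %in% c("_full_model_", "_baseline_"), ]

ggplot(vi, aes(x = reorder(variable, mean_dropout_loss), y = mean_dropout_loss)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Mean dropout loss after permutation")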

  2. SHAP Summary:
    A bar plot of the mean absolute SHAP values quantifies the average magnitude of each feature’s contribution to model predictions.
    This provides a global view of how strongly each variable influences the output, regardless of direction.

The explain_model$shap_values data frame contains the underlying statistics for these contributions.
For example, in this model, the features bili and time exhibit the largest average positive contributions to the predicted albumin levels (mean ≈ 0.023 each), while hepato shows a notable negative contribution (mean ≈ -0.012).

explain_model$shap_values
#>                                      min           q1        median          mean            q3           max
#> xgboost: age = -0.552        0.000000000  0.000000000  0.0000000000  0.0000000000  0.0000000000  0.0000000000
#> xgboost: alk.phos = 0.05379  0.001387102  0.002700264  0.0027545791  0.0026821595  0.0029252495  0.0038771486
#> xgboost: ascites = -0.2615   0.005510047  0.005510047  0.0055100468  0.0055100468  0.0055100468  0.0055100468
#> xgboost: ast = -0.3443      -0.004132973 -0.003937656 -0.0033596044 -0.0027815527 -0.0010235702 -0.0009005185
#> xgboost: bili = -0.3019      0.020106583  0.021965659  0.0232730065  0.0233773643  0.0250576775  0.0268616016
#> xgboost: chol = 0.1443       0.000000000  0.000000000  0.0000000000  0.0006944097  0.0017360243  0.0017360243
#> xgboost: copper = -0.653     0.014296399  0.014296399  0.0227879358  0.0190516594  0.0227879358  0.0227879358
#> xgboost: edema = -0.3781     0.000000000  0.000000000  0.0005734783  0.0011469565  0.0018109040  0.0031800873
#> xgboost: hepato = 0.9703    -0.012540125 -0.012540125 -0.0111928388 -0.0116778618 -0.0111928388 -0.0111928388
#> xgboost: id = 0.2328         0.004901205  0.006248491  0.0065847736  0.0063963692  0.0065847736  0.0079320597
#> xgboost: platelet = 0.1727   0.005416969  0.006847279  0.0069658546  0.0073312484  0.0078949179  0.0086953736
#> xgboost: protime = -0.8004   0.000000000  0.000000000  0.0008198774  0.0008409430  0.0021155235  0.0021155235
#> xgboost: sex_f = 0.3408      0.000000000  0.000000000  0.0000000000  0.0000000000  0.0000000000  0.0000000000
#> xgboost: spiders = -0.6241   0.000000000  0.000000000  0.0000000000  0.0000000000  0.0000000000  0.0000000000
#> xgboost: stage = 1.113      -0.008491537 -0.008491537 -0.0084915372 -0.0047552608  0.0000000000  0.0000000000
#> xgboost: status = -0.8448    0.000000000  0.000000000  0.0030313014  0.0025049399  0.0048533089  0.0054591822
#> xgboost: time = 0.4705       0.018564520  0.021138239  0.0232234762  0.0234712759  0.0260295897  0.0293344762
#> xgboost: trig = 0.2066       0.000000000  0.000000000  0.0000000000  0.0002181144  0.0006058733  0.0006058733
#> xgboost: trt = 1.035         0.000000000  0.000000000  0.0000000000  0.0000000000  0.0000000000  0.0000000000

Together, these patterns reveal how different predictors systematically push the model’s predictions higher or lower,
offering interpretable insight into the biological or clinical relevance of each feature.
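
If you want the same ranking programmatically, the summary table above can be reshaped directly. The sketch below assumes the printed structure, with the model label and feature name encoded in the row names and the average contribution in the mean column:

# Rank features by mean absolute SHAP contribution
# (assumes the table printed above, with feature labels in the row names)
shap <- as.data.frame(explain_model$shap_values)
shap$feature <- sub(" = .*$", "", sub("^xgboost: ", "", rownames(shap)))
shap[order(abs(shap$mean), decreasing = TRUE), c("feature", "mean")]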

  3. Partial Dependence:
    Line plots (for example, for bili and age) visualize how the model’s predicted albumin values change as each feature varies, while all other variables are held constant.
    These plots capture the marginal effect of each feature on the prediction, helping identify monotonic relationships, nonlinear trends, or thresholds.

The explain_model$model_profiles object contains the raw data underlying these plots.
Within it, the $agr_profiles table provides the average predicted response (_yhat_, corresponding to albumin) for different values (_x_) of each feature —
these averaged predictions form the basis of the partial dependence curves.

explain_model$model_profiles
#> Top profiles    : 
#>   _vname_ _label_       _x_   _yhat_ _ids_
#> 1     age xgboost -2.263518 2.869726     0
#> 2     age xgboost -1.912508 2.869726     0
#> 3     age xgboost -1.820323 2.869726     0
#> 4     age xgboost -1.729347 2.869726     0
#> 5     age xgboost -1.650627 2.869726     0
#> 6     age xgboost -1.596269 2.869726     0

Inspecting this table allows for precise, data-level understanding of how each feature influences model output beyond the visual summaries.
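
For fully customized partial dependence plots, the aggregated profiles can be passed straight to ggplot2. The sketch below assumes the agr_profiles columns shown above (_vname_, _x_, _yhat_):

library(ggplot2)

# Rebuild the partial dependence curves from the aggregated profiles
# (assumes the columns shown above: _vname_, _x_, _yhat_)
pd <- as.data.frame(explain_model$model_profiles$agr_profiles)

ggplot(pd, aes(x = `_x_`, y = `_yhat_`)) +
  geom_line() +
  facet_wrap(~ `_vname_`, scales = "free_x") +
  labs(x = "Feature value", y = "Average predicted albumin")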

These visualizations form a comprehensive, model-agnostic explanation suite — clarifying both which features matter most and how they influence the predictions produced by the trained model.