The fastml package provides a unified and efficient framework for training, evaluating, and comparing multiple machine learning models in R. It is designed to minimize repetitive coding and automate essential steps of a typical machine learning workflow.
With a single, consistent interface, fastml enables researchers and data scientists to perform end-to-end analysis — from data preprocessing to model evaluation — with minimal manual intervention.
Key features include:
Comprehensive Data Preprocessing:
Automatically handles missing values, encodes categorical variables, and applies user-specified normalization or scaling methods.
Multi-Algorithm Support:
Trains and compares a broad range of models such as Random Forests, XGBoost, Support Vector Machines, Neural Networks, and Generalized Linear Models with a single function call.
Task Auto-Detection:
Detects the nature of the modeling task—classification, regression, or survival analysis—based on the provided outcome variable.
Flexible Hyperparameter Tuning:
Supports both default and user-defined tuning strategies, including grid search, random search, and Bayesian optimization.
Comprehensive Evaluation and Visualization:
Generates detailed performance metrics, confusion matrices, and diagnostic plots such as ROC curves, residual plots, and feature importance visualizations.
This vignette introduces the main functionality of fastml through practical examples. You will learn how to set up your data, train multiple models, fine-tune them, and interpret their results efficiently.
The R ecosystem — particularly the tidymodels framework — provides a rich, modular toolkit for building sophisticated and fully customized modeling workflows.
While this flexibility is ideal for in-depth research, it can sometimes feel cumbersome when your goal is to obtain a fast, reliable baseline model.
In many applied settings, the objective is not to engineer the most complex pipeline, but to rapidly compare several well-established algorithms under consistent preprocessing and evaluation procedures.
This is where fastml excels.
The philosophy behind fastml is simple: automate the most common 80% of the modeling workflow so you can focus on interpretation, insight, and decision-making rather than boilerplate code.
The core function, fastml(), provides an opinionated but extensible one-stop interface that automatically performs:
Task Auto-Detection:
Automatically detects whether you are performing classification, regression, or survival analysis based on your target variable (label).
Data Splitting and Stratification:
Creates reproducible training and testing partitions, optionally stratified by class labels.
Preprocessing Pipeline:
Builds a robust recipe that handles missing values, encodes categorical variables, and applies standardization or scaling as needed.
Multi-Algorithm Training:
Trains multiple algorithms in parallel — from classical GLMs to ensemble and tree-based models — with consistent cross-validation.
Hyperparameter Tuning and Evaluation:
Supports efficient tuning strategies and delivers comprehensive performance summaries.
In essence, fastml brings together best practices from the tidymodels ecosystem into a concise and reproducible interface — allowing you to move seamlessly from a raw data frame to a transparent model comparison in a single step.
You can install the stable version of fastml from CRAN:
# Install just the core package
install.packages("fastml")
# Install with all model dependencies (recommended)
install.packages("fastml", dependencies = TRUE)
If you prefer to use the most recent development version with the latest updates and features, you can install it from GitHub:
install.packages("devtools")
devtools::install_github("selcukorkmaz/fastml")
After installation, load the package:
library(fastml)
You are now ready to begin building and evaluating machine learning models with fastml.
Let’s begin with a simple binary classification example using the classic iris dataset.
Here, we will predict whether a flower is versicolor or virginica based on its sepal and petal measurements.
We first prepare and explore the data using tidyverse tools before passing it to fastml().
# Load the tidyverse tools used below
library(dplyr)
library(ggplot2)
data(iris)
iris_binary <- iris %>%
filter(Species != "setosa") %>%
mutate(Species = factor(Species)) # Re-factor so the unused 'setosa' level is dropped
# Optional: quick exploratory plot
iris_binary %>%
ggplot(aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point() +
labs(title = "Exploring Our Data")
Now, we pass this clean iris_binary data frame to fastml(). Notice we only provide the label. fastml inspects the Species column and, seeing it’s a factor, automatically detects a classification task. In this example, we train two algorithms — a Random Forest and a Logistic Regression model.
model_class <- fastml(
data = iris_binary,
label = "Species",
algorithms = c("rand_forest", "logistic_reg")
)
After training, we can inspect the performance results. The summary() function provides an overview of all trained models, ranked by their primary evaluation metric (typically accuracy or AUC).
summary(model_class)
#>
#> ===== fastml Model Summary =====
#> Task: classification
#> Number of Models Trained: 2
#> Best Model(s): rand_forest (ranger) (accuracy: 0.9500000)
#>
#> Performance Metrics (Sorted by accuracy):
#>
#> ---------------------------------------------------------------------------------------------
#> Model Engine Accuracy F1 Score Kappa Precision Sensitivity Specificity ROC AUC
#> ---------------------------------------------------------------------------------------------
#> rand_forest* ranger 0.950 0.947 0.900 1.000 0.900 1.000 1.000
#> logistic_reg glm 0.900 0.889 0.800 1.000 0.800 1.000 0.900
#> ---------------------------------------------------------------------------------------------
#> (*Best model)
#>
#> Best Model hyperparameters:
#>
#> Model: rand_forest (ranger)
#> mtry: 2
#> trees: 500
#> min_n: 10
#>
#>
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#>
#> Model: rand_forest (ranger)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 0
#> virginica 1 10
The model marked with * in the summary output represents the best-performing model based on the selected evaluation criterion. This quick workflow demonstrates how fastml enables rapid, consistent model benchmarking with minimal code.
fastml includes built-in visualization tools that allow quick inspection and comparison of model performance.
Depending on the analysis type, different plots are available for classification, regression, and survival tasks.
For classification problems, the "bar" and "roc" plot types are the most commonly used.
# Plot the performance metrics
plot(model_class, type = "bar")
# Plot ROC curves
plot(model_class, type = "roc")
The bar plot summarizes key performance metrics (e.g., Accuracy, AUC, F1) across all trained models, helping identify the top performer at a glance. The ROC plot illustrates how well each classifier separates the two classes across varying thresholds.
However, a high ROC AUC does not necessarily mean that the model’s predicted probabilities are well calibrated. For example, a model may predict “80% chance of being virginica,” but the event occurs only 60% of the time. To assess calibration quality, you can use the “calibration” plot:
plot(model_class, type = "calibration")
The calibration plot compares predicted probabilities with observed frequencies. A perfectly calibrated model will follow the diagonal line, while deviations indicate over- or under-confidence in the model’s predictions.
fastml supports a broad set of algorithms covering classification, regression, and survival analysis tasks.
These include both traditional statistical models and modern machine learning algorithms such as Random Forests, Gradient Boosting, and Neural Networks.
To view the available algorithms for each task type, use the helper function availableMethods().
# List all supported classification algorithms
availableMethods("classification")
#> [1] "logistic_reg" "multinom_reg" "decision_tree" "C5_rules" "rand_forest"
#> [6] "xgboost" "lightgbm" "svm_linear" "svm_rbf" "nearest_neighbor"
#> [11] "naive_Bayes" "mlp" "discrim_linear" "discrim_quad" "bag_tree"
# List all supported regression algorithms
availableMethods("regression")
#> [1] "linear_reg" "ridge_reg" "lasso_reg" "elastic_net" "decision_tree"
#> [6] "rand_forest" "xgboost" "lightgbm" "svm_linear" "svm_rbf"
#> [11] "nearest_neighbor" "mlp" "pls" "bayes_glm"
# List all supported survival analysis algorithms
availableMethods("survival")
#> [1] "rand_forest" "cox_ph" "penalized_cox" "stratified_cox" "time_varying_cox"
#> [6] "survreg" "royston_parmar" "parametric_surv" "piecewise_exp" "xgboost"
This function returns the algorithm names recognized by fastml, which can be passed directly to the algorithms argument of the fastml() function.
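Because availableMethods() returns a plain character vector, you can store the result and pass any subset of it to fastml(). The short sketch below reuses the iris_binary data frame prepared in the classification example and relies only on functions already shown in this vignette.
# Sketch: store the supported names and pass a subset straight to fastml()
cls_algos <- availableMethods("classification")
model_subset <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = intersect(cls_algos, c("rand_forest", "svm_rbf", "xgboost"))
)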
In many cases, you may want to benchmark multiple algorithms to identify the best-performing model for your dataset.
Instead of manually specifying individual algorithms, you can simply set algorithms = "all".
When this option is used, fastml automatically runs every supported model for the selected task (classification, regression, or survival analysis).
The summary() output then ranks all models according to their key performance metric.
# This process may take several minutes depending on data size and models
model_battle <- fastml(
data = iris_binary,
label = "Species",
algorithms = "all",
)
#> Setting default kernel parameters
# Display a ranked summary of all trained models
summary(model_battle)
#>
#> ===== fastml Model Summary =====
#> Task: classification
#> Number of Models Trained: 14
#> Best Model(s): C5_rules (C5.0) rand_forest (ranger) xgboost (xgboost) svm_rbf (kernlab) naive_Bayes (klaR) discrim_linear (MASS) discrim_quad (MASS) (accuracy: 0.9500000)
#>
#> Performance Metrics (Sorted by accuracy):
#>
#> ---------------------------------------------------------------------------------------------------
#> Model Engine Accuracy F1 Score Kappa Precision Sensitivity Specificity ROC AUC
#> ---------------------------------------------------------------------------------------------------
#> C5_rules* C5.0 0.950 0.947 0.900 1.000 0.900 1.000 1.000
#> rand_forest* ranger 0.950 0.947 0.900 1.000 0.900 1.000 1.000
#> xgboost* xgboost 0.950 0.947 0.900 1.000 0.900 1.000 1.000
#> svm_rbf* kernlab 0.950 0.947 0.900 1.000 0.900 1.000 1.000
#> naive_Bayes* klaR 0.950 0.947 0.900 1.000 0.900 1.000 1.000
#> discrim_linear* MASS 0.950 0.947 0.900 1.000 0.900 1.000 0.990
#> discrim_quad* MASS 0.950 0.947 0.900 1.000 0.900 1.000 0.990
#> logistic_reg glm 0.900 0.889 0.800 1.000 0.800 1.000 0.900
#> decision_tree rpart 0.900 0.889 0.800 1.000 0.800 1.000 0.900
#> lightgbm lightgbm 0.900 0.889 0.800 1.000 0.800 1.000 0.980
#> svm_linear kernlab 0.900 0.889 0.800 1.000 0.800 1.000 1.000
#> nearest_neighbor kknn 0.900 0.889 0.800 1.000 0.800 1.000 0.950
#> mlp nnet 0.900 0.889 0.800 1.000 0.800 1.000 0.990
#> bag_tree rpart 0.900 0.889 0.800 1.000 0.800 1.000 1.000
#> ---------------------------------------------------------------------------------------------------
#> (*Best model)
#>
#> Best Model hyperparameters:
#>
#> Model: C5_rules (C5.0)
#> mtry:
#> trees: 50
#> min_n: 5
#> tree_depth:
#> learn_rate:
#> loss_reduction:
#> sample_size: 0.5
#> stop_iter:
#>
#> Model: rand_forest (ranger)
#> mtry: 2
#> trees: 500
#> min_n: 10
#>
#> Model: xgboost (xgboost)
#> mtry: 2
#> trees: 15
#> min_n: 2
#> tree_depth: 6
#> learn_rate: 0.1
#> loss_reduction: 0
#> sample_size: 0.5
#> stop_iter:
#>
#> Model: svm_rbf (kernlab)
#> cost: 1
#> rbf_sigma: 0.1
#> margin:
#>
#> Model: naive_Bayes (klaR)
#> smoothness: 1
#> Laplace: 0
#>
#> Model: discrim_linear (MASS)
#> penalty:
#> regularization_method:
#>
#> Model: discrim_quad (MASS)
#> regularization_method:
#>
#>
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#>
#> Model: C5_rules (C5.0)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 0
#> virginica 1 10
#>
#> Model: rand_forest (ranger)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 0
#> virginica 1 10
#>
#> Model: xgboost (xgboost)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 0
#> virginica 1 10
#>
#> Model: svm_rbf (kernlab)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 0
#> virginica 1 10
#>
#> Model: naive_Bayes (klaR)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 0
#> virginica 1 10
#>
#> Model: discrim_linear (MASS)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 0
#> virginica 1 10
#>
#> Model: discrim_quad (MASS)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 0
#> virginica 1 10
This approach provides a quick, reproducible “model battle royale” that highlights which algorithms perform best on your data. It is especially useful for exploratory analysis or as a first step before focusing on fine-tuning specific models.
fastml automatically detects a regression task when the outcome variable (label) is numeric.
In this example, we use the pbc dataset from the survival package to predict serum albumin levels (albumin) from baseline clinical and laboratory measurements.
The model performance will be evaluated using Root Mean Squared Error (RMSE) as the optimization metric.
# 1. Prepare data
data(pbc, package = "survival")
# The pbc dataset has two parts; we only want the baseline data (rows 1-312)
pbc_baseline <- pbc[1:312, ]
# 2. Train regression models
# We'll compare a Random Forest and an XGBoost model
model_reg <- fastml(
data = pbc_baseline,
label = "albumin",
algorithms = c("rand_forest", "xgboost"),
metric = "rmse", # Optimize for RMSE
impute_method = "remove" # Remove missing values
)
# 3. Summarize the results
summary(model_reg)
#>
#> ===== fastml Model Summary =====
#> Task: regression
#> Number of Models Trained: 2
#> Best Model(s): rand_forest (ranger) (rmse: 0.3861062)
#>
#> Performance Metrics (Sorted by rmse):
#>
#> ----------------------------------------------
#> Model Engine RMSE R-squared MAE
#> ----------------------------------------------
#> rand_forest* ranger 0.386 0.189 0.326
#> xgboost xgboost 0.772 0.170 0.681
#> ----------------------------------------------
#> (*Best model)
#>
#> Best Model hyperparameters:
#>
#> Model: rand_forest (ranger)
#> mtry: 4
#> trees: 500
#> min_n: 5
The summary output lists all trained models ranked by their RMSE values, where lower scores indicate better predictive accuracy. As with classification tasks, fastml standardizes data preprocessing and cross-validation, ensuring fair model comparison. You can further visualize regression performance using a residual plot:
# Examine residual distribution
plot(model_reg, type = "residual")
#>
#> Residual Diagnostics for Best Model:
These plots help assess whether errors are randomly distributed and whether any model exhibits systematic bias.
fastml provides native support for survival analysis, enabling the use of both classical and modern flexible models.
A survival workflow is automatically triggered when the label argument includes two variables: time and status.
The package seamlessly integrates multiple survival modeling approaches, including:
Cox proportional hazards models (survival package)
Royston–Parmar flexible parametric models (rstpm2 package)
Parametric survival models (flexsurvreg)
In this example, we’ll fit a Cox Proportional-Hazards, a Weibull parametric, and an XGBoost survival model using the lung dataset.
# 1. Prepare data
library(survival)
data(lung, package = "survival")
# 2. Train survival models
model_surv <- fastml(
data = lung,
label = c("time", "status"), # triggers survival analysis
algorithms = c("cox_ph", "parametric_surv", "xgboost"),
impute_method = "medianImpute", # handle missing values
# Specify the distribution for the parametric model
engine_params = list(
parametric_surv = list(
flexsurvreg = list(dist = "weibull")
)
)
)
# 3. Summarize model performance
# Metrics include Harrell’s C-index, Uno’s C, and Integrated Brier Score (IBS)
summary(model_surv)
#>
#> ===== fastml Model Summary =====
#> Task: survival
#> Number of Models Trained: 3
#> Best Model(s): parametric_surv (flexsurvreg) (ibs: 0.2160995)
#>
#> Performance Metrics (Sorted by ibs):
#>
#> -------------------------------------------------------------------------------------------------------------------------------------
#> Model Engine Harrell C-index Uno's C-index Integrated Brier Score RMST diff (t<=567) Brier(t=292) Brier(t=400)
#> -------------------------------------------------------------------------------------------------------------------------------------
#> parametric_surv* flexsurvreg 0.599 0.445 0.216 -15.131 0.276 0.280
#> cox_ph survival 0.368 0.667 0.217 89.855 0.280 0.284
#> xgboost aft 0.361 0.668 0.284 124.897 0.362 0.344
#> -------------------------------------------------------------------------------------------------------------------------------------
#> (*Best model)
#>
#> Best Model hyperparameters:
#>
#> Model: parametric_surv (flexsurvreg)
#> Distribution: weibull
#> Coefficients (link scale):
#> coef
#> shape 0.3622
#> scale 6.005
#> inst 0.09212
#> age -0.06222
#> sex 0.2219
#> ph.ecog -0.3645
#> ph.karno -0.1745
#> pat.karno 0.1441
#> meal.cal 0.006096
#> wt.loss 0.02948
#> Parameter estimates:
#> est L95% U95% se
#> shape 1.437 1.257 1.642 0.09795
#> scale 405.6 358.9 458.3 25.28
#> inst 0.09212 -0.04101 0.2252 0.06792
#> age -0.06222 -0.188 0.06354 0.06416
#> sex 0.2219 0.09421 0.3496 0.06514
#> ph.ecog -0.3645 -0.5686 -0.1603 0.1042
#> ph.karno -0.1745 -0.3459 -0.003112 0.08744
#> pat.karno 0.1441 -0.004938 0.2932 0.07604
#> meal.cal 0.006096 -0.1354 0.1476 0.07219
#> wt.loss 0.02948 -0.0972 0.1562 0.06463
#> Log-likelihood: -915
#> AIC: 1850
#> BIC: 1882
#> Sample size: 200 (events = 100, censored = 50)
The summary output for survival models provides specialized performance metrics that evaluate both discrimination and calibration.
To inspect detailed model coefficients or distribution parameters, you can use the type = "params" option:
summary(model_surv, type = "params", algorithm = "cox_ph")
#> Selected Model hyperparameters:
#>
#> Model: cox_ph (survival)
#> Coefficients (coef):
#> coef
#> inst -0.1243
#> age 0.1005
#> sex -0.3274
#> ph.ecog 0.5268
#> ph.karno 0.2303
#> pat.karno -0.2028
#> meal.cal -0.005052
#> wt.loss -0.03865
#> exp(coef):
#> exp(coef)
#> inst 0.8831
#> age 1.106
#> sex 0.7208
#> ph.ecog 1.693
#> ph.karno 1.259
#> pat.karno 0.8164
#> meal.cal 0.995
#> wt.loss 0.9621
#> Hazard Ratios (95% CI):
#> HR Lower 95% Upper 95%
#> inst 0.8831 0.73 1.068
#> age 1.106 0.9235 1.324
#> sex 0.7208 0.5993 0.867
#> ph.ecog 1.693 1.253 2.29
#> ph.karno 1.259 0.9793 1.618
#> pat.karno 0.8164 0.6581 1.013
#> meal.cal 0.995 0.811 1.221
#> wt.loss 0.9621 0.8015 1.155
#> Likelihood ratio test: 35.84 on 8 df (p = 0.00001875)
#> Concordance (Harrell C-index): 0.3676
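The same call works for any of the fitted survival models. For example, the line below (using only arguments already demonstrated above) inspects the XGBoost AFT fit:
# Inspect the hyperparameters of the XGBoost survival model trained above
summary(model_surv, type = "params", algorithm = "xgboost")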
These outputs allow detailed inspection of model components and help compare the interpretability and flexibility of different survival approaches within a unified fastml workflow.
In many real-world classification problems, one class is much less frequent than the other.
This class imbalance can lead to biased models that favor the majority class and underperform on minority observations.
To mitigate this, fastml provides the balance_method argument, which controls how the training data is balanced before model fitting.
You can specify one of the following options:
"upsample" — randomly duplicates minority-class samples to achieve balance."downsample" — randomly removes majority-class samples to achieve balance."none" — disables balancing and uses the data as is (default).Let’s demonstrate this feature using the BreastCancer dataset from the mlbench package, where the goal is to predict tumor type (benign vs malignant).
library(dplyr)
library(mlbench)
# Load and prepare data
data(BreastCancer)
bc_data <- BreastCancer %>%
select(-Id) # remove non-predictor column
# Examine class distribution
table(bc_data$Class)
#>
#> benign malignant
#> 458 241
# Preview the data structure
head(bc_data)
#> Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses
#> 1 5 1 1 1 2 1 3 1 1
#> 2 5 4 4 5 7 10 3 2 1
#> 3 3 1 1 1 2 2 3 1 1
#> 4 6 8 8 1 3 4 3 7 1
#> 5 4 1 1 3 2 1 3 1 1
#> 6 8 10 10 8 7 10 9 7 1
#> Class
#> 1 benign
#> 2 benign
#> 3 benign
#> 4 benign
#> 5 benign
#> 6 malignant
Case 1: No Balancing (The Baseline)
First, we train on the original, imbalanced data.
model_none <- fastml(
data = bc_data,
label = "Class",
algorithms = "logistic_reg",
impute_method = "medianImpute", # Handle NAs in 'Bare.nuclei'
balance_method = "none" # Key argument!
)
# Show the summary
summary(model_none)
#>
#> ===== fastml Model Summary =====
#> Task: classification
#> Number of Models Trained: 1
#> Best Model(s): logistic_reg (glm) (accuracy: 0.9489051)
#>
#> Performance Metrics (Sorted by accuracy):
#>
#> ----------------------------------------------------------------------------------------------
#> Model Engine Accuracy F1 Score Kappa Precision Sensitivity Specificity ROC AUC
#> ----------------------------------------------------------------------------------------------
#> logistic_reg* glm 0.949 0.961 0.887 0.945 0.977 0.898 0.959
#> ----------------------------------------------------------------------------------------------
#> (*Best model)
#>
#> Best Model hyperparameters:
#>
#> Model: logistic_reg (glm)
#> penalty:
#> mixture:
#>
#>
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#>
#> Model: logistic_reg (glm)
#> ---------------------------
#> Truth
#> Prediction benign malignant
#> benign 86 5
#> malignant 2 44
Case 2: Downsampling
Next, we train a model after downsampling the majority (benign) class to match the size of the minority (malignant) class.
model_down <- fastml(
data = bc_data,
label = "Class",
algorithms = "logistic_reg",
impute_method = "medianImpute", # Handle NAs in 'Bare.nuclei'
balance_method = "downsample" # Key argument!
)
# Show the summary
summary(model_down)
#>
#> ===== fastml Model Summary =====
#> Task: classification
#> Number of Models Trained: 1
#> Best Model(s): logistic_reg (glm) (accuracy: 0.9489051)
#>
#> Performance Metrics (Sorted by accuracy):
#>
#> ----------------------------------------------------------------------------------------------
#> Model Engine Accuracy F1 Score Kappa Precision Sensitivity Specificity ROC AUC
#> ----------------------------------------------------------------------------------------------
#> logistic_reg* glm 0.949 0.961 0.887 0.945 0.977 0.898 0.962
#> ----------------------------------------------------------------------------------------------
#> (*Best model)
#>
#> Best Model hyperparameters:
#>
#> Model: logistic_reg (glm)
#> penalty:
#> mixture:
#>
#>
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#>
#> Model: logistic_reg (glm)
#> ---------------------------
#> Truth
#> Prediction benign malignant
#> benign 86 5
#> malignant 2 44
Case 3: Upsampling
Finally, we train a model after upsampling the minority (malignant) class to match the size of the majority (benign) class.
model_up <- fastml(
data = bc_data,
label = "Class",
algorithms = "logistic_reg",
impute_method = "medianImpute", # Handle NAs in 'Bare.nuclei'
balance_method = "upsample" # Key argument!
)
# Show the summary
summary(model_up)
#>
#> ===== fastml Model Summary =====
#> Task: classification
#> Number of Models Trained: 1
#> Best Model(s): logistic_reg (glm) (accuracy: 0.9489051)
#>
#> Performance Metrics (Sorted by accuracy):
#>
#> ----------------------------------------------------------------------------------------------
#> Model Engine Accuracy F1 Score Kappa Precision Sensitivity Specificity ROC AUC
#> ----------------------------------------------------------------------------------------------
#> logistic_reg* glm 0.949 0.961 0.887 0.945 0.977 0.898 0.987
#> ----------------------------------------------------------------------------------------------
#> (*Best model)
#>
#> Best Model hyperparameters:
#>
#> Model: logistic_reg (glm)
#> penalty:
#> mixture:
#>
#>
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#>
#> Model: logistic_reg (glm)
#> ---------------------------
#> Truth
#> Prediction benign malignant
#> benign 86 5
#> malignant 2 44
Comparison
After training three models using different balancing strategies (none, downsample, and upsample), we can compare their performance metrics.
Although the confusion matrices appear similar for this dataset—due to the inherent stability of logistic regression—the ROC AUC scores reveal meaningful differences:
| Method | ROC AUC |
|---|---|
| none | 0.959 |
| downsample | 0.962 |
| upsample | 0.987 |
The ROC AUC metric evaluates how well the model ranks positive versus negative samples based on predicted probabilities.
Even when the discrete class predictions remain the same, differences in ROC AUC confirm that three distinct models were trained.
Among these, the model trained with upsampling achieved the highest ROC AUC, indicating superior discriminative performance.
This result highlights how addressing class imbalance—particularly through upsampling—can significantly improve the model’s ability to distinguish between benign and malignant cases in imbalanced medical datasets.
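To look beyond the summary tables, you can reuse the plot() interface shown earlier to compare the three fits visually. The small sketch below relies only on plot types already demonstrated in this vignette.
# Compare discrimination of the three balancing strategies visually
plot(model_none, type = "roc")   # baseline, no balancing
plot(model_down, type = "roc")   # downsampled majority class
plot(model_up,   type = "roc")   # upsampled minority class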
fastml can handle missing data using several strategies via the impute_method argument.
While "medianImpute" is fast, you can use more powerful (but slower) methods:
"knnImpute" — Uses K-Nearest Neighbors to impute missing values based on similarity."mice" — Performs Multivariate Imputation by Chained Equations, a flexible and statistically grounded method."missForest" — Employs Random Forests to iteratively impute missing data.The choice depends on dataset size, computational resources, and the degree of missingness.
Below, we demonstrate the use of MICE imputation with the lung dataset from the survival package.
library(survival)
data(lung, package = "survival")
# This code assumes you have the 'mice' package installed
# install.packages("mice")
model_mice <- fastml(
data = lung,
label = c("time", "status"),
algorithms = "penalized_cox",
impute_method = "mice" # Use MICE for imputation
)
#>
#> iter imp variable
#> 1 1 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 1 2 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 1 3 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 1 4 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 1 5 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 2 1 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 2 2 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 2 3 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 2 4 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 2 5 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 3 1 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 3 2 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 3 3 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 3 4 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 3 5 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 4 1 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 4 2 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 4 3 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 4 4 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 4 5 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 5 1 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 5 2 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 5 3 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 5 4 ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 5 5 ph.ecog ph.karno pat.karno meal.cal wt.loss
#>
#> iter imp variable
#> 1 1 inst pat.karno meal.cal wt.loss
#> 1 2 inst pat.karno meal.cal wt.loss
#> 1 3 inst pat.karno meal.cal wt.loss
#> 1 4 inst pat.karno meal.cal wt.loss
#> 1 5 inst pat.karno meal.cal wt.loss
#> 2 1 inst pat.karno meal.cal wt.loss
#> 2 2 inst pat.karno meal.cal wt.loss
#> 2 3 inst pat.karno meal.cal wt.loss
#> 2 4 inst pat.karno meal.cal wt.loss
#> 2 5 inst pat.karno meal.cal wt.loss
#> 3 1 inst pat.karno meal.cal wt.loss
#> 3 2 inst pat.karno meal.cal wt.loss
#> 3 3 inst pat.karno meal.cal wt.loss
#> 3 4 inst pat.karno meal.cal wt.loss
#> 3 5 inst pat.karno meal.cal wt.loss
#> 4 1 inst pat.karno meal.cal wt.loss
#> 4 2 inst pat.karno meal.cal wt.loss
#> 4 3 inst pat.karno meal.cal wt.loss
#> 4 4 inst pat.karno meal.cal wt.loss
#> 4 5 inst pat.karno meal.cal wt.loss
#> 5 1 inst pat.karno meal.cal wt.loss
#> 5 2 inst pat.karno meal.cal wt.loss
#> 5 3 inst pat.karno meal.cal wt.loss
#> 5 4 inst pat.karno meal.cal wt.loss
#> 5 5 inst pat.karno meal.cal wt.loss
summary(model_mice, type = "metrics")
#>
#> ===== fastml Model Summary =====
#> Task: survival
#> Number of Models Trained: 1
#> Best Model(s): penalized_cox (glmnet) (ibs: 0.2213755)
#>
#> Performance Metrics (Sorted by ibs):
#>
#> ------------------------------------------------------------------------------------------------------------------------------
#> Model Engine Harrell C-index Uno's C-index Integrated Brier Score RMST diff (t<=567) Brier(t=292) Brier(t=400)
#> ------------------------------------------------------------------------------------------------------------------------------
#> penalized_cox* glmnet 0.629 0.337 0.221 -48.383 0.283 0.291
#> ------------------------------------------------------------------------------------------------------------------------------
#> (*Best model)
Most machine learning algorithms include hyperparameters that control model complexity, regularization, and learning dynamics.
By default, fastml uses each model’s standard parameter settings (use_default_tuning = FALSE), which are usually adequate for quick benchmarking.
However, you can easily enable automated or fully customized hyperparameter tuning to optimize model performance.
To activate tuning, set use_default_tuning = TRUE.
When enabled, fastml either uses built-in tuning grids for each algorithm or applies your own custom grid if supplied through the tune_params argument.
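As a minimal sketch of the default behavior (no custom grid supplied), the call below simply switches tuning on and lets fastml use its built-in grid for the Random Forest; all arguments are ones introduced earlier in this vignette.
# Tune a Random Forest using fastml's built-in default grid
model_default_tune <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "rand_forest",
  use_default_tuning = TRUE   # enable tuning with the built-in grid
)
# Inspect the selected hyperparameters
summary(model_default_tune, type = "params")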
You can define a tuning grid manually for fine-grained control over search values.
The expected structure is:
list(algorithm = list(engine = list(
param1 = c(values), param2 = c(values)
)))
Here’s an example tuning grid for the ranger engine of a Random Forest model:
# Define a custom grid for the 'ranger' engine of 'rand_forest'
my_tune_grid <- list(
rand_forest = list(
ranger = list(
mtry = c(1, 2, 3),
min_n = c(5, 10)
)
)
)
# Train model with custom tuning
model_custom_tune <- fastml(
data = iris_binary,
label = "Species",
algorithms = "rand_forest",
tune_params = my_tune_grid,
use_default_tuning = TRUE # Must be TRUE to enable tuning
)
# Inspect tuned parameters
summary(model_custom_tune, type = "params")
#> Best Model hyperparameters:
#>
#> Model: rand_forest (ranger)
#> mtry: 1
#> trees: 100
#> min_n: 5
When tuning is enabled, fastml automatically performs internal resampling and selects the parameter combination that yields the best validation performance. This allows efficient exploration of hyperparameter space with minimal manual coding, while maintaining compatibility with all supported algorithms and engines.
Grid search can be computationally expensive because it evaluates all parameter combinations.
To improve efficiency, fastml supports Bayesian optimization, which uses past evaluation results to guide the search toward promising parameter regions.
You can enable Bayesian tuning by setting tuning_strategy = "bayes" and use_default_tuning = TRUE.
The argument tuning_iterations controls how many optimization steps are performed.
# Bayesian hyperparameter tuning example
set.seed(123)
model_bayes <- fastml(
data = iris_binary,
label = "Species",
algorithms = "xgboost",
use_default_tuning = TRUE,
tuning_strategy = "bayes", # enable Bayesian optimization
tuning_iterations = 10 # number of search iterations
)
# Review the best-found parameters
summary(model_bayes, type = "params")
#> Best Model hyperparameters:
#>
#> Model: xgboost (xgboost)
#> mtry: 3
#> trees: 63
#> min_n: 3
#> tree_depth: 5
#> learn_rate: 0.01331
#> loss_reduction: 12.2
#> sample_size: 0.5313
#> stop_iter:
Bayesian optimization intelligently balances exploration (trying new areas of the parameter space) and exploitation (refining promising regions).
This often yields better models in fewer iterations compared to exhaustive grid or random search, making it suitable for complex models like XGBoost or neural networks where tuning spaces are large.
By default, fastml applies an internal preprocessing pipeline that includes data cleaning, encoding, and scaling.
However, in many research or production scenarios, you may need to define custom feature engineering steps or domain-specific transformations.
To support this flexibility, fastml seamlessly integrates with the tidymodels ecosystem.
You can pass your own untrained recipes object through the recipe argument, and fastml will use it for all models—skipping its internal preprocessing.
This approach allows you to combine the simplicity of fastml with the full power of tidymodels’ preprocessing framework.
library(recipes)
# 1. Define a custom tidymodels recipe
# Example: Normalize all numeric features and apply PCA
my_recipe <- recipe(Species ~ ., data = iris_binary) %>%
step_normalize(all_numeric_predictors()) %>%
step_pca(all_numeric_predictors(), num_comp = 2)
# 2. Pass the custom recipe to fastml
# fastml will now use your recipe instead of its internal pipeline
model_recipe <- fastml(
data = iris_binary,
label = "Species",
recipe = my_recipe,
algorithms = c("rand_forest", "svm_rbf")
)
# 3. Summarize model performance
summary(model_recipe)
#>
#> ===== fastml Model Summary =====
#> Task: classification
#> Number of Models Trained: 2
#> Best Model(s): rand_forest (ranger) (accuracy: 0.9000000)
#>
#> Performance Metrics (Sorted by accuracy):
#>
#> ----------------------------------------------------------------------------------------------
#> Model Engine Accuracy F1 Score Kappa Precision Sensitivity Specificity ROC AUC
#> ----------------------------------------------------------------------------------------------
#> rand_forest* ranger 0.900 0.900 0.800 0.900 0.900 0.900 0.990
#> svm_rbf kernlab 0.850 0.824 0.700 1.000 0.700 1.000 1.000
#> ----------------------------------------------------------------------------------------------
#> (*Best model)
#>
#> Best Model hyperparameters:
#>
#> Model: rand_forest (ranger)
#> mtry: 2
#> trees: 500
#> min_n: 10
#>
#>
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#>
#> Model: rand_forest (ranger)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 1
#> virginica 1 9
When a custom recipe is supplied, fastml skips its internal preprocessing pipeline and applies your recipe consistently to every trained model.
This integration allows you to combine the automation and multi-model comparison strengths of fastml with the flexibility and transparency of recipes.
It bridges quick experimentation with advanced, fully customized preprocessing workflows, preserving both reproducibility and analytical control.
By default, fastml automatically creates its own training, testing, and resampling splits.
However, in some cases — such as stratified sampling, blocked time-series splits, or custom validation designs — you may want full control over how the data is partitioned.
To do this, you can pass your own rsample object (e.g., vfold_cv, bootstraps, or mc_cv) to the resamples argument.
fastml will then use your predefined folds for training, validation, and tuning instead of generating its own.
library(rsample)
# 1. Create custom 5-fold cross-validation splits
my_folds <- vfold_cv(iris_binary, v = 5, strata = "Species")
# 2. Use the custom folds in fastml
model_custom_folds <- fastml(
data = iris_binary,
label = "Species",
algorithms = "svm_rbf",
resamples = my_folds, # pass custom resampling object
use_default_tuning = TRUE # tuning will use these folds
)
# 3. Inspect tuned parameters
summary(model_custom_folds, type = "params")
#> Best Model hyperparameters:
#>
#> Model: svm_rbf (kernlab)
#> cost: 1
#> rbf_sigma: 0.1
#> margin:
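Any other rsample object can be supplied in the same way. For illustration, the sketch below swaps the cross-validation folds for bootstrap resamples; bootstraps() is a standard rsample function, and resamples is the fastml argument documented above.
# Sketch: supply bootstrap resamples instead of k-fold cross-validation
library(rsample)
my_boots <- bootstraps(iris_binary, times = 25, strata = "Species")
model_boot <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "rand_forest",
  resamples = my_boots   # fastml tunes and validates on these bootstrap samples
)
summary(model_boot)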
This approach is particularly useful when you need stratified or grouped sampling, blocked time-series splits, or other custom validation designs that must stay identical across all compared models.
By combining rsample objects with fastml, you gain both flexibility and reproducibility while retaining a streamlined modeling workflow.
This integration bridges the convenience of automated model benchmarking with the precision of fully customized resampling strategies.
Beyond hyperparameter tuning, fastml also provides direct control over the engine (the underlying R package used for model fitting) and any fixed engine-specific parameters.
This allows fine-grained customization, enabling side-by-side comparisons of different implementations of the same algorithm.
Many algorithms—such as Random Forests, Gradient Boosting, or SVMs—can be implemented using multiple back-end engines.
The algorithm_engines argument lets you specify which engines to use for each algorithm.
You can provide a vector of engine names for a single algorithm, and fastml will train and evaluate each engine separately under identical preprocessing and evaluation settings.
# Compare different Random Forest engines
model_engines <- fastml(
data = iris_binary,
label = "Species",
algorithms = "rand_forest",
# Run 'rand_forest' twice: once with 'ranger', once with 'randomForest'
algorithm_engines = list(
rand_forest = c("ranger", "randomForest")
)
)
# Summarize performance across engines
summary(model_engines)
#>
#> ===== fastml Model Summary =====
#> Task: classification
#> Number of Models Trained: 2
#> Best Model(s): rand_forest (ranger) rand_forest (randomForest) (accuracy: 0.9500000)
#>
#> Performance Metrics (Sorted by accuracy):
#>
#> ---------------------------------------------------------------------------------------------------
#> Model Engine Accuracy F1 Score Kappa Precision Sensitivity Specificity ROC AUC
#> ---------------------------------------------------------------------------------------------------
#> rand_forest* ranger 0.950 0.947 0.900 1.000 0.900 1.000 1.000
#> rand_forest* randomForest 0.950 0.947 0.900 1.000 0.900 1.000 1.000
#> ---------------------------------------------------------------------------------------------------
#> (*Best model)
#>
#> Best Model hyperparameters:
#>
#> Model: rand_forest (ranger)
#> mtry: 2
#> trees: 500
#> min_n: 10
#>
#> Model: rand_forest (randomForest)
#> mtry: 2
#> trees: 500
#> min_n: 10
#>
#>
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#>
#> Model: rand_forest (ranger)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 0
#> virginica 1 10
#>
#> Model: rand_forest (randomForest)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 0
#> virginica 1 10
# Visualize engine comparison
plot(model_engines, type = "bar")
Both the summary() and plot() functions automatically recognize engine-level differences, enabling direct comparison of performance metrics, runtime, and model characteristics.
This capability is particularly valuable for reproducibility studies, benchmarking different implementations, and evaluating performance trade-offs between older and newer engines.
Beyond selecting engines, you can fine-tune their behavior using the engine_params argument, which allows you to pass fixed arguments directly to the underlying modeling functions
(for example, num.trees, max.depth, or kernel).
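As an illustrative sketch (mirroring the nested structure of algorithm, then engine, then arguments used in the survival example earlier), engine-level arguments could be supplied as follows; the specific ranger arguments shown (num.trees, max.depth) are examples rather than required settings.
# Sketch: pass fixed arguments directly to the ranger engine via engine_params
model_engine_args <- fastml(
  data = iris_binary,
  label = "Species",
  algorithms = "rand_forest",
  algorithm_engines = list(rand_forest = "ranger"),
  engine_params = list(
    rand_forest = list(
      ranger = list(
        num.trees = 500,   # forwarded to ranger::ranger()
        max.depth = 5
      )
    )
  )
)
summary(model_engine_args)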
This balance of control and automation makes fastml suitable for both rapid experimentation and carefully optimized research workflows.
It enables reproducible, side-by-side evaluation of modeling engines under a unified, transparent framework.
In some situations, you may want to set specific model parameters manually instead of tuning them across a search grid.
To do this, include the desired fixed parameters inside the tune_params argument and enable use_default_tuning = TRUE.
This ensures that fastml uses your predefined values during training without performing a parameter search.
In the example below, we instruct the ranger engine of the Random Forest algorithm to build exactly 1000 trees and compute impurity-based variable importance.
All main model arguments are specified directly under tune_params:
# Set fixed engine parameters
model_fixed_params <- fastml(
data = iris_binary,
label = "Species",
algorithms = "rand_forest",
algorithm_engines = list(rand_forest = "ranger"),
# Provide fixed engine arguments inside tune_params
tune_params = list(
rand_forest = list(
ranger = list(
trees = 1000,
importance = "impurity"
)
)
),
use_default_tuning = TRUE # required to apply tune_params
)
# Inspect model parameters
summary(model_fixed_params, type = "params")
#> Best Model hyperparameters:
#>
#> Model: rand_forest (ranger)
#> mtry: 1
#> trees: 1000
#> min_n: 2
This approach ensures reproducibility and consistency across runs by keeping model hyperparameters constant. It is particularly useful for standardized benchmarking, simulation studies, or sensitivity analyses where tuning variability is undesirable.
fastml extends beyond the standard parsnip modeling engines by supporting advanced, high-performance backends such as H2O.
H2O is an open-source, in-memory, distributed machine learning platform designed for scalability and speed,
capable of efficiently handling large datasets that exceed the limits of traditional in-memory R modeling workflows.
By integrating H2O directly, fastml enables seamless access to its optimized implementations of algorithms such as:
Distributed Random Forests (rand_forest, engine = "h2o")
Gradient Boosting Machines (boost_tree, engine = "h2o")
Multilayer Perceptrons (mlp, engine = "h2o")
Generalized Linear Models (linear_reg, engine = "h2o")
Using H2O through fastml provides parallelized training, automatic data distribution across CPU cores,
and highly efficient memory management—all while maintaining the same simple interface as standard engines.
Before using H2O as a backend, ensure that both the h2o and agua packages are installed.
The h2o package provides access to the distributed machine learning platform,
while the agua package serves as a bridge between parsnip (and thus fastml) and the H2O framework.
These packages only need to be installed once per R environment.
# 1. Install the packages
install.packages("h2o")
install.packages("agua") # <-- This is the new, required package
After installation, fastml automatically detects available H2O engines and registers them for all compatible algorithms.
No additional configuration is needed — once h2o and agua are installed, the backend is ready for use.
This integration allows you to train high-performance, distributed models using the same unified fastml syntax.
Whether you are building tree ensembles, deep neural networks, or large-scale GLMs,
the interface remains consistent, while computation is offloaded to the H2O engine for efficiency and scalability.
# 2. Load and initialize the H2O cluster
library(h2o)
library(agua) # must be loaded to enable H2O engines in parsnip/fastml
# Initialize the H2O backend
h2o.init()
#>
#> H2O is not running yet, starting it now...
#>
#> Note: In case of errors look at the following log files:
#> /var/folders/dr/pwksczrd3gg7sxbphrjs5twh0000gn/T//RtmpEXAbKV/file240f2d00b856/h2o_selcukkorkmaz_started_from_r.out
#> /var/folders/dr/pwksczrd3gg7sxbphrjs5twh0000gn/T//RtmpEXAbKV/file240fa462d2f/h2o_selcukkorkmaz_started_from_r.err
#>
#>
#> Starting H2O JVM and connecting: .... Connection successful!
#>
#> R is connected to the H2O cluster:
#> H2O cluster uptime: 3 seconds 331 milliseconds
#> H2O cluster timezone: Europe/Istanbul
#> H2O data parsing timezone: UTC
#> H2O cluster version: 3.46.0.8
#> H2O cluster version age: 21 days, 5 hours and 53 minutes
#> H2O cluster name: H2O_started_from_R_selcukkorkmaz_htm225
#> H2O cluster total nodes: 1
#> H2O cluster total memory: 4.00 GB
#> H2O cluster total cores: 8
#> H2O cluster allowed cores: 8
#> H2O cluster healthy: TRUE
#> H2O Connection ip: localhost
#> H2O Connection port: 54321
#> H2O Connection proxy: NA
#> H2O Internal Security: FALSE
#> R Version: R version 4.3.2 (2023-10-31)
Once initialized, H2O starts a local in-memory cluster that handles data storage, model training, and parallel computation.
The console output confirms successful initialization, displaying details such as available CPU cores, total memory, cluster name, and node configuration.
After this setup, fastml can seamlessly communicate with the H2O environment,
allowing you to train and evaluate large-scale models using distributed computation.
All H2O-based algorithms can now be accessed through the same unified fastml interface —
bringing scalability and performance without changing your workflow.
Once H2O is initialized and the agua package is loaded, you can specify "h2o" as the engine for any supported algorithm — for example, rand_forest.
fastml automatically converts your input data frame into an H2OFrame and handles model training within the H2O cluster.
Internally, fastml's helper functions (such as plot.fastml.R and train_models.R) include logic to correctly manage H2O-specific behaviors, including different naming conventions for prediction columns.
# This code assumes that 'iris_binary' was created in Section 4.
# If your R session was restarted, re-run this setup.
# Start the H2O cluster
h2o.init()
#> Connection successful!
#>
#> R is connected to the H2O cluster:
#> H2O cluster uptime: 3 seconds 468 milliseconds
#> H2O cluster timezone: Europe/Istanbul
#> H2O data parsing timezone: UTC
#> H2O cluster version: 3.46.0.8
#> H2O cluster version age: 21 days, 5 hours and 53 minutes
#> H2O cluster name: H2O_started_from_R_selcukkorkmaz_htm225
#> H2O cluster total nodes: 1
#> H2O cluster total memory: 4.00 GB
#> H2O cluster total cores: 8
#> H2O cluster allowed cores: 8
#> H2O cluster healthy: TRUE
#> H2O Connection ip: localhost
#> H2O Connection port: 54321
#> H2O Connection proxy: NA
#> H2O Internal Security: FALSE
#> R Version: R version 4.3.2 (2023-10-31)
# Train a Random Forest model using the H2O engine
model_h2o <- fastml(
data = iris_binary,
label = "Species",
algorithms = "rand_forest",
algorithm_engines = list(rand_forest = "h2o")
)
# Summarize model performance
summary(model_h2o)
#>
#> ===== fastml Model Summary =====
#> Task: classification
#> Number of Models Trained: 1
#> Best Model(s): rand_forest (h2o) (accuracy: 0.9500000)
#>
#> Performance Metrics (Sorted by accuracy):
#>
#> ---------------------------------------------------------------------------------------------
#> Model Engine Accuracy F1 Score Kappa Precision Sensitivity Specificity ROC AUC
#> ---------------------------------------------------------------------------------------------
#> rand_forest* h2o 0.950 0.947 0.900 1.000 0.900 1.000 0.990
#> ---------------------------------------------------------------------------------------------
#> (*Best model)
#>
#> Best Model hyperparameters:
#>
#> Model: rand_forest (h2o)
#> mtry: 2
#> trees: 50
#> min_n: 2
#>
#>
#> ===========================
#> Confusion Matrices by Model
#> ===========================
#>
#> Model: rand_forest (h2o)
#> ---------------------------
#> Truth
#> Prediction versicolor virginica
#> versicolor 9 0
#> virginica 1 10
# Visualize comparison metrics
plot(model_h2o, type = "bar")
# Shut down the H2O cluster when finished
h2o.shutdown(prompt = FALSE)
Once initialized, fastml communicates directly with the H2O backend, preserving the same simple and consistent interface used for all other engines.
Under the hood, data are automatically converted to H2OFrames, and training is distributed across available CPU cores for faster computation.
The summary() and plot() functions operate exactly as they do with standard engines,
allowing you to evaluate and visualize model performance without changing your workflow.
This seamless integration enables users to scale from small, local datasets to large, distributed environments with no additional configuration.
Training a model is only the first step — understanding why it makes certain predictions is equally important for building trust, diagnosing issues, and ensuring responsible deployment.
The fastexplain function in fastml serves as a convenient, automated interface for model-agnostic interpretability.
It leverages the DALEX package to generate multiple complementary explanations in a single call, making interpretation accessible and consistent across models.
By default, method = "dalex" is used, which combines three essential interpretability tools:
permutation-based variable importance (with boxplots),
partial dependence (model) profiles for selected features, and
Shapley value (SHAP) contributions for individual predictions.
Let’s apply fastexplain() to the regression model (model_reg) we trained in Section 6.
Note: The DALEX package must be installed to enable this functionality.
install.packages("DALEX")
When you call fastexplain(), it automatically performs a sequence of interpretation steps for the best-performing model within the fastml object:
it builds a DALEX explainer, computes permutation-based variable importance, generates partial dependence profiles for the selected predictors (specified via the features argument), and calculates SHAP values for a sample of observations.
Let’s demonstrate this on our pbc regression model, visualizing how bilirubin (bili) and age influence the predicted serum albumin levels.
model_reg <- fastml(
data = pbc_baseline,
label = "albumin",
algorithms = c("xgboost"),
metric = "rmse",
impute_method = "medianImpute"
)
explain_model <- fastexplain(
model_reg,
method = "dalex",
features = c("bili", "age"), # specify variables for partial dependence
shap_sample = 20, # number of samples for SHAP computation
vi_iterations = 15 # number of iterations for permutation importance
)
#> Preparation of a new explainer is initiated
#> -> model label : xgboost
#> -> data : 249 rows 19 cols
#> -> target variable : 249 values
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package , ver. , task regression
#> -> predicted values : numerical, min = 2.56859 , mean = 2.890936 , max = 3.021393
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -0.8480468 , mean = 0.6404692 , max = 1.618607
#> A new explainer has been created!
#>
#> === DALEX Variable Importance (with Boxplots) ===
#>
#> === DALEX Model Profiles (Partial Dependence) ===
#>
#> === DALEX Shapley Values (SHAP) ===
This single call creates a complete interpretability dashboard.
When you execute fastexplain(), it automatically performs and displays the interpretability workflow in your console.
The output log confirms the process step-by-step, typically including messages such as:
"A new explainer has been created!" — indicating that a DALEX explainer has been successfully initialized for the selected model (e.g., xgboost)."DALEX Variable Importance" — computation and plotting of permutation-based feature importance."DALEX Model Profiles" — generation of partial dependence profiles for selected features."DALEX Shapley values (SHAP)" — calculation and visualization of feature-level contribution summaries.All plots are automatically rendered to the console for immediate inspection.
However, fastexplain() also returns an object (commonly stored as explain_model) that contains the underlying data from each explanation step.
This allows for deeper exploration and customized visualization.
In this example, the explain_model$variable_importance data indicates that bili, time, and ascites are among the most influential predictors.
explain_model$variable_importance
#> variable mean_dropout_loss label
#> 1 _full_model_ 0.7329619 xgboost
#> 2 status 0.7317232 xgboost
#> 3 protime 0.7324866 xgboost
#> 4 trig 0.7327706 xgboost
#> 5 copper 0.7329156 xgboost
#> 6 trt 0.7329619 xgboost
#> 7 age 0.7329619 xgboost
#> 8 spiders 0.7329619 xgboost
#> 9 sex_f 0.7329619 xgboost
#> 10 stage 0.7331607 xgboost
#> 11 chol 0.7333779 xgboost
#> 12 edema 0.7337676 xgboost
#> 13 alk.phos 0.7344536 xgboost
#> 14 ast 0.7345508 xgboost
#> 15 hepato 0.7345674 xgboost
#> 16 id 0.7350603 xgboost
#> 17 platelet 0.7360618 xgboost
#> 18 ascites 0.7364314 xgboost
#> 19 time 0.7364955 xgboost
#> 20 bili 0.7405471 xgboost
#> 21 _baseline_ 0.7670726 xgboost
These variables exhibit the largest mean_dropout_loss values — for instance, bili shows a mean dropout loss of approximately 0.741, compared with 0.733 for the full model, meaning that shuffling this feature produces the largest decline in predictive accuracy.
This confirms that the model relies heavily on these variables when estimating outcomes, making them key drivers of the prediction mechanism.
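If the stored object retains its DALEX class (its printed columns suggest it is a model_parts result), it can also be re-plotted directly for a customized view; this is a sketch rather than part of the automatic fastexplain() output.
# Sketch: if explain_model$variable_importance is a DALEX model_parts object,
# the standard plot() method redraws the permutation-importance boxplots
plot(explain_model$variable_importance)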
The explain_model$shap_values data frame contains the underlying statistics for these contributions.
For example, in this model, the features hepato and stage exhibit negative average contributions to the predicted albumin levels (means ≈ -0.012 and -0.005), while bili and time show the largest positive contributions (means ≈ 0.023 each).
explain_model$shap_values
#> min q1 median mean q3 max
#> xgboost: age = -0.552 0.000000000 0.000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
#> xgboost: alk.phos = 0.05379 0.001387102 0.002700264 0.0027545791 0.0026821595 0.0029252495 0.0038771486
#> xgboost: ascites = -0.2615 0.005510047 0.005510047 0.0055100468 0.0055100468 0.0055100468 0.0055100468
#> xgboost: ast = -0.3443 -0.004132973 -0.003937656 -0.0033596044 -0.0027815527 -0.0010235702 -0.0009005185
#> xgboost: bili = -0.3019 0.020106583 0.021965659 0.0232730065 0.0233773643 0.0250576775 0.0268616016
#> xgboost: chol = 0.1443 0.000000000 0.000000000 0.0000000000 0.0006944097 0.0017360243 0.0017360243
#> xgboost: copper = -0.653 0.014296399 0.014296399 0.0227879358 0.0190516594 0.0227879358 0.0227879358
#> xgboost: edema = -0.3781 0.000000000 0.000000000 0.0005734783 0.0011469565 0.0018109040 0.0031800873
#> xgboost: hepato = 0.9703 -0.012540125 -0.012540125 -0.0111928388 -0.0116778618 -0.0111928388 -0.0111928388
#> xgboost: id = 0.2328 0.004901205 0.006248491 0.0065847736 0.0063963692 0.0065847736 0.0079320597
#> xgboost: platelet = 0.1727 0.005416969 0.006847279 0.0069658546 0.0073312484 0.0078949179 0.0086953736
#> xgboost: protime = -0.8004 0.000000000 0.000000000 0.0008198774 0.0008409430 0.0021155235 0.0021155235
#> xgboost: sex_f = 0.3408 0.000000000 0.000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
#> xgboost: spiders = -0.6241 0.000000000 0.000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
#> xgboost: stage = 1.113 -0.008491537 -0.008491537 -0.0084915372 -0.0047552608 0.0000000000 0.0000000000
#> xgboost: status = -0.8448 0.000000000 0.000000000 0.0030313014 0.0025049399 0.0048533089 0.0054591822
#> xgboost: time = 0.4705 0.018564520 0.021138239 0.0232234762 0.0234712759 0.0260295897 0.0293344762
#> xgboost: trig = 0.2066 0.000000000 0.000000000 0.0000000000 0.0002181144 0.0006058733 0.0006058733
#> xgboost: trt = 1.035 0.000000000 0.000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
Together, these patterns reveal how different predictors systematically push the model’s predictions higher or lower,
offering interpretable insight into the biological or clinical relevance of each feature.
The explain_model$model_profiles object contains the raw data underlying these plots.
Within it, the $agr_profiles table provides the average predicted response (_yhat_, corresponding to albumin) for different values (_x_) of each feature —
these averaged predictions form the basis of the partial dependence curves.
explain_model$model_profiles
#> Top profiles :
#> _vname_ _label_ _x_ _yhat_ _ids_
#> 1 age xgboost -2.263518 2.869726 0
#> 2 age xgboost -1.912508 2.869726 0
#> 3 age xgboost -1.820323 2.869726 0
#> 4 age xgboost -1.729347 2.869726 0
#> 5 age xgboost -1.650627 2.869726 0
#> 6 age xgboost -1.596269 2.869726 0
Inspecting this table allows for precise, data-level understanding of how each feature influences model output beyond the visual summaries.
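Assuming the stored object is a standard DALEX model_profile result (as its printed structure suggests), it can likewise be re-plotted for selected variables; again, this is a sketch rather than part of the automatic output.
# Sketch: redraw the partial dependence profile for bilirubin only
plot(explain_model$model_profiles, variables = "bili")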
These visualizations form a comprehensive, model-agnostic explanation suite — clarifying both which features matter most and how they influence the predictions produced by the trained model.