Classification Modeling

Choonghyun Ryu

2021-02-22

Preface

Once the data set is ready for model development, the model is fitted, predicted, and evaluated in the following steps:

- Cleanse the dataset
- Split the data into a train set and a test set
- Handle imbalanced classes by sampling
- Fit binary classification models
- Predict the test set and evaluate the performance of the models

The alookr package makes these steps fast and easy:

Data: Wisconsin Breast Cancer Data

The BreastCancer data from the mlbench package is a breast cancer dataset. The objective is to classify each observation as benign or malignant.

It is a data frame with 699 observations on 11 variables: one character variable, nine ordered or nominal predictors, and one target class:

library(mlbench)
data(BreastCancer)

# class of each variable
sapply(BreastCancer, function(x) class(x)[1])
             Id    Cl.thickness       Cell.size      Cell.shape   Marg.adhesion 
    "character"       "ordered"       "ordered"       "ordered"       "ordered" 
   Epith.c.size     Bare.nuclei     Bl.cromatin Normal.nucleoli         Mitoses 
      "ordered"        "factor"        "factor"        "factor"        "factor" 
          Class 
       "factor" 

Preparing the data

Perform data preprocessing as follows:

Fix missing values with dlookr::imputate_na()

First, find the variables that contain missing values, then impute them using imputate_na() from the dlookr package.

library(dlookr)
library(dplyr)

# variables that have a missing value
diagnose(BreastCancer) %>%
  filter(missing_count > 0)
# A tibble: 1 x 6
  variables   types  missing_count missing_percent unique_count unique_rate
  <chr>       <chr>          <int>           <dbl>        <int>       <dbl>
1 Bare.nuclei factor            16            2.29           11      0.0157

# imputation of missing value
breastCancer <- BreastCancer %>%
  mutate(Bare.nuclei = imputate_na(BreastCancer, Bare.nuclei, Class,
                         method = "mice", no_attrs = TRUE, print_flag = FALSE))

Split data set

Split the dataset into a train set and a test set with split_by()

split_by() in the alookr package splits the dataset into a train set and a test set.

The ratio argument of the split_by() function specifies the proportion of the train set.

split_by() creates an object of class split_df.

library(alookr)

# split the data into a train set and a test set by default arguments
sb <- breastCancer %>%
  split_by(target = Class)

# show the class name
class(sb)
[1] "split_df"   "grouped_df" "tbl_df"     "tbl"        "data.frame"

# split the data into a train set and a test set by ratio = 0.6
tmp <- breastCancer %>%
  split_by(Class, ratio = 0.6)

The summary() function displays the following useful information about the split_df object:

# summary() displays some information
summary(sb)
** Split train/test set information **
 + random seed        :  32364 
 + split data            
    - train set count :  489 
    - test set count  :  210 
 + target variable    :  Class 
    - minority class  :  malignant (0.344778)
    - majority class  :  benign (0.655222)

# summary() displays some information
summary(tmp)
** Split train/test set information **
 + random seed        :  27827 
 + split data            
    - train set count :  419 
    - test set count  :  280 
 + target variable    :  Class 
    - minority class  :  malignant (0.344778)
    - majority class  :  benign (0.655222)

Check missing levels in the train set

In the case of categorical variables, when the data is separated into a train set and a test set, a specific level may be missing from the train set.

In this case, there is no problem when fitting the model, but an error occurs when predicting with the fitted model. Therefore, preprocess the data so that no level is missing from the train set.

In the following example, fortunately, no categorical variable has a level missing from the train set.
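The error this check guards against can be reproduced with a toy base-R model (a hypothetical illustration, not part of alookr): a model fitted on a factor that lacks a level fails at prediction time when that level appears in new data.

```r
# Hypothetical toy example: an lm() fitted without level "c" in x
# cannot predict new data that contains "c".
train_df <- data.frame(x = factor(c("a", "b", "a", "b")),
                       y = c(1.2, 3.4, 1.1, 3.6))
fit <- lm(y ~ x, data = train_df)

new_df <- data.frame(x = factor("c"))
msg <- tryCatch(predict(fit, new_df), error = function(e) conditionMessage(e))
msg  # error message about a new factor level
```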

# list of categorical variables in the train set that contain missing levels
nolevel_in_train <- sb %>%
  compare_target_category() %>% 
  filter(is.na(train)) %>% 
  select(variable) %>% 
  unique() %>% 
  pull

nolevel_in_train
character(0)

# if any of the categorical variables in the train set contain a missing level, 
# split them again.
while (length(nolevel_in_train) > 0) {
  sb <- breastCancer %>%
    split_by(Class)

  nolevel_in_train <- sb %>%
    compare_target_category() %>% 
    filter(is.na(train)) %>% 
    select(variable) %>% 
    unique() %>% 
    pull
}

Handling imbalanced class data with sampling_target()

The issue of imbalanced class data

Imbalanced class (level) data means that the frequency of one level of the target variable is relatively small. In general, the proportion of the positive class is small. For example, in a spam prediction model, the class of interest, spam, occurs less often than non-spam.

Imbalanced class data is a common problem in machine learning classification.

table() and prop.table() are traditionally useful functions for diagnosing imbalanced class data. However, alookr’s summary() is simpler and provides more information.

# train set frequency table - imbalanced classes data
table(sb$Class)

   benign malignant 
      458       241 

# train set relative frequency table - imbalanced classes data
prop.table(table(sb$Class))

   benign malignant 
0.6552217 0.3447783 

# using summary function - imbalanced classes data
summary(sb)
** Split train/test set information **
 + random seed        :  32364 
 + split data            
    - train set count :  489 
    - test set count  :  210 
 + target variable    :  Class 
    - minority class  :  malignant (0.344778)
    - majority class  :  benign (0.655222)

Handling imbalanced class data

Most machine learning algorithms work best when the number of samples in each class is about equal, because most algorithms are designed to maximize accuracy and reduce error. So we need to handle the imbalanced class problem.

sampling_target() performs sampling to solve an imbalanced classes data problem.

Resampling - oversample minority class

Oversampling can be defined as adding more copies of the minority class.

Oversampling is performed by specifying “ubOver” in the method argument of the sampling_target() function.

# balance the classes by oversampling
train_over <- sb %>%
  sampling_target(method = "ubOver")

# frequency table 
table(train_over$Class)

   benign malignant 
      312       312 

Resampling - undersample majority class

Undersampling can be defined as removing some observations of the majority class.

Undersampling is performed by specifying “ubUnder” in the method argument of the sampling_target() function.

# balance the classes by undersampling
train_under <- sb %>%
  sampling_target(method = "ubUnder")

# frequency table 
table(train_under$Class)

   benign malignant 
      177       177 
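What the ubOver and ubUnder methods do can be sketched with base R's sample() (an illustrative sketch of the resampling idea, not alookr's implementation):

```r
# Illustrative sketch: rebalance a 458/241 class vector by resampling indices.
set.seed(123)
cls <- factor(rep(c("benign", "malignant"), times = c(458, 241)))

# oversampling: draw the minority class with replacement up to the majority count
over_idx <- c(which(cls == "benign"),
              sample(which(cls == "malignant"), 458, replace = TRUE))
table(cls[over_idx])   # benign 458, malignant 458

# undersampling: draw the majority class without replacement down to the minority count
under_idx <- c(sample(which(cls == "benign"), 241),
               which(cls == "malignant"))
table(cls[under_idx])  # benign 241, malignant 241
```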

Generate synthetic samples - SMOTE

SMOTE (Synthetic Minority Oversampling Technique) uses a nearest-neighbors algorithm to generate new, synthetic data.

SMOTE is performed by specifying “ubSMOTE” in the method argument of the sampling_target() function.

# balance the classes with SMOTE
train_smote <- sb %>%
  sampling_target(seed = 1234L, method = "ubSMOTE")

# frequency table 
table(train_smote$Class)

   benign malignant 
      708       531 
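The idea behind the synthetic samples can be sketched in a few lines of base R (a hypothetical simplification, not the ubSMOTE implementation): a new minority point is placed at a random position on the segment between a minority observation and one of its minority-class neighbors.

```r
# Simplified SMOTE step: interpolate between a minority point and a neighbor.
set.seed(42)
minority <- matrix(c(1.0, 1.0,
                     2.0, 1.0,
                     1.5, 2.0), ncol = 2, byrow = TRUE)

smote_one <- function(x, neighbors) {
  nn  <- neighbors[sample(nrow(neighbors), 1), ]  # pick a random neighbor
  gap <- runif(1)                                 # interpolation weight in [0, 1]
  x + gap * (nn - x)                              # point on the connecting segment
}

synthetic <- smote_one(minority[1, ], minority[-1, ])
synthetic  # a new point between the first observation and one of its neighbors
```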

Cleansing the dataset for classification modeling with cleanse()

cleanse() cleanses the dataset for classification modeling.

This function is useful when fitting the classification model. It does the following:

- Remove variables whose unique value count is one
- Remove variables with a high unique rate, such as identifiers
- Convert character variables to factors

In this example, cleanse() removes the Id variable, which has a high unique rate.

# clean the training set
train <- train_smote %>%
  cleanse
─ Checking unique value ────────────── unique value is one ─
No variables that unique value is one.

─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate
● Id = 431(0.347861178369653)

─ Checking character variables ──────────── categorical data ─
No character variables.

Extract test set for evaluation of the model with extract_set()

# extract test set
test <- sb %>%
  extract_set(set = "test")

Binary classification modeling with run_models()

run_models() performs several representative binary classification fits using the split_df object created by split_by().

run_models() executes the fitting process in parallel. However, parallel processing is not supported on the MS Windows operating system or in the RStudio environment.

Currently supported algorithms are as follows:

- logistic: logistic regression
- rpart: recursive partitioning tree
- ctree: conditional inference tree
- randomForest: random forest (randomForest)
- ranger: random forest (ranger)
- xgboost: gradient boosting (xgboost)

run_models() returns a model_df class object.

The model_df class object contains the following variables:

- step: character. The current stage of the modeling process; here "1.Fitted"
- model_id: character. The name of the fitted algorithm
- target: character. The name of the target variable
- is_factor: logical. Whether the target variable is a factor
- positive: character. The level of the positive class
- negative: character. The level of the negative class
- fitted_model: list. The fitted model objects

result <- train %>% 
  run_models(target = "Class", positive = "malignant")
result
# A tibble: 6 x 7
  step     model_id     target is_factor positive  negative fitted_model
  <chr>    <chr>        <chr>  <lgl>     <chr>     <chr>    <list>      
1 1.Fitted logistic     Class  TRUE      malignant benign   <glm>       
2 1.Fitted rpart        Class  TRUE      malignant benign   <rpart>     
3 1.Fitted ctree        Class  TRUE      malignant benign   <BinaryTr>  
4 1.Fitted randomForest Class  TRUE      malignant benign   <rndmFrs.>  
5 1.Fitted ranger       Class  TRUE      malignant benign   <ranger>    
6 1.Fitted xgboost      Class  TRUE      malignant benign   <xgb.Bstr>  

Evaluate the model

Evaluate the predictive performance of fitted models.

Predict test set using fitted model with run_predict()

run_predict() predicts the test set using the model_df class object fitted by run_models().

run_predict() is executed in parallel when predicting by model. However, parallel processing is not supported on the MS Windows operating system or in the RStudio environment.

The returned model_df class object contains the following variables:

- step, model_id, target, is_factor, positive, negative, fitted_model: as in the output of run_models()
- predicted: list. The predicted values from each model

pred <- result %>%
  run_predict(test)
pred
# A tibble: 6 x 8
  step     model_id   target is_factor positive  negative fitted_model predicted
  <chr>    <chr>      <chr>  <lgl>     <chr>     <chr>    <list>       <list>   
1 2.Predi… logistic   Class  TRUE      malignant benign   <glm>        <fct [21…
2 2.Predi… rpart      Class  TRUE      malignant benign   <rpart>      <fct [21…
3 2.Predi… ctree      Class  TRUE      malignant benign   <BinaryTr>   <fct [21…
4 2.Predi… randomFor… Class  TRUE      malignant benign   <rndmFrs.>   <fct [21…
5 2.Predi… ranger     Class  TRUE      malignant benign   <ranger>     <fct [21…
6 2.Predi… xgboost    Class  TRUE      malignant benign   <xgb.Bstr>   <fct [21…

Calculate the performance metric with run_performance()

run_performance() calculates the performance metrics of the model_df class object predicted by run_predict().

run_performance() is executed in parallel when calculating the performance evaluation metrics. However, parallel processing is not supported on the MS Windows operating system or in the RStudio environment.

The returned model_df class object contains the following variables:

- step, model_id, target, positive, fitted_model, predicted: as in the output of run_predict()
- performance: list. The calculated performance metrics of each model

# Calculate performance metrics.
perf <- run_performance(pred)
perf
# A tibble: 6 x 7
  step          model_id     target positive fitted_model predicted  performance
  <chr>         <chr>        <chr>  <chr>    <list>       <list>     <list>     
1 3.Performanc… logistic     Class  maligna… <glm>        <fct [210… <dbl [15]> 
2 3.Performanc… rpart        Class  maligna… <rpart>      <fct [210… <dbl [15]> 
3 3.Performanc… ctree        Class  maligna… <BinaryTr>   <fct [210… <dbl [15]> 
4 3.Performanc… randomForest Class  maligna… <rndmFrs.>   <fct [210… <dbl [15]> 
5 3.Performanc… ranger       Class  maligna… <ranger>     <fct [210… <dbl [15]> 
6 3.Performanc… xgboost      Class  maligna… <xgb.Bstr>   <fct [210… <dbl [15]> 

The performance variable contains a list object, which contains 15 performance metrics:

# Performance by analytics models
performance <- perf$performance
names(performance) <- perf$model_id
performance
$logistic
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.06190476  0.93809524  0.85915493  0.95312500  0.95312500  0.93150685 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.90370370  0.90370370  2.13815281  0.94167380  0.88762842  0.02620536 
    LiftAUC     GainAUC     KS_Stat 
 1.23148426  0.80706845 88.46318493 

$rpart
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.05238095  0.94761905  0.90769231  0.92187500  0.92187500  0.95890411 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.91472868  0.91472868  0.21225380  0.94750642  0.90517979  0.89276781 
    LiftAUC     GainAUC     KS_Stat 
 2.20619128  0.81112351 88.07791096 

$ctree
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.05238095  0.94761905  0.90769231  0.92187500  0.92187500  0.95890411 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.91472868  0.91472868  0.93401643  0.95649615  0.91588185  0.64694926 
    LiftAUC     GainAUC     KS_Stat 
 1.95823369  0.81737351 89.83304795 

$randomForest
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.04285714  0.95714286  0.91044776  0.95312500  0.95312500  0.95890411 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.93129771  0.93129771  0.11905365  0.98983305  0.97988014  0.69157170 
    LiftAUC     GainAUC     KS_Stat 
 1.96479234  0.84055060 94.52054795 

$ranger
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.03809524  0.96190476  0.92424242  0.95312500  0.95312500  0.96575342 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.93846154  0.93846154  0.10293743  0.99090325  0.98180651  0.89824963 
    LiftAUC     GainAUC     KS_Stat 
 2.12464131  0.84129464 95.20547945 

$xgboost
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.04285714  0.95714286  0.92307692  0.93750000  0.93750000  0.96575342 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.93023256  0.93023256  0.13026789  0.98651541  0.97281678  0.88623641 
    LiftAUC     GainAUC     KS_Stat 
 2.13764399  0.83824405 95.20547945 

If you change the list object to tidy format, you’ll see the following at a glance:

# Convert to a matrix to compare performance.
sapply(performance, "c")
               logistic       rpart       ctree randomForest      ranger
ZeroOneLoss  0.06190476  0.05238095  0.05238095   0.04285714  0.03809524
Accuracy     0.93809524  0.94761905  0.94761905   0.95714286  0.96190476
Precision    0.85915493  0.90769231  0.90769231   0.91044776  0.92424242
Recall       0.95312500  0.92187500  0.92187500   0.95312500  0.95312500
Sensitivity  0.95312500  0.92187500  0.92187500   0.95312500  0.95312500
Specificity  0.93150685  0.95890411  0.95890411   0.95890411  0.96575342
F1_Score     0.90370370  0.91472868  0.91472868   0.93129771  0.93846154
Fbeta_Score  0.90370370  0.91472868  0.91472868   0.93129771  0.93846154
LogLoss      2.13815281  0.21225380  0.93401643   0.11905365  0.10293743
AUC          0.94167380  0.94750642  0.95649615   0.98983305  0.99090325
Gini         0.88762842  0.90517979  0.91588185   0.97988014  0.98180651
PRAUC        0.02620536  0.89276781  0.64694926   0.69157170  0.89824963
LiftAUC      1.23148426  2.20619128  1.95823369   1.96479234  2.12464131
GainAUC      0.80706845  0.81112351  0.81737351   0.84055060  0.84129464
KS_Stat     88.46318493 88.07791096 89.83304795  94.52054795 95.20547945
                xgboost
ZeroOneLoss  0.04285714
Accuracy     0.95714286
Precision    0.92307692
Recall       0.93750000
Sensitivity  0.93750000
Specificity  0.96575342
F1_Score     0.93023256
Fbeta_Score  0.93023256
LogLoss      0.13026789
AUC          0.98651541
Gini         0.97281678
PRAUC        0.88623641
LiftAUC      2.13764399
GainAUC      0.83824405
KS_Stat     95.20547945
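Several of these metrics are simple functions of the confusion matrix. Using the counts TP = 61, FP = 5, FN = 3, TN = 141 (the best model's test-set counts with malignant as the positive class, shown later in the confusion-matrix comparison), they can be reproduced by hand:

```r
# Reproduce a few metrics from confusion-matrix counts (positive = malignant).
TP <- 61; FP <- 5; FN <- 3; TN <- 141

accuracy  <- (TP + TN) / (TP + FP + FN + TN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)   # identical to Sensitivity
f1        <- 2 * precision * recall / (precision + recall)

round(c(Accuracy = accuracy, Precision = precision,
        Recall = recall, F1_Score = f1), 7)
# Accuracy 0.9619048, Precision 0.9242424, Recall 0.9531250, F1_Score 0.9384615
```

These values match the ranger column of the metric matrix above.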

compare_performance() returns a list object with the results of the model performance comparison. The list has the following components:

- recommend_model: the name of the model with the best performance
- top_metric_count: the number of metrics on which each model scored best
- mean_rank: the mean rank of each model across all metrics
- top_metric: the metrics on which each model scored best

In this example, compare_performance() recommends the “ranger” model.

# Compare the performance metrics of each model
comp_perf <- compare_performance(pred)
comp_perf
$recommend_model
[1] "ranger"

$top_metric_count
    logistic        rpart        ctree randomForest       ranger      xgboost 
           1            1            0            1           12            2 

$mean_rank
    logistic        rpart        ctree randomForest       ranger      xgboost 
    5.615385     4.269231     4.500000     2.692308     1.307692     2.615385 

$top_metric
$top_metric$logistic
[1] "Recall"

$top_metric$rpart
[1] "LiftAUC"

$top_metric$ctree
NULL

$top_metric$randomForest
[1] "Recall"

$top_metric$ranger
 [1] "ZeroOneLoss" "Accuracy"    "Precision"   "Recall"      "Specificity"
 [6] "F1_Score"    "LogLoss"     "AUC"         "Gini"        "PRAUC"      
[11] "GainAUC"     "KS_Stat"    

$top_metric$xgboost
[1] "Specificity" "KS_Stat"    

Plot the ROC curve with plot_performance()

plot_performance() plots the ROC curve of each model.

# Plot ROC curve
plot_performance(pred)

Tuning the cut-off

In general, if the predicted probability is greater than 0.5 in a binary classification model, the observation is predicted as the positive class; that is, 0.5 is used as the cut-off value. This applies to most model algorithms. However, in some cases performance can be improved by tuning the cut-off value.

plot_cutoff() draws a plot for selecting the cut-off value and returns that value.
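The mechanics of a cut-off amount to thresholding the positive-class probability; a minimal base-R sketch (with illustrative probabilities, not alookr internals):

```r
# Label observations by comparing predicted probabilities to a cut-off.
prob <- c(0.05, 0.42, 0.38, 0.91, 0.63)   # illustrative probabilities

label_at <- function(p, cutoff) {
  factor(ifelse(p > cutoff, "malignant", "benign"),
         levels = c("benign", "malignant"))
}

table(label_at(prob, 0.5))   # default cut-off: 2 malignant
table(label_at(prob, 0.38))  # lower cut-off flags more cases: 3 malignant
```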

pred_best <- pred %>% 
  filter(model_id == comp_perf$recommend_model) %>% 
  select(predicted) %>% 
  pull %>% 
  .[[1]] %>% 
  attr("pred_prob")

cutoff <- plot_cutoff(pred_best, test$Class, "malignant", type = "mcc")

cutoff
[1] 0.38

cutoff2 <- plot_cutoff(pred_best, test$Class, "malignant", type = "density")

cutoff2
[1] 0.9804

cutoff3 <- plot_cutoff(pred_best, test$Class, "malignant", type = "prob")

cutoff3
[1] 0.38

Performance comparison between prediction and tuned cut-off with performance_metric()

Compare the performance of the original prediction with that of the tuned cut-off. For the best-performing model, comp_perf$recommend_model, compare the results with and without the cut-off.

comp_perf$recommend_model
[1] "ranger"

# extract predicted probability
idx <- which(pred$model_id == comp_perf$recommend_model)
pred_prob <- attr(pred$predicted[[idx]], "pred_prob")

# or, extract predicted probability using dplyr
pred_prob <- pred %>% 
  filter(model_id == comp_perf$recommend_model) %>% 
  select(predicted) %>% 
  pull %>% 
  "[["(1) %>% 
  attr("pred_prob")

# predicted probability
pred_prob  
  [1] 0.000000e+00 8.956992e-01 0.000000e+00 0.000000e+00 4.510317e-03
  [6] 0.000000e+00 9.090730e-01 9.968857e-01 2.380952e-05 0.000000e+00
 [11] 0.000000e+00 5.457875e-05 0.000000e+00 9.993333e-01 9.576333e-01
 [16] 9.957397e-01 9.968857e-01 7.762913e-01 8.746698e-01 0.000000e+00
 [21] 3.981810e-01 9.692762e-01 9.166063e-01 1.718810e-02 2.644444e-03
 [26] 6.176984e-03 0.000000e+00 0.000000e+00 1.891317e-01 0.000000e+00
 [31] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
 [36] 0.000000e+00 9.941833e-01 9.921389e-01 0.000000e+00 9.251929e-01
 [41] 5.418254e-03 2.289040e-01 9.064603e-01 7.496825e-03 0.000000e+00
 [46] 1.000000e+00 9.524913e-01 0.000000e+00 0.000000e+00 4.177778e-02
 [51] 0.000000e+00 0.000000e+00 9.759103e-01 0.000000e+00 0.000000e+00
 [56] 9.725032e-01 4.195738e-01 0.000000e+00 3.135714e-03 5.457875e-05
 [61] 9.772500e-01 0.000000e+00 0.000000e+00 0.000000e+00 9.771071e-01
 [66] 5.457875e-05 5.457875e-05 9.968333e-01 9.782214e-01 9.988000e-01
 [71] 1.964183e-01 7.760404e-02 0.000000e+00 9.369984e-01 1.647016e-01
 [76] 9.995278e-01 9.911000e-01 0.000000e+00 9.071000e-01 9.993500e-01
 [81] 9.988000e-01 8.822563e-01 0.000000e+00 9.813341e-01 8.822563e-01
 [86] 8.161373e-01 0.000000e+00 0.000000e+00 9.251659e-01 9.994286e-01
 [91] 9.913381e-01 9.497095e-01 4.665143e-01 7.142825e-01 1.000000e+00
 [96] 0.000000e+00 0.000000e+00 9.361508e-01 0.000000e+00 5.457875e-05
[101] 8.900730e-01 9.418556e-01 1.409167e-01 3.884810e-01 9.388579e-01
[106] 9.971056e-01 9.838714e-01 0.000000e+00 5.457875e-05 0.000000e+00
[111] 1.551230e-01 0.000000e+00 7.266127e-01 6.535778e-01 1.000000e+00
[116] 0.000000e+00 9.986032e-01 0.000000e+00 3.011111e-03 0.000000e+00
[121] 0.000000e+00 1.297381e-02 0.000000e+00 9.953889e-01 0.000000e+00
[126] 0.000000e+00 9.992063e-01 0.000000e+00 0.000000e+00 0.000000e+00
[131] 9.991429e-01 0.000000e+00 9.510087e-01 2.837349e-01 0.000000e+00
[136] 9.992921e-01 9.943722e-01 0.000000e+00 1.179206e-02 0.000000e+00
[141] 0.000000e+00 9.224413e-01 0.000000e+00 0.000000e+00 1.323437e-01
[146] 0.000000e+00 9.899183e-01 4.314390e-01 0.000000e+00 5.457875e-05
[151] 5.441270e-03 0.000000e+00 0.000000e+00 2.027269e-04 1.000000e+00
[156] 0.000000e+00 3.548793e-02 5.457875e-05 0.000000e+00 0.000000e+00
[161] 0.000000e+00 9.966286e-01 0.000000e+00 0.000000e+00 0.000000e+00
[166] 0.000000e+00 0.000000e+00 0.000000e+00 1.116667e-03 5.457875e-05
[171] 0.000000e+00 0.000000e+00 0.000000e+00 5.457875e-05 0.000000e+00
[176] 0.000000e+00 2.380952e-05 0.000000e+00 0.000000e+00 0.000000e+00
[181] 5.208492e-02 9.949151e-01 0.000000e+00 0.000000e+00 0.000000e+00
[186] 9.987333e-01 0.000000e+00 3.877619e-02 0.000000e+00 0.000000e+00
[191] 0.000000e+00 0.000000e+00 0.000000e+00 9.989921e-01 5.457875e-05
[196] 1.326817e-01 0.000000e+00 1.350404e-02 1.029524e-02 0.000000e+00
[201] 0.000000e+00 9.959373e-01 0.000000e+00 0.000000e+00 0.000000e+00
[206] 0.000000e+00 7.363492e-03 0.000000e+00 1.000000e+00 8.456063e-01

# compare Accuracy
performance_metric(pred_prob, test$Class, "malignant", "Accuracy")
[1] 0.9619048
performance_metric(pred_prob, test$Class, "malignant", "Accuracy",
                   cutoff = cutoff)
[1] 0.9666667

# compare Confusion Matrix
performance_metric(pred_prob, test$Class, "malignant", "ConfusionMatrix")
           actual
predict     benign malignant
  benign       141         3
  malignant      5        61
performance_metric(pred_prob, test$Class, "malignant", "ConfusionMatrix", 
                   cutoff = cutoff)
           actual
predict     benign malignant
  benign       139         0
  malignant      7        64

# compare F1 Score
performance_metric(pred_prob, test$Class, "malignant", "F1_Score")
[1] 0.9384615
performance_metric(pred_prob, test$Class,  "malignant", "F1_Score", 
                   cutoff = cutoff)
[1] 0.9481481
performance_metric(pred_prob, test$Class,  "malignant", "F1_Score", 
                   cutoff = cutoff2)
[1] 0.6938776

If the performance of the tuned cut-off is good, use it as a cut-off to predict positives.

Predict

If you have selected a good model from several models, then perform the prediction with that model.

Create data set for predict

Create sample data for prediction by extracting 50 samples from the data set used in the previous under-sampling example.

data_pred <- train_under %>% 
  cleanse 
─ Checking unique value ────────────── unique value is one ─
No variables that unique value is one.

─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate
● Id = 346(0.977401129943503)

─ Checking character variables ──────────── categorical data ─
No character variables.

set.seed(1234L)
data_pred <- data_pred %>% 
  nrow %>% 
  seq %>% 
  sample(size = 50) %>% 
  data_pred[., ]

Predict with alookr and dplyr

Perform the prediction using the dplyr package. The final factor() call drops unnecessary attributes from the result.

pred_actual <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>% 
  run_predict(data_pred) %>% 
  select(predicted) %>% 
  pull %>% 
  "[["(1) %>% 
  factor()

pred_actual
 [1] malignant benign    benign    benign    benign    benign    malignant
 [8] benign    benign    malignant malignant malignant malignant benign   
[15] malignant malignant benign    benign    malignant benign    benign   
[22] malignant malignant benign    benign    malignant malignant malignant
[29] benign    malignant benign    benign    malignant benign    malignant
[36] benign    malignant benign    benign    malignant benign    malignant
[43] benign    benign    benign    malignant benign    malignant benign   
[50] benign   
Levels: benign malignant

If you want to predict with the tuned cut-off, specify the cutoff argument in the run_predict() function as follows:

In this example, there is no difference between the results with and without the cut-off.

pred_actual2 <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>% 
  run_predict(data_pred, cutoff) %>% 
  select(predicted) %>% 
  pull %>% 
  "[["(1) %>% 
  factor()

pred_actual2
 [1] malignant benign    benign    benign    benign    benign    malignant
 [8] benign    benign    malignant malignant malignant malignant benign   
[15] malignant malignant benign    benign    malignant benign    benign   
[22] malignant malignant benign    benign    malignant malignant malignant
[29] benign    malignant benign    benign    malignant benign    malignant
[36] benign    malignant benign    benign    malignant benign    malignant
[43] benign    benign    benign    malignant benign    malignant benign   
[50] benign   
Levels: benign malignant

sum(pred_actual != pred_actual2)
[1] 0