Using ‘splitTools’

Overview

{splitTools} is a fast, lightweight toolkit for data splitting.

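Its two main functions, partition() and create_folds(), support data partitioning (for example, into training, validation, and test sets), cross-validation folds (optionally repeated), and stratified splitting.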

The function create_timefolds() does time-series splitting in the sense that the out-of-sample data follows the in-sample data.
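
To make the idea of "out-of-sample follows in-sample" concrete, the following base R sketch builds extending-window time folds by hand. It is only an illustration under simplifying assumptions, not the code behind create_timefolds(); the helper name time_folds_sketch() and its arguments are hypothetical.

```r
# Hypothetical helper: cut n time-ordered rows into k + 1 consecutive blocks.
# Fold i uses blocks 1..i as in-sample data and block i + 1 as out-of-sample data.
time_folds_sketch <- function(n, k = 5) {
  block <- ceiling(seq_len(n) * (k + 1) / n)
  lapply(seq_len(k), function(i) {
    list(
      insample = which(block <= i),
      outsample = which(block == i + 1)
    )
  })
}

folds <- time_folds_sketch(n = 24, k = 3)

# In every fold, all out-of-sample rows come after all in-sample rows
sapply(folds, function(f) max(f$insample) < min(f$outsample))
```

The in-sample window grows from fold to fold in this sketch; a moving window of fixed length would be an equally valid design choice.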

We will now illustrate how to use {splitTools} in a typical modeling workflow.

Usage

Simple validation

We will go through the following steps:

  1. We split the iris data into 60% training, 20% validation, and 20% test data, stratified by the variable Sepal.Length. Since this variable is numeric, stratification uses quantile binning.
  2. We will model the response Sepal.Length with a linear regression, once with and once without interaction between Species and Sepal.Width.
  3. After selecting the better of the two models via validation RMSE, we evaluate the final model on the test data.

library(splitTools)

# Split data into partitions
set.seed(3451)
inds <- partition(iris$Sepal.Length, p = c(train = 0.6, valid = 0.2, test = 0.2))
str(inds)
#> List of 3
#>  $ train: int [1:81] 2 3 6 7 8 10 11 18 19 20 ...
#>  $ valid: int [1:34] 1 12 14 15 27 34 36 38 42 48 ...
#>  $ test : int [1:35] 4 5 9 13 16 17 25 39 41 45 ...

train <- iris[inds$train, ]
valid <- iris[inds$valid, ]
test <- iris[inds$test, ]

rmse <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}

# Use simple validation to decide on interaction yes/no...
fit1 <- lm(Sepal.Length ~ ., data = train)
fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = train)

rmse(valid$Sepal.Length, predict(fit1, valid))
#> [1] 0.3020855
rmse(valid$Sepal.Length, predict(fit2, valid))
#> [1] 0.2954321

# Yes! Choose and test final model
rmse(test$Sepal.Length, predict(fit2, test))
#> [1] 0.3482849
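
Step 1 of this workflow relies on quantile binning to stratify on a numeric variable. The base R sketch below illustrates the general idea for a simple train/test split. It is not the implementation used by partition(); the helper name stratified_split_sketch(), the 80/20 split, and the choice of five bins are assumptions made for this example only.

```r
# Hypothetical helper: stratified train/test split on a numeric variable
stratified_split_sketch <- function(y, p_train = 0.8, n_bins = 5) {
  # Bin y at its quantiles so that each bin holds roughly the same number of rows
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = n_bins + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)

  # Sample the training share separately within each bin
  train <- unlist(lapply(split(seq_along(y), bins), function(idx) {
    idx[sample.int(length(idx), size = round(p_train * length(idx)))]
  }))
  sort(unname(train))
}

set.seed(1)
train_idx <- stratified_split_sketch(iris$Sepal.Length)

# Both parts should show similar distributions of the stratification variable
summary(iris$Sepal.Length[train_idx])
summary(iris$Sepal.Length[-train_idx])
```

Because sampling happens within each bin, low, medium, and high values of the variable all appear in both parts in roughly the requested proportions.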

CV

Since the iris data consists of only 150 rows, setting aside 20% of the observations for validation is wasteful, and performance estimates based on so few rows might not be very robust. Let’s replace simple validation with five-fold cross-validation (CV), again stratifying on the response variable.

  1. Split iris into 80% training data and 20% test, stratified by the variable Sepal.Length.
  2. Use stratified five-fold CV to choose between the two models.
  3. Evaluate the final model on the test data.

# Split into training and test
inds <- partition(iris$Sepal.Length, p = c(train = 0.8, test = 0.2), seed = 87)

train <- iris[inds$train, ]
test <- iris[inds$test, ]

# Get stratified CV in-sample indices
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734)

# Vectors with results per model and fold
cv_rmse1 <- cv_rmse2 <- numeric(5)

for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]
  
  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)
  
  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}

# CV-RMSE of model 1 -> close winner
mean(cv_rmse1)
#> [1] 0.330189

# CV-RMSE of model 2
mean(cv_rmse2)
#> [1] 0.3306455

# Fit model 1 on full training data and evaluate on test data
final_fit <- lm(Sepal.Length ~ ., data = train)
rmse(test$Sepal.Length, predict(final_fit, test))
#> [1] 0.2892289
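
The loop above repeats the same steps for every candidate model. As a sketch, it could be wrapped in a small helper that takes a formula and a list of in-sample index vectors, the format used for folds above. The helper name cv_rmse() is hypothetical. To keep the example self-contained, the folds are built here with plain random assignment in base R, so they are not stratified, and the full iris data stands in for the training data.

```r
rmse <- function(y, pred) sqrt(mean((y - pred)^2))

# Hypothetical helper: mean out-of-fold RMSE of a linear model.
# `folds` is a list of in-sample row indices, one vector per fold.
cv_rmse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]
  per_fold <- vapply(folds, function(ins) {
    fit <- lm(formula, data = data[ins, ])
    out <- data[-ins, ]
    rmse(out[[response]], predict(fit, out))
  }, numeric(1))
  mean(per_fold)
}

# Assumption: simple random (unstratified) folds as a stand-in
set.seed(1)
fold_id <- sample(rep(1:5, length.out = nrow(iris)))
folds <- lapply(1:5, function(j) which(fold_id != j))

cv_rmse(Sepal.Length ~ ., iris, folds)
cv_rmse(Sepal.Length ~ . + Species:Sepal.Width, iris, folds)
```

With such a helper, adding a third candidate model only requires one more call instead of another set of result vectors.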

Repeated CV

If feasible, repeated CV is recommended because averaging over several fold assignments reduces the uncertainty in model decisions. Apart from the repetition, the process remains the same.

# Train/test split as before

# 15 folds (5 folds x 3 repetitions) instead of 5
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734, m_rep = 3)
cv_rmse1 <- cv_rmse2 <- numeric(15)

# Rest as before...
for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]
  
  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)
  
  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}

mean(cv_rmse1)
#> [1] 0.3296087
mean(cv_rmse2)
#> [1] 0.331373

# Refit and test as before
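
To see why repetition helps, the following base R sketch runs several complete CV passes, each with a fresh random fold assignment, and compares the per-pass results. It uses plain random folds on the full iris data rather than create_folds(), and the helper one_cv_pass() is hypothetical.

```r
rmse <- function(y, pred) sqrt(mean((y - pred)^2))

# One full k-fold CV pass with random (unstratified) folds; returns the mean RMSE
one_cv_pass <- function(formula, data, k = 5) {
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  mean(vapply(seq_len(k), function(j) {
    fit <- lm(formula, data = data[fold_id != j, ])
    out <- data[fold_id == j, ]
    rmse(out$Sepal.Length, predict(fit, out))
  }, numeric(1)))
}

set.seed(1)
# Three repetitions, each with its own fold assignment
reps <- replicate(3, one_cv_pass(Sepal.Length ~ ., iris))

reps        # CV-RMSE per repetition: shows how much a single pass can vary
mean(reps)  # repeated-CV estimate: average over all repetitions
```

If the per-repetition values differ by about as much as the gap between two candidate models, a decision based on a single CV pass would be fragile.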

Stratification on multiple columns

The function multi_strata() creates a stratification factor from multiple columns. This factor can then be passed to create_folds(, type = "stratified") or partition(, type = "stratified"). The resulting partitions will be roughly balanced with respect to these columns.

Two grouping strategies are offered:

  1. k-means clustering based on scaled input.
  2. All combinations of columns, where numeric input is binned.
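
The two strategies can be sketched in base R as follows. This is a rough illustration of the ideas, not the actual implementation of multi_strata(); the conversion of factors to integer codes, the number of clusters, and the number of bins are assumptions made for this example.

```r
x <- iris[c("Sepal.Length", "Species")]

# Strategy 1 (sketch): k-means clustering on scaled input.
# Assumption: factors are converted to integer codes before scaling.
set.seed(1)
x_num <- scale(data.matrix(x))
strata_kmeans <- factor(kmeans(x_num, centers = 5)$cluster)

# Strategy 2 (sketch): all combinations, with numeric columns binned first.
# Assumption: three quantile-based bins for the numeric column.
breaks <- quantile(x$Sepal.Length, probs = seq(0, 1, length.out = 4))
length_bin <- cut(x$Sepal.Length, breaks = breaks, include.lowest = TRUE)
strata_comb <- interaction(length_bin, x$Species, drop = TRUE)

# Either factor could serve as the stratification variable
table(strata_kmeans)
table(strata_comb)
```

Clustering keeps the number of strata fixed, whereas the number of combinations grows quickly with the number of columns and can leave some strata very small.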

Let’s have a look at a simple example where we want to model “Sepal.Length” as a function of the other variables in the iris data set. We want a stratified train/valid/test split that is balanced not only with respect to the response “Sepal.Length”, but also with respect to the important predictor “Species”. In this case, we could use the following workflow:

set.seed(3451)

ir <- iris[c("Sepal.Length", "Species")]
y <- multi_strata(ir, k = 5)
inds <- partition(
  y, p = c(train = 0.6, valid = 0.2, test = 0.2), split_into_list = FALSE
)

# Check
by(ir, inds, summary)
#> inds: train
#>   Sepal.Length         Species  
#>  Min.   :4.300   setosa    :30  
#>  1st Qu.:5.100   versicolor:30  
#>  Median :5.800   virginica :30  
#>  Mean   :5.836                  
#>  3rd Qu.:6.400                  
#>  Max.   :7.700                  
#> ------------------------------------------------------------ 
#> inds: valid
#>   Sepal.Length         Species  
#>  Min.   :4.400   setosa    :10  
#>  1st Qu.:5.425   versicolor:10  
#>  Median :5.900   virginica :10  
#>  Mean   :5.903                  
#>  3rd Qu.:6.300                  
#>  Max.   :7.900                  
#> ------------------------------------------------------------ 
#> inds: test
#>   Sepal.Length         Species  
#>  Min.   :4.700   setosa    :10  
#>  1st Qu.:5.100   versicolor:10  
#>  Median :5.700   virginica :10  
#>  Mean   :5.807                  
#>  3rd Qu.:6.475                  
#>  Max.   :7.100