The motivation for this package is to provide functions which help with the development and tuning of machine learning models in biomedical data, where the sample size is frequently limited but the number of predictors may be significantly larger (P >> n). While most machine learning pipelines involve splitting data into training and testing cohorts, typically 2/3 and 1/3 respectively, medical datasets may be too small for this, so that estimates of accuracy on the left-out test set become unreliable simply because the test set is small. Nested cross-validation (CV) provides a way around this by maximising use of the whole dataset for testing overall accuracy, while maintaining the split between training and testing.
In addition, typical biomedical datasets often have tens of thousands of possible predictors, so filtering of predictors is commonly needed. However, it has been demonstrated that filtering on the whole dataset creates a bias when determining the accuracy of models (Vabalas et al, 2019). Feature selection should be considered an integral part of a model and performed only on training data; the selected features and accompanying model can then be tested on hold-out test data without bias. Thus, it is recommended that any filtering of predictors is performed within the CV loops, to prevent test data information leakage.
This package enables nested cross-validation (CV) to be performed using the commonly used glmnet package, which fits elastic net regression models, and the caret package, which is a general framework for fitting a large number of machine learning models. In addition, nestedcv adds functionality to enable cross-validation of the elastic net alpha parameter when fitting glmnet models.
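For example, a random forest could be wrapped in nested CV via the caret interface. The following is a minimal sketch (not run), assuming nestcv.train() accepts a caret method name together with the same filterFUN / filter_options arguments as nestcv.glmnet(); "rf" is used purely for illustration, and y and x denote the response and predictor matrix as in the example further below.

## Sketch (not run): nested CV around a caret model
## Assumes nestcv.train() takes the response y, the predictor matrix x,
## a caret method name, and the same filter arguments as nestcv.glmnet()
fit_rf <- nestcv.train(y, x, method = "rf",
                       filterFUN = ttest_filter,
                       filter_options = list(nfilter = 100))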
nestedcv partitions the dataset into outer and inner folds (default 10x10). The inner fold CV (default 10-fold) is used to tune optimal hyperparameters for the model. The model is then fitted on the whole inner fold and tested on the left-out data from the outer fold. This is repeated across all outer folds, and the unseen test predictions from the outer folds are compared against the true results for the outer test folds; the results are concatenated to give measures of accuracy (e.g. AUC and accuracy for classification, or RMSE for regression) across the whole dataset.
A final round of CV is performed on the whole dataset to determine the hyperparameters used to fit the final model to all of the data, which can then be used for prediction on external data.
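The number of outer and inner folds can be changed from the 10x10 default. The following is a minimal sketch (not run), assuming the n_outer_folds, n_inner_folds and cv.cores arguments of nestcv.glmnet(); again y and x denote the response and predictor matrix as in the example below.

## Sketch (not run): 5 outer folds x 8 inner folds instead of the 10x10 default
## n_outer_folds, n_inner_folds and cv.cores are assumed arguments here;
## cv.cores is assumed to parallelise the outer CV loop
fit <- nestcv.glmnet(y, x, family = "binomial",
                     n_outer_folds = 5, n_inner_folds = 8,
                     cv.cores = 2)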
While some models such as glmnet allow for sparsity and have variable selection built-in, many models fail to fit when given massive numbers of predictors, or perform poorly due to overfitting without variable selection. In addition, in medicine one of the goals of predictive modelling is commonly the development of diagnostic or biomarker tests, for which reducing the number of predictors is typically a practical necessity.
Several filter functions (t-test, Wilcoxon test, ANOVA, Pearson/Spearman correlation, random forest variable importance, and ReliefF from the CORElearn package) for feature selection are provided, and can be embedded within the outer loop of the nested CV.
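Custom filters can also be written. Below is a minimal sketch of a user-defined filter, assuming (consistent with how the built-in ttest_filter is used in the examples later in this document) that a filter function receives the response y and predictor matrix x, plus any extra arguments supplied through filter_options, and returns the indices of the columns to retain.

## Example custom filter: rank predictors by absolute Pearson correlation
## with the response (coerced to numeric) and keep the top 'nfilter' columns.
## The interface (y, x, ..., returning column indices) is an assumption based
## on how ttest_filter is used in the examples below.
cor_abs_filter <- function(y, x, nfilter = 100, ...) {
  r <- abs(cor(x, as.numeric(y)))
  order(r, decreasing = TRUE)[seq_len(min(nfilter, ncol(x)))]
}
## This could then be passed via filterFUN = cor_abs_filter,
## filter_options = list(nfilter = 100)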
install.packages("nestedcv")
library(nestedcv)
The following simulated example demonstrates the bias that arises in datasets where P >> n when filtering of predictors is applied to the whole dataset rather than only to the training folds.
## Example binary classification problem with P >> n
x <- matrix(rnorm(150 * 2e+04), 150, 2e+04) # predictors
y <- factor(rbinom(150, 1, 0.5)) # binary response
## Partition data into 2/3 training set, 1/3 test set
trainSet <- caret::createDataPartition(y, p = 0.66, list = FALSE)
## t-test filter using the whole dataset (training + test samples)
filt <- ttest_filter(y, x, nfilter = 100)
filx <- x[, filt]
## Train glmnet on training set only using filtered predictor matrix
library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.1-4
fit <- cv.glmnet(filx[trainSet, ], y[trainSet], family = "binomial")
## Predict response on test set
predy <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "class")
predy <- as.vector(predy)
predyp <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "response")
predyp <- as.vector(predyp)
output <- data.frame(testy = y[-trainSet], predy = predy, predyp = predyp)
## Results on test set
## shows bias since univariate filtering was applied to whole dataset
predSummary(output)
#> Reference
#> Predicted 0 1
#> 0 19 1
#> 1 6 24
#>
#> AUC Accuracy Balanced accuracy
#> 0.9632 0.8600 0.8600
## Nested CV
fit2 <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 7:10 / 10,
filterFUN = ttest_filter,
filter_options = list(nfilter = 100))
fit2
#> Nested cross-validation with glmnet
#> Filter: ttest_filter
#>
#> Final parameters:
#> lambda alpha
#> 0.0002422 0.7000000
#>
#> Final coefficients:
#> (Intercept) V8165 V6753 V5420 V2002 V15957
#> 0.158122 1.036604 -0.996862 0.942557 0.932080 -0.885451
#> V12161 V149 V14605 V10253 V13811 V19380
#> -0.855350 -0.808646 -0.763725 -0.728942 -0.710180 0.688520
#> V18617 V131 V17235 V1216 V11721 V2718
#> 0.683586 0.627403 0.626355 0.623360 -0.620367 0.608084
#> V5705 V15294 V13387 V6430 V6802 V8424
#> -0.603248 -0.577271 -0.570600 -0.564535 0.540707 0.540646
#> V12066 V16204 V302 V316 V5308 V11406
#> -0.501405 -0.495912 -0.475016 0.474889 0.468220 -0.458764
#> V177 V14701 V607 V14771 V9316 V12201
#> -0.454256 -0.453812 -0.447101 -0.445398 -0.438410 -0.425393
#> V16996 V19647 V14810 V19953 V3200 V6202
#> 0.417789 0.410985 0.410187 0.409565 0.405319 0.403458
#> V1401 V19283 V2622 V19730 V10210 V19132
#> -0.396163 0.384530 0.379192 0.344959 -0.297623 0.293044
#> V18637 V3796 V13096 V5074 V3248 V3823
#> -0.291865 0.273762 -0.271368 0.266954 -0.256296 -0.255859
#> V15329 V9712 V1049 V13601 V14588 V14550
#> -0.250955 0.249256 0.238189 -0.205895 0.204514 -0.196514
#> V2676 V11906 V5795 V10728 V14516 V7326
#> -0.183301 -0.177102 0.166737 0.157212 0.135911 0.127361
#> V10192 V1545 V11375 V7245 V19746 V11218
#> -0.122308 -0.119907 -0.118262 0.110202 0.107671 0.089807
#> V1014 V14395 V13375 V19104 V9469 V15006
#> 0.083563 0.075543 -0.075319 0.028984 0.021819 0.002434
#>
#> Result:
#> Reference
#> Predicted 0 1
#> 0 45 35
#> 1 31 39
#>
#> AUC Accuracy Balanced accuracy
#> 0.6143 0.5600 0.5596
testroc <- pROC::roc(output$testy, output$predyp, direction = "<", quiet = TRUE)
inroc <- innercv_roc(fit2)
plot(fit2$roc)
lines(inroc, col = 'blue')
lines(testroc, col = 'red')
legend('bottomright', legend = c("Nested CV", "Left-out inner CV folds",
"Test partition, non-nested filtering"),
col = c("black", "blue", "red"), lty = 1, lwd = 2, bty = "n")
In this example the dataset is pure noise, so no model should perform better than chance. Filtering of predictors on the whole dataset leaks information about the test set, leading to substantially over-optimistic performance on the test set as measured by ROC AUC, whereas nested CV with filtering inside the CV loops gives a far more realistic estimate.
Figures A & B below show two commonly used, but flawed, schemes in which cross-validation is used to fit models but the resulting estimate of model performance is biased. In scheme A, there is no hold-out test set at all, so there are two sources of bias/data leakage: first, the filtering is performed on the whole dataset, and second, the left-out CV folds are used for measuring performance. Left-out CV folds are known to give biased estimates of performance because the tuning parameters are ‘learnt’ by optimising the result on the left-out CV folds.
In scheme B, CV is used to tune parameters and a hold-out set is used to measure performance, but information leakage occurs because filtering is applied to the whole dataset. Unfortunately, this is commonly seen in studies which apply differential expression analysis to the whole dataset to select predictors that are then passed to machine learning algorithms.
Figures C & D below show two valid methods for fitting a model, using CV to tune parameters while giving unbiased estimates of model performance. Figure C is a traditional hold-out test set, with the dataset partitioned 2/3 training, 1/3 test. The critical difference from scheme B above is that the filtering is performed only on the training set and not on the whole dataset.
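Reusing x, y and trainSet from the simulated example above, scheme C looks like the following: the t-test filter is applied to the training partition only, and the held-out third is used purely for testing (unbiased, but the test set is small).

## Scheme C: filter on the training partition only, then test on hold-out data
filt_train <- ttest_filter(y[trainSet], x[trainSet, ], nfilter = 100)
fit_c <- cv.glmnet(x[trainSet, filt_train], y[trainSet], family = "binomial")
## Predict on the held-out samples using the same selected columns
predy_c <- as.vector(predict(fit_c, newx = x[-trainSet, filt_train],
                             s = "lambda.min", type = "class"))
predyp_c <- as.vector(predict(fit_c, newx = x[-trainSet, filt_train],
                              s = "lambda.min", type = "response"))
predSummary(data.frame(testy = y[-trainSet], predy = predy_c, predyp = predyp_c))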
Figure D shows the scheme for fully nested cross-validation, as implemented by nestcv.glmnet() in the example above. Note that filtering is applied separately within each outer CV training fold. The key advantage of nested CV is that the outer CV test folds are collated, giving an improved estimate of performance compared to scheme C because the total number of samples used for testing is larger.