This vignette shows the double cross-validation workflow in
gaQSAR. This workflow is more strict than a single
train/test split. It separates model selection from model evaluation by
using an outer cross-validation loop.
The outer loop leaves out part of the data for testing. Inside each
training fold, gaVariableSelection() is used to select
descriptors. The left-out outer fold is then predicted. Repeating this
over all outer folds gives an external cross-validated estimate of
predictive performance.
This is the workflow:
gaDoubleCrossValidation() for several numbers of
predictors;The code below is based on the AquaticTox example analysis. Set
smokeTest <- TRUE for a short run.
data(AquaticTox)
qsarLabel <- "AquaticTox"
qsarData <- cbind(
activity = AquaticTox_Outcome$Activity,
AquaticTox_moe2D[, -1],
AquaticTox_moe3D[, -1]
)
missingColumns <- which(colSums(is.na(qsarData)) != 0)
if (length(missingColumns) > 0) {
qsarData <- qsarData[, -missingColumns]
}
xAll <- as.matrix(qsarData[, -1, drop = FALSE])
yAll <- qsarData[, 1]All compounds are used in the double cross-validation. The external validation is created inside the outer cross-validation loop.
smokeTest <- FALSE
permutationTest <- TRUE
nWorkers <- 1L
numVars <- 1:10
theSeeds <- 1:5
gaSettings <- list(
pMutation = 0.2,
pCrossover = 0.7,
popSize = 300,
maxIter = 300,
interval = 50,
elitism = 30,
crossoverType = "gaintegerOnePointCrossover"
)
outerK <- 5
outerSeed <- 1
if (smokeTest) {
gaSettings$popSize <- 25
gaSettings$maxIter <- 10
gaSettings$elitism <- 2
gaSettings$interval <- 5
numVars <- c(3, 5, 7)
theSeeds <- 1:2
outerK <- 3
}The vector numVars defines the model sizes to compare.
For each model size, a full double cross-validation is run.
The outerSeed fixes the outer fold split. The
theSeeds vector controls repeated GA runs within the
training part of each outer fold.
baseArgs <- list(
x = xAll,
y = yAll,
outerMethod = "kfold",
outerK = outerK,
seed = outerSeed,
popSize = gaSettings$popSize,
pMutation = gaSettings$pMutation,
pCrossover = gaSettings$pCrossover,
crossoverFunc = gaSettings$crossoverType,
elitism = gaSettings$elitism,
maxIter = gaSettings$maxIter,
interval = gaSettings$interval,
seeds = theSeeds,
verbose = TRUE
)
fitOneModelSize <- function(numberOfVariables) {
args <- baseArgs
args$numberOfVariables <- numberOfVariables
do.call(gaQSAR::gaDoubleCrossValidation, args)
}
output <- lapply(numVars, fitOneModelSize)
names(output) <- paste0("p", numVars)Each element of output is a gaQSAR_dcv
object. The object contains the results for all outer folds for one
fixed number of predictors.
The double cross-validation workflow can be slow. In a real analysis,
the run over numVars can be parallelized.
library(future)
library(future.apply)
useFuture <- nWorkers > 1L
if (useFuture) {
oldPlan <- future::plan()
on.exit(future::plan(oldPlan), add = TRUE)
future::plan(future::multisession, workers = nWorkers)
output <- future.apply::future_lapply(
X = numVars,
FUN = fitOneModelSize,
future.seed = TRUE,
future.packages = "gaQSAR",
future.chunk.size = 1
)
names(output) <- paste0("p", numVars)
}The parallelization above is over model sizes. Depending on the computer and the data size, other choices may be faster or slower. Do not use more workers than the machine can handle comfortably.
trainingPlot <- createDCVTrainingMetricsPlot(
output,
metrics = c("R2", "R2adj", "Q2"),
includeOuterQ2 = TRUE
)
print(trainingPlot)This plot compares the model sizes. The training metrics describe the models fitted inside the training folds. The outer Q2 is the more important value for prediction, because it is based on compounds that were left out in the outer loop.
outerQ2Values <- vapply(output, function(object) object$outerQ2, numeric(1))
bestIdx <- which.max(outerQ2Values)
bestObj <- output[[bestIdx]]
cat(sprintf("Selected model size: %d predictors\n", numVars[bestIdx]))
cat(sprintf("Outer Q2: %.4f\n", bestObj$outerQ2))This selects the model size with the highest outer Q2. That is a reasonable default, but not a law. If two model sizes perform about the same, the smaller or more interpretable model is preferable.
In a manuscript, it is often better to show the full pattern over model sizes instead of only reporting the maximum.
The summary gives the main results of the selected double cross-validation run. The plot method can show several diagnostics, including selected descriptor frequency and object-level diagnostics.
williamsPlot <- createDCVWilliamsPlot(
bestObj,
label = "AquaticTox data",
colorBy = "fold",
aggregation = "none",
labelOutliers = "rowName"
)
print(williamsPlot + ggplot2::facet_wrap(~ fold))The Williams plot combines the fold-specific training diagnostics stored in the DCV object. It is useful for checking whether the fitted models repeatedly identify high-leverage or high-residual compounds. The outer-fold predictive performance is summarized separately by the outer Q2 and the stored outer predictions. Faceting by fold is useful when the goal is to see whether outlying behaviour is concentrated in one split or appears repeatedly.
As with all Williams plots, high leverage is not automatically a mistake. It means that the object is far from the centre of the descriptor space used by the model.
This plot is useful for checking whether the genetic algorithm had enough iterations to stabilize. If the best fitness is still improving near the end, more iterations may be needed.
if (permutationTest) {
nPermutations <- if (smokeTest) 20 else 100
permutationResult <- gaPermutationTest(
bestObj,
x = xAll,
nPermutations = nPermutations,
seed = 1,
validateSettings = TRUE,
verbose = TRUE,
workers = nWorkers
)
print(plot(permutationResult))
summary(permutationResult)
}The permutation test repeats the analysis after scrambling the response. This checks whether the observed performance is clearly better than what is obtained after breaking the relation between descriptors and response.
The permutation test can be expensive for double cross-validation.
Use a small value of nPermutations only for testing the
code. Use a larger value for a final analysis.
The double cross-validation workflow is the safer route when the aim is to estimate predictive performance after model selection. It is slower than the train/test workflow, but it avoids putting too much weight on one single split.
The result is a list of gaQSAR_dcv objects, one for each
model size. The model sizes can then be compared by outer Q2, selection
stability and diagnostic plots.