Double cross-validation workflow with gaQSAR

This vignette shows the double cross-validation workflow in gaQSAR. This workflow is more strict than a single train/test split. It separates model selection from model evaluation by using an outer cross-validation loop.

The outer loop leaves out part of the data for testing. Inside each training fold, gaVariableSelection() is used to select descriptors. The left-out outer fold is then predicted. Repeating this over all outer folds gives an external cross-validated estimate of predictive performance.

This is the workflow:

  1. prepare the QSAR data;
  2. run gaDoubleCrossValidation() for several numbers of predictors;
  3. compare model sizes using training metrics and outer Q2;
  4. inspect selected descriptors and their stability;
  5. inspect the Williams plot across outer folds;
  6. select a model size;
  7. optionally run a permutation test.

The code below is based on the AquaticTox example analysis. Set smokeTest <- TRUE for a short run.

Load packages and helper function

library(gaQSAR)
library(QSARdata)

timestampFileName <- function(fileName, time = Sys.time()) {
  paste0(format(time, "%Y%m%d_%H%M%S_"), fileName)
}

Prepare the data

data(AquaticTox)

qsarLabel <- "AquaticTox"

qsarData <- cbind(
  activity = AquaticTox_Outcome$Activity,
  AquaticTox_moe2D[, -1],
  AquaticTox_moe3D[, -1]
)

missingColumns <- which(colSums(is.na(qsarData)) != 0)

if (length(missingColumns) > 0) {
  qsarData <- qsarData[, -missingColumns]
}

xAll <- as.matrix(qsarData[, -1, drop = FALSE])
yAll <- qsarData[, 1]

All compounds are used in the double cross-validation. The external validation is created inside the outer cross-validation loop.

Choose settings

smokeTest <- FALSE
permutationTest <- TRUE
nWorkers <- 1L

numVars <- 1:10
theSeeds <- 1:5

gaSettings <- list(
  pMutation = 0.2,
  pCrossover = 0.7,
  popSize = 300,
  maxIter = 300,
  interval = 50,
  elitism = 30,
  crossoverType = "gaintegerOnePointCrossover"
)

outerK <- 5
outerSeed <- 1

if (smokeTest) {
  gaSettings$popSize <- 25
  gaSettings$maxIter <- 10
  gaSettings$elitism <- 2
  gaSettings$interval <- 5
  numVars <- c(3, 5, 7)
  theSeeds <- 1:2
  outerK <- 3
}

The vector numVars defines the model sizes to compare. For each model size, a full double cross-validation is run.

The outerSeed fixes the outer fold split. The theSeeds vector controls repeated GA runs within the training part of each outer fold.

Run double cross-validation

baseArgs <- list(
  x = xAll,
  y = yAll,
  outerMethod = "kfold",
  outerK = outerK,
  seed = outerSeed,
  popSize = gaSettings$popSize,
  pMutation = gaSettings$pMutation,
  pCrossover = gaSettings$pCrossover,
  crossoverFunc = gaSettings$crossoverType,
  elitism = gaSettings$elitism,
  maxIter = gaSettings$maxIter,
  interval = gaSettings$interval,
  seeds = theSeeds,
  verbose = TRUE
)

fitOneModelSize <- function(numberOfVariables) {
  args <- baseArgs
  args$numberOfVariables <- numberOfVariables
  do.call(gaQSAR::gaDoubleCrossValidation, args)
}

output <- lapply(numVars, fitOneModelSize)
names(output) <- paste0("p", numVars)

Each element of output is a gaQSAR_dcv object. The object contains the results for all outer folds for one fixed number of predictors.

Optional parallel execution

The double cross-validation workflow can be slow. In a real analysis, the run over numVars can be parallelized.

library(future)
library(future.apply)

useFuture <- nWorkers > 1L

if (useFuture) {
  oldPlan <- future::plan()
  on.exit(future::plan(oldPlan), add = TRUE)
  future::plan(future::multisession, workers = nWorkers)

  output <- future.apply::future_lapply(
    X = numVars,
    FUN = fitOneModelSize,
    future.seed = TRUE,
    future.packages = "gaQSAR",
    future.chunk.size = 1
  )
  names(output) <- paste0("p", numVars)
}

The parallelization above is over model sizes. Depending on the computer and the data size, other choices may be faster or slower. Do not use more workers than the machine can handle comfortably.

Compare model sizes

trainingPlot <- createDCVTrainingMetricsPlot(
  output,
  metrics = c("R2", "R2adj", "Q2"),
  includeOuterQ2 = TRUE
)

print(trainingPlot)

This plot compares the model sizes. The training metrics describe the models fitted inside the training folds. The outer Q2 is the more important value for prediction, because it is based on compounds that were left out in the outer loop.

Select a model size

outerQ2Values <- vapply(output, function(object) object$outerQ2, numeric(1))
bestIdx <- which.max(outerQ2Values)
bestObj <- output[[bestIdx]]

cat(sprintf("Selected model size: %d predictors\n", numVars[bestIdx]))
cat(sprintf("Outer Q2: %.4f\n", bestObj$outerQ2))

This selects the model size with the highest outer Q2. That is a reasonable default, but not a law. If two model sizes perform about the same, the smaller or more interpretable model is preferable.

In a manuscript, it is often better to show the full pattern over model sizes instead of only reporting the maximum.

Inspect the selected model size

summary(bestObj)
plot(bestObj, type = "all")

The summary gives the main results of the selected double cross-validation run. The plot method can show several diagnostics, including selected descriptor frequency and object-level diagnostics.

Williams plot across outer folds

williamsPlot <- createDCVWilliamsPlot(
  bestObj,
  label = "AquaticTox data",
  colorBy = "fold",
  aggregation = "none",
  labelOutliers = "rowName"
)

print(williamsPlot + ggplot2::facet_wrap(~ fold))

The Williams plot combines the fold-specific training diagnostics stored in the DCV object. It is useful for checking whether the fitted models repeatedly identify high-leverage or high-residual compounds. The outer-fold predictive performance is summarized separately by the outer Q2 and the stored outer predictions. Faceting by fold is useful when the goal is to see whether outlying behaviour is concentrated in one split or appears repeatedly.

As with all Williams plots, high leverage is not automatically a mistake. It means that the object is far from the centre of the descriptor space used by the model.

Best fitness plot

fitnessPlot <- createBestFitnessPlot(bestObj)
print(fitnessPlot)

This plot is useful for checking whether the genetic algorithm had enough iterations to stabilize. If the best fitness is still improving near the end, more iterations may be needed.

Permutation test

if (permutationTest) {
  nPermutations <- if (smokeTest) 20 else 100

  permutationResult <- gaPermutationTest(
    bestObj,
    x = xAll,
    nPermutations = nPermutations,
    seed = 1,
    validateSettings = TRUE,
    verbose = TRUE,
    workers = nWorkers
  )

  print(plot(permutationResult))
  summary(permutationResult)
}

The permutation test repeats the analysis after scrambling the response. This checks whether the observed performance is clearly better than what is obtained after breaking the relation between descriptors and response.

The permutation test can be expensive for double cross-validation. Use a small value of nPermutations only for testing the code. Use a larger value for a final analysis.

Save results

save(output, file = timestampFileName(paste0(qsarLabel, "NestedCV_Results.Rdata")))

if (exists("permutationResult")) {
  save(
    permutationResult,
    file = timestampFileName(paste0(qsarLabel, "NestedCV_PermutationResults.Rdata"))
  )
}

Summary

The double cross-validation workflow is the safer route when the aim is to estimate predictive performance after model selection. It is slower than the train/test workflow, but it avoids putting too much weight on one single split.

The result is a list of gaQSAR_dcv objects, one for each model size. The model sizes can then be compared by outer Q2, selection stability and diagnostic plots.