This vignette shows the train/test workflow used in
gaQSAR. The workflow is useful when a separate test set is
available, or when the user wants a quick first analysis before moving
to double cross-validation.
The main idea is simple. First split the compounds into a training
set and a test set. Then run the genetic algorithm for several model
sizes. Each run gives one gaQSAR object. These objects are
collected in a list, so that the models can be compared by their
cross-validated Q2 and their performance on the external test set.
This is the workflow:
gaVariableSelection() for several numbers of
predictors;predictOOBObjects();createQ2Plot();createWilliamsPlot();The examples below use the AquaticTox data from the
QSARdata package. For a short test run, set
smokeTest <- TRUE.
library(gaQSAR)
library(QSARdata)
timestampFileName <- function(fileName, time = Sys.time()) {
paste0(format(time, "%Y%m%d_%H%M%S_"), fileName)
}The timestamp helper is not needed by gaQSAR itself. It
is only used here to save result files without overwriting earlier
runs.
data(AquaticTox)
qsarLabel <- "AquaticTox"
qsarData <- cbind(
activity = AquaticTox_Outcome$Activity,
AquaticTox_moe2D[, -1],
AquaticTox_moe3D[, -1]
)
missingColumns <- which(colSums(is.na(qsarData)) != 0)
if (length(missingColumns) > 0) {
qsarData <- qsarData[, -missingColumns]
}The first column is the response variable. The remaining columns are molecular descriptors. In this example, descriptors with missing values are removed before modelling. For other data sets, a different preprocessing step may be more appropriate.
smokeTest <- FALSE
permutationTest <- TRUE
nWorkers <- 1L
numVars <- 2:7
theSeeds <- 1:5
gaSettings <- list(
pMutation = 0.2,
pCrossover = 0.7,
popSize = 100,
maxIter = 300,
interval = 50,
elitism = 3,
crossoverType = "gaintegerOnePointCrossover",
KSpercentage = 0.95
)
if (smokeTest) {
gaSettings$popSize <- 25
gaSettings$maxIter <- 10
gaSettings$elitism <- 2
gaSettings$interval <- 5
numVars <- c(3, 5, 7)
theSeeds <- 1:2
}The vector numVars defines the model sizes to compare.
For example, numVars <- 2:7 fits models with 2, 3, 4, 5,
6 and 7 descriptors.
The seeds argument is used because the genetic algorithm
is stochastic. Running several seeds gives a more stable result than
relying on a single GA run.
splitupMolecules <- splitUp(qsarData, method = "KS", pc = gaSettings$KSpercentage)
xTrain <- as.matrix(qsarData[splitupMolecules$model, -1, drop = FALSE])
yTrain <- qsarData[splitupMolecules$model, 1]
xTest <- as.matrix(qsarData[splitupMolecules$test, -1, drop = FALSE])
yTest <- qsarData[splitupMolecules$test, 1]The Kennard-Stone split is used here to select a training set that
covers the descriptor space. With KSpercentage = 0.95, the
external test set is small. That can be useful for a first analysis on
small data sets, but it should not be oversold as a strong external
validation.
baseArgs <- list(
x = xTrain,
y = yTrain,
popSize = gaSettings$popSize,
pMutation = gaSettings$pMutation,
pCrossover = gaSettings$pCrossover,
crossoverFunc = gaSettings$crossoverType,
elitism = gaSettings$elitism,
maxIter = gaSettings$maxIter,
interval = gaSettings$interval,
seeds = theSeeds,
verbose = TRUE
)
fitOneModelSize <- function(numberOfVariables) {
args <- baseArgs
args$numberOfVariables <- numberOfVariables
do.call(gaQSAR::gaVariableSelection, args)
}
output <- lapply(numVars, fitOneModelSize)
names(output) <- paste0("p", numVars)Each element of output is a gaQSAR object.
The list structure is important, because the plotting and comparison
functions work on a set of candidate models.
The same loop can be run in parallel with future.apply.
This is useful for real analyses, but it is not shown as the default in
this vignette because parallel code is less convenient for package
checks and examples.
library(future)
library(future.apply)
useFuture <- nWorkers > 1L
if (useFuture) {
oldPlan <- future::plan()
on.exit(future::plan(oldPlan), add = TRUE)
future::plan(future::multisession, workers = nWorkers)
output <- future.apply::future_lapply(
X = numVars,
FUN = fitOneModelSize,
future.seed = TRUE,
future.packages = "gaQSAR"
)
names(output) <- paste0("p", numVars)
}This adds external predictions and external validation statistics to each fitted model. The test set was not used during variable selection.
The Q2 plot is used to compare model sizes. The training-set Q2 is based on LOOCV. If test set predictions were added, the external Q2 is shown as well.
For small data sets, small differences in Q2 should not be treated as decisive. A smaller model with similar Q2 may be easier to interpret and less fragile.
output <- createWilliamsPlot(
output,
xTrain,
yTrain,
xTest,
yTest,
residualThreshold = 2.5
)
print(plot(output[[2]]))The Williams plot combines standardized residuals and leverages. It is mainly used to inspect the applicability domain and to find objects that behave differently from the rest of the data.
Objects with high leverage are not automatically wrong. They may be important compounds near the edge of the descriptor space. The plot should therefore be used as a diagnostic, not as a mechanical deletion rule.
Q2Values <- vapply(output, function(object) object$Q2Loocv, numeric(1))
bestIdx <- which.max(Q2Values)
bestObj <- output[[bestIdx]]
cat(sprintf("Selected model size: %d predictors\n", numVars[bestIdx]))
cat(sprintf("LOOCV Q2: %.4f\n", bestObj$Q2Loocv))
summary(bestObj)This selects the model with the highest LOOCV Q2. That is a practical rule, but not the only possible rule. When several model sizes perform similarly, the smaller model is often a reasonable choice.
if (permutationTest) {
nPermutations <- if (smokeTest) 20 else 100
permutationResult <- gaPermutationTest(
bestObj,
x = xTrain,
y = yTrain,
nPermutations = nPermutations,
seed = 1,
validateSettings = TRUE,
verbose = TRUE,
workers = nWorkers
)
print(plot(permutationResult))
summary(permutationResult)
}The permutation test scrambles the response and repeats the modelling procedure under the null situation where the descriptor matrix and the response have no real relation.
In this workflow, the permutation test is applied to the selected
model size. It does not fully repeat the whole earlier choice over all
values of numVars. That distinction matters if the
permutation result is used as formal evidence.
The train/test workflow is useful for a direct QSAR analysis with an
external test set. The result is a list of fitted gaQSAR
objects, one for each model size. This list is then used for prediction,
Q2 plotting and Williams plots.
For a less biased estimate of predictive performance, especially when model size is also chosen from the data, use the double cross-validation workflow shown in the second vignette.