API Improvements:
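Taken together, the options described in this section could be used roughly as follows. This is an illustrative sketch only: the parameter names and values come from the notes below, but the object names (`df`, `model`, `new_data`) and exact call shapes are assumptions, not FastRet's documented signatures.

```r
# Illustrative sketch only: parameter names/values are from the notes below,
# but argument order and object names are assumptions.
library(FastRet)

# Add chemical descriptors, drop unsupported columns, add RT transformations:
df <- preprocess_data(df, add_cds = TRUE, rm_ucs = TRUE, rt_terms = TRUE)

# Train with default xgboost params and cross-validated performance estimates:
model <- train_frm(df, method = "gbtreeDefault", do_cv = TRUE)

# Clip predictions to the RT range of the training data, keep missing values:
preds <- predict(model, new_data, clip = TRUE, impute = FALSE)

# Adjust the model to a new column, using ridge instead of the old "lm":
adj <- adjust_frm(model, new_data, adj_type = "ridge", add_cds = TRUE,
                  seed = 42, do_cv = TRUE)

# Cluster with RT weighted like the strongest chemical descriptor:
sel <- selective_measuring(df, rt_coef = "max_ridge_coefficient")

# Return predictors of the base model only:
get_predictors(model, base = TRUE, adjust = FALSE)
```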
`getCDs()`:

`plot_frm()`:

`preprocess_data()`:

- New parameter `add_cds` to control whether chemical descriptors should be added to the input data using `getCDs()`.
- New parameter `rm_ucs` to control whether unsupported columns (i.e. columns that are neither mandatory nor optional) should be removed from the input data.
- New parameter `rt_terms` to control whether transformations of the RT column (square, cube, log, exp, sqrt) should be added to the input data.
- `CDFeatures` are allowed as optional columns.

`train_frm()`:
- New parameter `do_cv` to control whether cross-validation should be performed for performance estimation. Default is TRUE.
- `method` now accepts two values for training models with xgbtree base: "gbtreeDefault" (train xgboost with default params) and "gbtreeRP" (train xgboost with parameters optimized for the RP dataset). The old value "gbtree" still works and is now an alias for "gbtreeDefault".
- … (`frm` objects are fully specified now).
- Cross-validation predictions are now clipped via `clip_predictions()`. Of course, the clipping is always based on the RT range of the training folds, not the whole original training data.

`predict.frm()`:
- New data is now transformed automatically for models trained with `degree_polynomial` > 1 and/or `interaction_terms = TRUE`, unless the transformations were manually applied to the new data beforehand.
- New parameter `clip` to allow clipping of predictions to be within the RT range of the training data. Works for both adjusted and unadjusted models. Default is `clip = FALSE`. See `clip_predictions()` for details.
- Imputation of missing values can be disabled via `impute = FALSE` in `predict.frm()`.

`selective_measuring()`:
- New parameter `rt_coef`, allowing the user to control the influence of RT on the clustering. A value of 0 means that RT is ignored, a value of "max_ridge_coefficient" means that RT has the same weight as the most important chemical descriptor, and a value of 1 means no scaling at all (except standardization to z-scores, which is applied to the whole dataset before the ridge regression is trained).

`adjust_frm()`:
- New parameter `seed` to allow reproducible results.
- New parameter `do_cv` to control whether cross-validation should be performed for performance estimation. Default is TRUE.
- New parameter `adj_type` to control which model should be trained for adjustment: supported options are "lm", "lasso", "ridge", or "gbtree". Previously, only "lm" was supported. To stay backwards compatible, the default is "lm".
- New parameter `add_cds` to control whether chemical descriptors should be added to the input data using `getCDs()`. Only recommended for `adj_type` other than "lm".
- Cross-validation predictions are now clipped via `clip_predictions()`. Of course, the clipping is always based on the RT range of the training folds, not the whole original training data.

`print.frm()`:
`clip_predictions()`:

- New function, used internally by `train_frm()`, `predict.frm()` and `adjust_frm()`.

`get_predictors()`:

- New parameters `base` and `adjust` to control whether predictors for the base model, the adjustment model, or both should be returned.

Bugfixes:
- Interaction terms in `preprocess_data()` are now generated correctly as the product of the involved features instead of a division. This follows common practice in regression modeling and avoids division-by-zero issues. Passing older models, trained with division-based interaction terms, to downstream functions like `predict.frm()` or `adjust_frm()` will now lead to an error. (This is not a breaking change, as `predict.frm()` and friends have in fact never been able to handle such models.)
- `plot_frm()` with type "scatter.cv.adj" or "scatter.train.adj" now correctly shows retention times from the new data (used for model adjustment) as x-axis values instead of the original training retention times.
- `catf()` now only emits escape codes (i.e. colored output) if the output is directed to a terminal. If the output is redirected to a file or a pipe, no escape codes are emitted anymore. Since `catf()` is used throughout the package for logging, this fixes the output for the whole package.

Internal Improvements:
- `adjust_frm()`, `fit_gbtree()`, `fit_glmnet()`, `get_param_grid()`, `get_predictors()`, `getCDs()`, `plot_frm()`, `predict_frm()`, `preprocess_data()`, `selective_measuring()`, `train_frm()`, `validate_inputdata()`
- Removed the `caret` dependency by adding custom implementations for: `createFolds()`, `nearZeroVar()`
- Moved the dataframe-merging logic of `adjust_frm()` into a private function `merge_dfs()`.
- Replaced `fit_glmnet()`, `fit_lasso()` and `fit_ridge()` with a single function `fit_glmnet()` that takes the method ("lasso" or "ridge") as a parameter. Instead of a dataframe `df` that has to contain only predictors plus the RT column (as response), the function now takes a matrix of predictors `X` and a vector of responses `y`. This makes the function more flexible and easier to test.
- Replaced `fit_gbtree_grid()` with a much simpler function `find_params_best()`. Instead of allowing the specification of every grid parameter, the new function accepts a keyword `searchspace` for specifying predefined grids to choose from.
- Reworked `fit_gbtree()` by exposing lots of hardcoded internal xgboost parameters as function parameters with sensible defaults. In particular, the user can now set `xpar` to "default", "rpopt" or a predefined grid size to train the model with different hyperparameter settings. Furthermore, the function is now written in a way that works with both version 1.7.9.1 and the new 3.1.2.1 version published on 2025/12/03 (yes, version 2.x was skipped completely).
- Added `get_param_grid()` for returning predefined hyperparameter grids for xgboost model training based on keywords like "tiny", "small" or "large".
- Added `benchmark_find_params()` to benchmark the runtime of `find_params_best()` for different numbers of cores and/or threads. As it turns out, choosing a higher number of cores is usually more efficient (at the cost of worse progress output).
- `named()`, `as_str()`, `is_valid_smiles()` and `as_canonical()`
- Improved `selective_measuring()` by aligning glmnet coefficients to columns by name (more stable) and by including RT, scaled by `max(abs(coefs))`, in the PAM clustering.
- Added `libwebp-dev` as a dependency to the Dockerfile.
- Added `Measurements_v8.xlsx` to `inst/extdata/`. The new list contains corrections to the old RP dataset plus 1660 new measurements, measured on a total of 18 different chromatographic environments.
- Added `seed` parameter to `selective_measuring()` for reproducible clustering results
- `train_frm()` function
- `digest` and `shinybusy` dependencies
- `inst/mockdata/`
- `getCDsFor1Molecule()`, `get_cache_dir()`, `ram_cache` (these were exported, but declared as internal)
- `parLapply2`
- Improved `read_retip_hilic_data()`: the dataset is now only downloaded from GitHub if the package is not installed. If it is installed, the dataset is loaded directly.
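The consolidated `fit_glmnet()` interface described above (predictor matrix `X`, response vector `y`, method keyword) might look roughly like this. The body is an assumption based on glmnet's standard API (`alpha = 1` for lasso, `alpha = 0` for ridge), not FastRet's actual implementation:

```r
# Sketch (assumed implementation) of a single fit_glmnet() replacing
# fit_lasso() and fit_ridge(): the method keyword just selects alpha.
fit_glmnet_sketch <- function(X, y, method = c("lasso", "ridge")) {
  method <- match.arg(method)
  alpha <- if (method == "lasso") 1 else 0
  glmnet::cv.glmnet(as.matrix(X), y, alpha = alpha)  # CV picks lambda
}
```

With this shape, a unit test needs no dataframe scaffolding, e.g. `fit_glmnet_sketch(matrix(rnorm(400), 40), rnorm(40), "ridge")`.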
Internal Changes:
- `TODOS.md`
- `util.R` to `data.R`
- `misc/datasets`
- `load_all()` and `document()` to `util.R`
- Replaced the `xlsx` and `readxl` packages with `openxlsx`
- Added a cache cleanup handler that gets registered via `reg.finalizer()` upon package loading to ensure that the cache directory is removed if it doesn't contain any files that should persist between R sessions.
- Added an article about installation details incl. a troubleshooting section
- Improved function docs
- Improved examples by removing `donttest` blocks
- Improved examples & tests by using smaller example datasets to reduce runtime
- Moved `patch.R` from the `R` folder to `misc/scripts`, which is excluded from the package build using `.Rbuildignore`. The file is conditionally sourced by the private function `start_gui_in_devmode()` if available, allowing its use during development without including it in the package.
- Added `\value` tags to the mentioned `.Rd` files describing the functions' return values.
- Added Bonini et al. (2020) doi:10.1021/acs.analchem.9b05765 as a reference to the description part of the DESCRIPTION file, listing it as related work. This reference is used in the documentation for `read_retip_hilic_data()` and `ram_cache`. No additional references are used in the package documentation.
- Added Fadi Fadil as a contributor. Fadi measured the example datasets shipped with FastRet.
- Added ORCID IDs for contributors as described in CRAN's checklist for submissions.
- Wrapped the examples of `read_rp_xlsx()` and `read_rpadj_xlsx()` into `donttest` to prevent the note "Examples with CPU time > 2.5 times elapsed time: …". By now I'm pretty sure the culprit is the `xlsx` package, which uses a Java process for reading the file. Maybe we should switch to `openxlsx` or `readxl` in the future.
- Improved the example of `preprocess_data()` to prevent the note "Examples with CPU time > 2.5 times elapsed time: preprocess_data (CPU=2.772, elapsed=0.788)".
- `getCDs()`
- Added examples to `start_gui()`, `fastret_app()`, `getCDsFor1Molecule()`, `analyzeCDNames()`, `check_lm_suitability()`, `plot_lm_suitability()`, `extendedTask()`, `selective_measuring()`, `train_frm()`, `adjust_frm()`, `get_predictors()`
- Improved lots of existing examples
- Added additional logging messages at various places
- Submitted to CRAN, but rejected because the following examples caused at least one of the following notes on the CRAN testing machines: (1) "CPU time > 5s", (2) "CPU time > 2.5 times elapsed time". In this context, "CPU time" is calculated as the sum of the measured "user" and "system" times.
| function | user (s) | system (s) | elapsed (s) | ratio |
|---|---|---|---|---|
| check_lm_suitability | 5.667 | 0.248 | 2.211 | 2.675 |
| predict.frm | 2.477 | 0.112 | 0.763 | 3.393 |
| getCDs | 2.745 | 0.089 | 0.961 | 2.949 |
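One of the items above describes a cache cleanup handler registered via `reg.finalizer()` at package load time. A minimal base-R sketch of that pattern follows; the cache location (`tools::R_user_dir()`) and the "no files that should persist" check are placeholders, not FastRet's actual implementation:

```r
# Hypothetical sketch of a cache cleanup handler registered at load time.
# The cache location and the persistence check are placeholder assumptions.
.onLoad <- function(libname, pkgname) {
  sentinel <- new.env()
  reg.finalizer(sentinel, function(e) {
    dir <- tools::R_user_dir(pkgname, which = "cache")  # placeholder location
    if (dir.exists(dir) && length(list.files(dir, recursive = TRUE)) == 0) {
      unlink(dir, recursive = TRUE)  # only remove a cache dir with nothing to keep
    }
  }, onexit = TRUE)  # also run at session exit, not only at garbage collection
}
```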
- Completely refactored source code, e.g.:
  - Added a test suite covering all important functions
  - The UI now uses Extended Tasks for background processing, allowing GUI usage by multiple users at the same time
  - The clustering now uses Partitioning Around Medoids (PAM) instead of k-means, which is faster and much better suited for our use case
  - The training of the Lasso and/or XGBoost models is no longer done using `caret` but using `glmnet` and `xgboost` directly. The new implementation is much faster and allows for full control over the number of workers started.
  - `getCDs()` now caches its results on disk, making the retrieval of chemical descriptors much faster
  - The GUI now has a console element, showing the progress of background tasks like clustering and model training
  - The GUI has a cleaner interface, because many of the options are now hidden in the "Advanced" tab by default and are only displayed upon user request
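Two of the refactoring points above, PAM-based clustering and on-disk caching, can be illustrated with small self-contained sketches. The function names here are illustrative, not FastRet's internals:

```r
library(cluster)  # pam() ships with the recommended 'cluster' package

# PAM picks actual rows (medoids) as cluster representatives, so each
# representative is a concrete molecule that can be measured, unlike
# k-means centroids, which are artificial averages.
pick_representatives <- function(X, k) {
  cluster::pam(scale(X), k = k)$id.med  # row indices of the medoids
}

# Minimal on-disk cache in the spirit of the getCDs() caching: results are
# stored as .rds files keyed by name and reused on subsequent calls.
cached <- function(key, compute, dir = tempdir()) {
  path <- file.path(dir, paste0(key, ".rds"))
  if (file.exists(path)) return(readRDS(path))
  val <- compute()
  saveRDS(val, path)
  val
}
```

For example, `pick_representatives(as.matrix(mtcars), 3)` returns three row indices, and a second `cached("cds", slow_fn)` call skips `slow_fn` entirely.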
Initial version. Copy of commit cd243aa82a56df405df8060b84535633cf06b692 of Christian Amesöder's repository. (Christian wrote this initial version of FastRet as part of his master's thesis at the Institute of Functional Genomics, University of Regensburg.)