MetaClean is a package for building classifiers to identify low quality integrations in untargeted metabolomics data. It uses a combination of 11 peak quality metrics and 8 potential machine learning algorithms to build predictive models using user provided chromatographic data and associated labels. Once a predictive model has been built, it can be used to assign predictive labels and class probabilities to untargeted metabolomics datasets. The package is designed for use with the preprocessing package XCMS and can be easily integrated into existing untargeted metabolomics pipelines.
MetaClean has two main use cases: (1) Training a Classifier Using User-Provdied Data and (2) Using Existing Models to Make Predictions. This tutorial will walk the user through the steps for each.
!!IMPORTANT!! While any version of XCMS's peak-picking, retention time correction, and grouping functions may be utilized, this package requires the user to provide two objects produced by the getEIC() and fillPeaks() functions. These functions require objects of the “xcmsSet” class which was replaced by the “XCMSnExp” class. If using the newest functions provided by XCMS, please convert the “XCMSnExp” object to an “xcmsSet” object using as(XCMSnExp_object, “xcmsSet”).
This section explains how to build a classifier using data provided by the user. It is recommended that users create classifiers specific to the mode, matrix, and instrumentation of the dataset for those methodologies that are frequently utilized by the lab. For example, if a lab often runs plasma samples in Reverse Phase Negative Mode using the same column and the same instrument, a classifier can be built once using a dataset gerneated with this method and then saved and used again and again to make predictions for every additional run that uses this methodology.
Before the classifier can be trained the user must invest some time in visually assesing and labeling at least two datasets: a development dataset to be used for training and at least one test dataset to be used for evaluating the performance of the classifiers and selecting the best model. It is recommended that the user perform the following steps to prepare the development and test datasets:
The following sections provide detailed explanations of the steps required in training a classifier using user provided data:
NOTE: The * denotes sections that require additional example data not included in the MetaClean package. Users can either provide their own fill and xcmsEIC objects or download and install the data package
MetaCleanData using the following code:
## UNCOMMENT THIS SECTION IF YOU WISH TO USE THE MetaCleanData DATA PACKAGE # # install devtools if not already installed # install.packages("devtools") # install the data package MetaCleanData from github # devtools::install_github("KelseyChetnik/MetaCleanData") # load MetaCleanData library # library(MetaCleanData)
MetaCleanData provides example data for development and test sets.
!!!IMPORTANT!!! If the user uses rsdFilter() for the development dataset all testing datasets should also be filtered before evaluated on the trained model.
Before generating training data, the user can optionally filter out EICs by RSD % using the rsdFilter() function.
The rsdFilter() function takes the following arguments:
The function returns a peak table containing those EICs for which RSD is below the user-provided threshold. The following is an example of how to use the rsdFilter() function:
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED GITHUB PACKAGE MetaCleanData # # load the example input data # # example development data # data("group_development") # data("covar_development") # # # example test data # data("group_test") # data("covar_test") # # peak_table_development <- peakTable(group_development) # peak_table_development <- cbind(EICNo=1:nrow(peak_table_development), peak_table_development) # development_rsd_names <- as.character(covar_development[covar_development$SampleType=="LQC","FileNames"]) # filtered_peak_table_development <- rsdFilter(peakTable = peak_table_development, # eicColumn = "EICNo", # rsdColumns = development_rsd_names, # rsdThreshold = 0.3) # # peak_table_test <- peakTable(group_test) # peak_table_test <- cbind(EICNo=1:nrow(peak_table_test), peak_table_test) # test_rsd_names <- as.character(covar_test[covar_test$SampleType=="LQC","FileNames"]) # filtered_peak_table_test <- rsdFilter(peakTable = peak_table_test, # eicColumn = "EICNo", # rsdColumns = test_rsd_names, # rsdThreshold = 0.3)
After rsd filtering, the user can then proceed with randomly selecting and labeling EICs for training and then proceed with the steps outlined below.
Note: All example data used below was made WITHOUT RSD filtering.
To train a new classifier using
MetaClean, the user must provide these three data files for each dataset:
NOTE: The same group object must be used to produce both the xcmsEIC and fill objects.
The following are examples of the required xcmsEIC and fill objects:
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED GITHUB PACKAGE MetaCleanData # # load the example input data # # example development data # data("eic_labels_development") # data("fill_development") # data("xs_development") # # example test data # data("fill_test") # data("xs_test")
These examples will be utilized throughout the remainder of the tutorial.
getEvalObj is called to extract the relevant data from the three objects provided by ther user and store them in an object of class
evalObj. This function takes the following arguments:
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData # # call getEvalObj on development data # eicEval_development <- getEvalObj(xs = xs_development, fill = fill_development) # # # call getEvalObj on test data # eicEval_test <- getEvalObj(xs = xs_test, fill = fill_test)
evalObj has three slots:
getPeakQualityMetrics uses the
evalObj objects to calculate each of the 11 peak quality metrics. These metrics are: Apex Max-Boundary Ratio, Elution Shift, FWHM2Base, Jaggedness, Modality, Retention-Time Consistency, Symmetry, Gaussian Similarity, Sharpness, Triangle Peak Area Similarity Ratio (TPASR), and Zig-Zag Index. See (our paper) for a description of each metric.
This function takes the following arguments:
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData # # calculate peak quality metrics for development dataset # # For 500 peaks and 89 samples, takes ~2.3 mins # pqm_development <- getPeakQualityMetrics(eicEvalData = eicEval_development, eicLabels_df = eic_labels_development) # # # calculate peak quality metrics for test dataset # # For 500 peaks and 100 samples, takes ~2.6 mins # pqm_test <- getPeakQualityMetrics(eicEvalData = eicEval_test)
getPeakQualityMetrics function returns an Mx13 or Mx12 matrix where M is equal to the number of peaks. There are 13 or 12 columns in total (depending on whether there are labels provided), including one column for each of the eleven metrics, one column for EIC numbers, and (optionally) one column for the class label.
This matrix serves as the input for the training the classifiers and making predictions.
MetaClean provides 8 classification algorithms (implemented with the R package
caret) for building a predictive model. These are: Decision Tree, Naive Bayes, Logistic Regression, RandomForest, SVM_Linear, AdaBoost, Neural Netowrk, and Model-Averaged Neural Networks. The
runCrossValidation function is a wrapper function that uses cross-validation to train a user-selected subset of the 8 available algorithms. It takes the following arguments:
## IF YOU HAVE INSTALLED MetaCleanData YOU CAN COMMENT OUT THIS CODE AND PROCEED WITH THE PEAK QUALITY METRIC TABLES GENERATED IN THE PREVIOUS SECTIONS # data("pqm_development") # data("pqm_test")
runCrossValidation returns a list of lists. The outer list has one entry for every model trained. The inner list has two entries: the trained model and the name of the model trained.
Once the potential models have been trained, the next step is to evaluate the performance of each to determine which is the best performing and should be selected as the classifier. To do this, we first generate the seven available evaluation measures: Pass_FScore, Pass_Precision, Pass_Recall, Fail_FScore, Fail_Precision, Fail_Recall, and Accuracy. We use the
getEvaluationMeasures function to do this. This function takes the following arguments:
# calculate all seven evaluation measures for each model and each round of cross-validation # evalMeasuresDF <- getEvaluationMeasures(models=models, k=5, repNum=10)
getEvaluationMeasures returns a dataframe with the following columns: Model, RepNum, Pass_FScore, Pass_Recall, Pass_Precision, Fail_FScore, Fail_Recall, Fail_Precision, and Accuracy. The rows of the dataframe will correspond to the results of a particular model and a particular round of cross-validation.
The evaluation measures data frame can be used to assess the performance of each of the algorithms. The most convenient way to compare the performance of each is with visualizations.
MetaClean provides a simple wrapper function to easily generate bar plot visualizations of the evaluation measures for each model. This is
getBarPlots which plots bar plots comparing each model across each of the evaluation measures. This function takes the following argument:
NOTE: When “All” is selected for emNames, the bar plots are returned in the same order as the names listed in the description.
# generate bar plots for every # barPlots <- getBarPlots(evalMeasuresDF, emNames="All") # # plot(barPlots$M11$Pass_FScore) # Pass_FScore # plot(barPlots$M11$Fail_FScore) # Fail_FScore # plot(barPlots$M11$Accuracy) # Accuracy
These plots can help the user select the best perfomring classifier for the data. Of course, the user can also employ their own statistical tests on the evalMeasuresDF itself to determine which class
Once the best perfomring model has been selected, the user can train the algorithm using all of the available training data and the optimized hyperparameters for the algorithm determined by training, to create the final classifier. The hyperparamters can be found in the “pred” data frame returned with the model by
runCrossValidation. The user must provide the hyperparamters as a data frame with the hyperparameter names as columns, as shown below.
# example of optimized hyperparameters for best performing model AdaBoost M11 # hyperparameters here are nIter = 150 and method = "Adaboost.M1" # View(models$AdaBoost_M11$pred) # best performing model for example development set, rand.seed = 453, k = 5, repNum = 10 is AdaBoost # hyperparameters <- models$AdaBoost_M11$pred[,c("nIter", "method")] # hyperparameters <- unique(hyperparameters) # # metaclean_model <- trainClassifier(trainData = pqm_development, # model = "AdaBoost", # metricSet = "M11", # hyperparameters = hyperparameters)
This classifier can then be saved to a directory specified by the user so it can used as many times as desired.
# uncomment the lines below and add path where you want to save trained model # model_file <- "" # saveRDS(metaclean_model, file=model_file)
The user can load any previously trained model (including the trained model from our publication available with package) and use it to make predictions on new data.
To load a prediction model, simply provide the path and use the base function readRDS().
As an example, users who have downloaded the data package MetaCleanData can use the pre-trained model included with that package:
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData # # load model from MetaCleanData # data(example_model)
The user can then make predictions using this model using the predict() function from
caret as seen below:
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData # mc_predictions <- getPredicitons(model = example_model, # testData = pqm_test, # eicColumn = "EICNo")