Using pre-trained context-specific deconvolution models

digitalDLSorteR can use the pre-trained, context-specific deconvolution models included in the digitalDLSorteRmodels R package (https://github.com/diegommcc/digitalDLSorteRmodels) to deconvolute new bulk RNA-seq samples from the same biological environment. This is the simplest way to use digitalDLSorteR: you only need to load into R a raw bulk RNA-seq count matrix with genes as rows (annotated as SYMBOL) and samples as columns, and select the desired model. The deconvolution is carried out by the deconvDDLSPretrained function, which by default normalizes the new samples to counts per million (CPMs), so the matrix must be provided as raw counts. Afterwards, the estimated cell composition of each sample can be explored as a bar chart with the barPlotCellTypes function.
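As a minimal sketch of the expected input layout and of what CPM normalization means, the following base R code builds a toy raw-count matrix (the gene symbols and counts are invented for illustration) and rescales each sample to counts per million. It only illustrates the concept: deconvDDLSPretrained applies its own normalization internally, whose details may differ, so real data should still be passed as raw counts.

```r
# Toy raw-count matrix: genes (SYMBOL) as rows, samples as columns.
# Gene names and counts are made up for illustration only.
counts <- matrix(
  c(10,  0,  5,
    90, 50, 20,
     0, 50, 75),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("EPCAM", "CD3E", "PTPRC"),
    c("Sample_1", "Sample_2", "Sample_3")
  )
)

# Basic sanity checks on the input: non-negative integer counts and
# unique gene symbols
stopifnot(
  is.numeric(counts),
  all(counts >= 0),
  all(counts == round(counts)),
  !anyDuplicated(rownames(counts))
)

# Counts per million: divide each column by its library size, times 1e6
lib.sizes <- colSums(counts)
cpm <- sweep(counts, 2, lib.sizes, "/") * 1e6

# After CPM normalization, every sample adds up to one million
colSums(cpm)
```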

Available models

So far, the available models cover only two biological environments: breast cancer and colorectal cancer. They are expected to deconvolute accurately only new samples from the same environment they were trained on.

Breast cancer models

There are two deconvolution models for breast cancer samples that differ in the level of specificity. Both have been trained using data from Chung et al. (2017) (GSE75688).

breast.chung.specific: this model considers 13 cell types, four of which are the intrinsic molecular subtypes of breast cancer (ER+, HER2+, ER+/HER2+ and TNBC); the rest are immune and stromal cells (Stromal, Monocyte, TCD4mem (memory CD4+ T cells), BGC (germinal center B cells), Bmem (memory B cells), DC (dendritic cells), Macrophage, TCD8 (CD8+ T cells) and TCD4reg (regulatory CD4+ T cells)).
breast.chung.generic: this model considers 7 cell types that are generic groups of the cell types considered by the specific version: B cells (Bcell), T CD4+ cells (TcellCD4), T CD8+ cells (TcellCD8), monocytes (Monocyte), dendritic cells (DCs), stromal cells (Stromal) and tumor cells (Tumor).

Colorectal cancer model

DDLS.colon.lee considers the following 22 cell types: Anti-inflammatory_MFs (macrophages), B cells, CD4+ T cells, CD8+ T cells, ECs (endothelial cells), ECs_tumor, Enterocytes, Epithelial cells, Epithelial_cancer_cells, MFs_SPP1+, Mast cells, Myofibroblasts, NK cells, Pericytes, Plasma_cells, Pro-inflammatory_MFs, Regulatory T cells, Smooth muscle cells, Stromal cells, T follicular helper cells, cDC (conventional dendritic cells), gamma delta T cells.

It was generated using data from Lee, Hong, Etlioglu, Cho et al. (2020) (GSE132465, GSE132257 and GSE144735). The genes used to train the model were defined as the intersection between the scRNA-seq dataset and bulk RNA-seq data from The Cancer Genome Atlas (TCGA) project (Koboldt et al. 2012; Ciriello et al. 2015), using digitalDLSorteR’s default parameters.
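The gene intersection mentioned above can be pictured with a small base R sketch. The two gene vectors are hypothetical; in practice they would be the row names of the scRNA-seq and bulk RNA-seq matrices. The same check is a reasonable way to see how many genes a new bulk matrix shares with a reference set before deconvolution.

```r
# Hypothetical gene symbols standing in for the row names of each dataset
genes.sc <- c("EPCAM", "CD3E", "PTPRC", "COL1A1", "MS4A1")
genes.bulk <- c("PTPRC", "EPCAM", "KRT20", "MS4A1")

# Genes present in both datasets (order follows the first argument)
shared <- intersect(genes.sc, genes.bulk)
shared

# Fraction of single-cell genes that are also found in the bulk data
length(shared) / length(genes.sc)
```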

Example using colorectal samples from the TCGA project

The following code chunk shows an example using the DDLS.colon.lee model and data from TCGA loaded from digitalDLSorteRdata:

suppressMessages(library("digitalDLSorteR"))
# to load pre-trained models
if (!requireNamespace("digitalDLSorteRmodels", quietly = TRUE)) {
  remotes::install_github("diegommcc/digitalDLSorteRmodels")
}
suppressMessages(library(digitalDLSorteRmodels))
# data for examples
if (!requireNamespace("digitalDLSorteRdata", quietly = TRUE)) {
  remotes::install_github("diegommcc/digitalDLSorteRdata")
}
suppressMessages(library("digitalDLSorteRdata"))
suppressMessages(library("dplyr"))
suppressMessages(library("ggplot2"))

Loading data

# loading model from digitalDLSorteRmodels and example data from digitalDLSorteRdata
data("DDLS.colon.lee")
data("TCGA.colon.se")

DDLS.colon.lee is a DigitalDLSorterDNN object containing the trained model as well as specific information about it, such as cell types considered, number of epochs used during training, etc.

DDLS.colon.lee

## Trained model: 60 epochs
##   Training metrics (last epoch):
##     loss: 0.113
##     accuracy: 0.6851
##     mean_absolute_error: 0.0131
##     categorical_accuracy: 0.6851
##   Evaluation metrics on test data:
##     loss: 0.0979
##     accuracy: 0.7353
##     mean_absolute_error: 0.0117
##     categorical_accuracy: 0.7353
##   Performance evaluation over each sample: MAE MSE

Here you can check the cell types considered by the model:

cell.types(DDLS.colon.lee) %>% paste0(collapse = " / ")

## [1] "Anti-inflammatory_MFs / B cells / CD4+ T cells / CD8+ T cells / ECs / ECs_tumor / Enterocytes / Epithelial cells / Epithelial_cancer_cells / MFs_SPP1+ / Mast cells / Myofibroblasts / NK cells / Pericytes / Plasma_cells / Pro-inflammatory_MFs / Regulatory T cells / Smooth muscle cells / Stromal cells / T follicular helper cells / cDC / gamma delta T cells"

Now, we can use it to deconvolute TCGA.colon.se samples as follows:

# deconvolution
deconvResults <- deconvDDLSPretrained(
  data = TCGA.colon.se,
  model = DDLS.colon.lee,
  normalize = TRUE
)

rownames(deconvResults) <- paste("Sample", seq(nrow(deconvResults)), sep = "_")

head(deconvResults)

deconvDDLSPretrained returns a data frame with samples as rows (\(i\)) and the cell types considered by the model as columns (\(j\)). Each entry corresponds to the proportion of cell type \(j\) in sample \(i\). To visually evaluate these results using a bar chart, you can use the barPlotCellTypes function as follows:

barPlotCellTypes(
  deconvResults, 
  title = "Results of deconvolution of TCGA colon samples", rm.x.text = TRUE
)

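Besides plotting, the table of proportions can be inspected directly. The sketch below uses a small invented matrix as a stand-in for the output of deconvDDLSPretrained (samples as rows, cell types as columns); the values and the object name props are assumptions for illustration, not real results. It checks that the proportions of each sample add up to approximately 1 and extracts the most abundant cell type per sample.

```r
# Toy stand-in for a deconvolution result: samples as rows, cell types as
# columns. The values are invented.
props <- matrix(
  c(0.60, 0.30, 0.10,
    0.20, 0.25, 0.55),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Sample_1", "Sample_2"),
    c("Epithelial_cancer_cells", "CD4+ T cells", "Stromal cells")
  )
)

# Proportions within a sample should add up to (approximately) 1
stopifnot(all(abs(rowSums(props) - 1) < 1e-6))

# Most abundant cell type in each sample
dominant <- colnames(props)[max.col(props, ties.method = "first")]
names(dominant) <- rownames(props)
dominant
```

With real results, the same calls could be applied to as.matrix(deconvResults), assuming all of its columns are numeric.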

Let’s take 40 random samples just to improve the visualization:

set.seed(123)
barPlotCellTypes(
  deconvResults[sample(seq_len(nrow(deconvResults)), size = 40), ],
  title = "Results of deconvolution of TCGA colon samples", rm.x.text = TRUE
)

Finally, deconvDDLSPretrained also offers two parameters for simplifying the results by aggregating the proportions of similar cell types: simplify.set and simplify.majority. For instance, we can merge different CD4+ T cell subtypes under a single label by using the simplify.set parameter as follows:

# deconvolution
deconvResultsSum <- deconvDDLSPretrained(
  data = TCGA.colon.se,
  model = DDLS.colon.lee,
  normalize = TRUE,
  simplify.set = list(
    `CD4+ T cells` = c(
      "CD4+ T cells", 
      "T follicular helper cells", 
      "gamma delta T cells", 
      "Regulatory T cells"
    )
  )
)

rownames(deconvResultsSum) <- paste("Sample", seq(nrow(deconvResultsSum)), sep = "_")

set.seed(123)
barPlotCellTypes(
  deconvResultsSum[sample(seq_len(nrow(deconvResultsSum)), size = 40), ],
  title = "Results of deconvolution of TCGA colon samples", rm.x.text = TRUE
)

On the other hand, simplify.majority does not create new classes: for each sample, it adds the proportions of the listed cell types to whichever of them is the most abundant in that sample. See the documentation for more details.
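The difference between the two strategies can be pictured with a base R sketch on invented proportions. This is only one reading of the descriptions above, not the package's implementation; the objects props, group, as.set and as.majority, and the label "T cells (grouped)", are hypothetical names introduced for the example.

```r
# Invented proportions: samples as rows, cell types as columns
props <- matrix(
  c(0.50, 0.20, 0.10, 0.20,
    0.40, 0.05, 0.25, 0.30),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Sample_1", "Sample_2"),
    c("Epithelial cells", "CD4+ T cells", "Regulatory T cells", "B cells")
  )
)

# Cell types to be aggregated
group <- c("CD4+ T cells", "Regulatory T cells")

# simplify.set-like behaviour: the group is replaced by one new column
# holding the sum of its members
as.set <- cbind(
  props[, !colnames(props) %in% group, drop = FALSE],
  "T cells (grouped)" = rowSums(props[, group, drop = FALSE])
)

# simplify.majority-like behaviour: no new column; within each sample the
# group total is assigned to the most abundant member, the others become 0
as.majority <- props
for (s in rownames(props)) {
  winner <- group[which.max(props[s, group])]
  total <- sum(props[s, group])
  as.majority[s, group] <- 0
  as.majority[s, winner] <- total
}

as.set
as.majority
```

In both cases the proportions of each sample still add up to 1; only the way the grouped cell types are labeled changes.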