Malu Calle and Toni Susin

Website of the project with examples and tutorials:


Understanding the role of the microbiome in human health and how it can be modulated is becoming increasingly relevant for preventive medicine and for the medical management of chronic diseases (Calle 2019). High-throughput sequencing technologies have boosted microbiome research, but the compositional nature of microbiome data is a major challenge for their analysis.

Microbiome count data are compositional because their total is constrained by the sequencing depth. Relative abundances (proportions) are likewise constrained to sum to one. This total constraint induces strong dependencies among the observed abundances of the different taxa. In fact, neither the absolute abundance (read counts) nor the relative abundance (proportion) of one taxon alone is informative of the real abundance of that taxon in the environment. Instead, they provide information on the relative measure of abundance when compared to the abundance of other taxa in the same sample.
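The effect of the total constraint can be illustrated with a small numerical sketch. The example below is written in plain Python for illustration only and is not part of coda4microbiome; the abundances and the helper `closure` are hypothetical. It shows that the proportion of a taxon changes when another taxon changes, while the log-ratio between two unchanged taxa stays the same.

```python
import math

def closure(abundances):
    """Rescale abundances to proportions that sum to one."""
    total = sum(abundances)
    return [a / total for a in abundances]

# Hypothetical absolute abundances of taxa A, B and C in two environments.
# Only taxon C differs: it is ten times more abundant in env_2.
env_1 = [100.0, 200.0, 700.0]
env_2 = [100.0, 200.0, 7000.0]

p1 = closure(env_1)
p2 = closure(env_2)

# The proportion of A drops although its real abundance is unchanged.
print("proportion of A:", p1[0], p2[0])

# The log-ratio between A and B is the same in both samples.
lr1 = math.log(p1[0] / p1[1])
lr2 = math.log(p2[0] / p2[1])
print("log(A/B):", lr1, lr2)
```

Because the common total cancels in any ratio, log-ratios carry the same information whether they are computed from counts or from proportions, which is why they are the basic building block of CoDA.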

We introduce a new package, coda4microbiome, that aims to bridge the gap between microbiome research and compositional data analysis (CoDA).

Package functionality

Our package provides a set of functions to explore and study microbiome data within the CoDA framework, with a special focus on the identification of microbial signatures that can serve as biomarkers of disease risk and prognosis. The prediction accuracy of a signature relies on the selection of the taxa that constitute it, which is challenging given the inherent sparsity, multivariate structure and compositional nature of microbiome data (Susin et al. 2020).

coda4microbiome performs variable selection through penalized regression in cross-sectional studies, with both binary and continuous outcomes. In addition, the package incorporates a new approach for the analysis of longitudinal microbiome studies with a binary outcome.

The penalized regression implementation relies on the function cv.glmnet() from the R package glmnet (Friedman et al. 2010), adapted to CoDA by using all pairwise log-ratios of the variables (Bates and Tibshirani 2019). The results are expressed as the (weighted) balance between two groups of taxa: those that contribute positively to the microbial signature and those that contribute negatively (Susin et al. 2020).
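The link between a model on pairwise log-ratios and a balance between two groups of taxa can be sketched as follows. This is a plain Python illustration under stated assumptions, not the package's code: the sample `x`, the sparse coefficients `beta` and the helper names are hypothetical, and the coefficients are simply given rather than estimated by penalized regression. Each log-ratio coefficient is added to the taxon in the numerator and subtracted from the taxon in the denominator, so the per-taxon coefficients sum to zero and split the selected taxa into a positive and a negative group.

```python
import math
from itertools import combinations

def pairwise_log_ratios(x):
    """Return {(j, k): log(x[j] / x[k])} for all pairs j < k."""
    return {(j, k): math.log(x[j] / x[k])
            for j, k in combinations(range(len(x)), 2)}

def taxon_coefficients(beta, n_taxa):
    """Collapse log-ratio coefficients into one coefficient per taxon."""
    theta = [0.0] * n_taxa
    for (j, k), b in beta.items():
        theta[j] += b  # taxon j is in the numerator of log(x_j / x_k)
        theta[k] -= b  # taxon k is in the denominator
    return theta

# Hypothetical zero-free abundances of four taxa in one sample.
x = [10.0, 40.0, 5.0, 45.0]
# Hypothetical sparse coefficients, as a penalized fit might select.
beta = {(0, 1): 0.8, (0, 3): 0.3, (1, 2): -0.5}

# Score computed directly from the selected log-ratios.
lr = pairwise_log_ratios(x)
score_lr = sum(b * lr[pair] for pair, b in beta.items())

# The same score written as a weighted sum of log-abundances.
theta = taxon_coefficients(beta, len(x))
score_lc = sum(t * math.log(v) for t, v in zip(theta, x))

positive = [j for j, t in enumerate(theta) if t > 0]
negative = [j for j, t in enumerate(theta) if t < 0]
print("taxon coefficients:", theta)
print("positive group:", positive, "negative group:", negative)
print("scores agree:", abs(score_lr - score_lc) < 1e-9)
```

Because the per-taxon coefficients sum to zero, the score is unchanged if the sample is rescaled, for example from counts to proportions.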

The interpretability of results is of major importance in this context. The package provides several graphical representations for a better interpretation of the analysis and the identified microbial signatures.

Functions for microbial signature identification

Functions for log-ratio exploratory analysis

Before, or independently of, variable selection for microbial signature identification, one may be interested in an exploratory analysis of the pairwise log-ratios.

Interpreting the results of a log-ratio analysis is challenging: when a taxon A is highly associated with the outcome Y, any log-ratio involving taxon A is likely to be associated with Y, regardless of which second taxon is involved in the log-ratio. Here we summarize the importance of each taxon A by aggregating the prediction accuracy of all log-ratios that involve taxon A.
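A minimal sketch of this kind of per-taxon summary is given below in plain Python. It is an illustration only: the pairwise accuracy values are invented, the function name is hypothetical, and aggregating by a simple sum is an assumption that may differ from what the package actually does.

```python
# Hypothetical prediction accuracy (for example, an AUC) of each pairwise
# log-ratio between four taxa. Taxon A is strongly associated with the
# outcome, so every log-ratio that involves A scores high.
pair_score = {
    ("A", "B"): 0.85, ("A", "C"): 0.82, ("A", "D"): 0.90,
    ("B", "C"): 0.55, ("B", "D"): 0.60, ("C", "D"): 0.52,
}

def taxon_importance(pair_score):
    """Sum the scores of all log-ratios in which each taxon takes part."""
    totals = {}
    for (t1, t2), score in pair_score.items():
        totals[t1] = totals.get(t1, 0.0) + score
        totals[t2] = totals.get(t2, 0.0) + score
    return totals

importance = taxon_importance(pair_score)
ranking = sorted(importance, key=importance.get, reverse=True)
print(importance)
print("taxa ranked by aggregated accuracy:", ranking)
```

Aggregating per taxon separates A, which drives the association, from B, C and D, which only score well when paired with A.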

Supplementary functions


Bates S, Tibshirani R (2019) Log-ratio lasso: Scalable, sparse estimation for log-ratio models. Biometrics 75(2):613-624.

Calle ML (2019) Statistical analysis of metagenomics data. Genomics & Informatics 17(1).

Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1):1-22.

Susin A, Wang Y, Lê Cao K-A, Calle ML (2020) Variable selection in microbiome compositional data analysis. NAR Genomics and Bioinformatics 2(2).