ViralEntropR: A Computational Pipeline for Entropy-Informed Detection of Emerging Viral Variants

Implements an entropy-informed pipeline for detecting emerging variants in viral amino acid sequence data, extending prior clustering-based approaches, including hemagglutinin clustering methods (Li et al., 2015) <doi:10.1142/9789814667944_0018>. Provides a fully vectorized FASTA preprocessing toolkit covering header parsing, two-pass date and country extraction, ambiguous-residue filtering, and integer encoding under a 25-symbol amino acid alphabet. Computes per-site Shannon entropy across user-defined cumulative, sliding, or disjoint temporal partitions and clusters the per-site entropy values using Gaussian mixture models via 'mclust' (Scrucca et al., 2016) <doi:10.32614/RJ-2016-021>. Quantifies temporal distributional shifts between partitions using the Hellinger distance (van der Vaart, 1998) <doi:10.1017/CBO9780511802256>, and detects temporal change points non-parametrically, using either energy statistics (Matteson and James, 2014) <doi:10.1080/01621459.2013.849605> via 'ecp' or wild binary segmentation (Fryzlewicz, 2014) <doi:10.1214/14-AOS1245> via 'HDcpDetect'. Per-site amino acid frequency tables and entropy trajectory plots characterize sequence composition and evolutionary dynamics across time. A configurable multi-variant simulation engine generates synthetic sequence time series with known ground truth for benchmarking detection pipelines. A curated dataset of SARS-CoV-2 Variants of Concern and Variants of Interest with associated lineage and surveillance metadata is included, along with a bundled National Center for Biotechnology Information (NCBI) Spike protein sample and vignettes demonstrating the full workflow.
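The two core quantities named in the description, per-site Shannon entropy and the Hellinger distance between temporal partitions, can be illustrated in base R. The following is a minimal sketch on a hypothetical toy alignment and is not the package's own implementation: the helper names `site_freq`, `shannon_entropy`, and `hellinger`, the matrices `early` and `late`, and the choice of bits as the entropy unit are all assumptions made for illustration. Consult the reference manual for the functions the package actually exports.

```r
# Hypothetical toy data: rows are sequences, columns are alignment sites.
# Residues are integer-encoded; the specific codes are arbitrary here.
early <- matrix(c(1L, 2L, 3L,
                  1L, 2L, 3L,
                  1L, 2L, 4L,
                  1L, 5L, 4L), ncol = 3, byrow = TRUE)
late  <- matrix(c(1L, 5L, 4L,
                  1L, 5L, 4L,
                  1L, 5L, 3L,
                  1L, 2L, 4L), ncol = 3, byrow = TRUE)

n_symbols <- 25L  # alphabet size quoted in the package description

# Relative frequency of every symbol at one site (vector of length k).
site_freq <- function(x, k = n_symbols) {
  tabulate(x, nbins = k) / length(x)
}

# Shannon entropy in bits (assumed unit); symbols with zero
# probability are dropped because they contribute nothing to the sum.
shannon_entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Hellinger distance between two discrete distributions on the same
# support; it is 0 for identical distributions and at most 1.
hellinger <- function(p, q) {
  sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)
}

# Per-site entropy within each temporal partition.
entropy_early <- apply(early, 2, function(x) shannon_entropy(site_freq(x)))
entropy_late  <- apply(late,  2, function(x) shannon_entropy(site_freq(x)))

# Per-site distributional shift between the two partitions.
site_shift <- vapply(seq_len(ncol(early)), function(j) {
  hellinger(site_freq(early[, j]), site_freq(late[, j]))
}, numeric(1))
```

In this toy example the first site is fully conserved in both partitions, so its entropy and its Hellinger distance are both zero, while the second site keeps the same entropy but swaps its dominant residue, which only the Hellinger distance registers. That difference is why a pipeline of this kind would track both quantities rather than entropy alone.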

Version: 0.6.2
Depends: R (≥ 3.5.0)
Imports: ggplot2 (≥ 3.4.0), grDevices, HDcpDetect, ecp, kableExtra, lubridate, magrittr, mclust, rlang, stats, stringr, utils, zoo
Suggests: Biostrings, DT, dplyr, here, knitr, readxl, rmarkdown, R.rsp, testthat (≥ 3.0.0)
Published: 2026-05-30
DOI: 10.32614/CRAN.package.ViralEntropR (may not be active yet)
Author: Vadim Tyuryaev [aut, cre], Jane Heffernan [aut], Hanna Jankowski [aut]
Maintainer: Vadim Tyuryaev <vadim.tyuryaev at gmail.com>
BugReports: https://github.com/vadimtyuryaev/ViralEntropR/issues
License: MIT + file LICENSE
URL: https://github.com/vadimtyuryaev/ViralEntropR, https://doi.org/10.5281/zenodo.19040165, https://vadimtyuryaev.github.io/ViralEntropR/
NeedsCompilation: no
Language: en-GB
Materials: README, NEWS
CRAN checks: ViralEntropR results

Documentation:

Reference manual: ViralEntropR.html, ViralEntropR.pdf
Vignettes: Unsupervised Recovery of SARS-CoV-2 Variant Structure via Entropy-Driven Site Selection and PAM Clustering: Precision, Recall, and F1 Evaluation Across Wild-Type and Delta-Dominated Surveillance Periods (source)
Entropy Clustering, Hellinger Distance, and Change Point Analysis for Emerging Viral Variant Detection: A Simulation Study (source)
NCBI SARS-CoV-2 Spike Protein Sequence Preprocessing: From Raw FASTA to an Analysis-Ready Integer-Encoded Matrix (source)

Downloads:

Package source: ViralEntropR_0.6.2.tar.gz
Windows binaries: r-devel: not available, r-release: not available, r-oldrel: not available
macOS binaries: r-release (arm64): not available, r-oldrel (arm64): not available, r-release (x86_64): not available, r-oldrel (x86_64): not available

Linking:

Please use the canonical form https://CRAN.R-project.org/package=ViralEntropR to link to this page.