1 Introduction

This vignette provides a comprehensive, reproducible, step-by-step walkthrough of the ViralEntropR pipeline:

  • Simulate a realistic viral evolutionary timeline with four competing variants (2020–2021).
  • Partition the data into non-overlapping 2-month time windows.
  • Calculate Hellinger distance matrices to quantify distributional shifts.
  • Detect evolutionary change points using three complementary ECP approaches.

For each partition, we calculate site-wise Shannon entropies, cluster these entropy profiles using a Gaussian Mixture Model (GMM), and identify which sites fall into the highest-entropy class. We then compute Hellinger distances between partitions using the first partition as the reference. If there are 12 partitions there are 11 comparisons; if a variant emerges in partition 3, the correct change point is index 2 (difference between partition 3 and partition 1).

2 Setup

library(ViralEntropR)
library(dplyr)
library(lubridate)
library(ecp)
library(knitr)
library(kableExtra)

3 Reference Sequence

ref_sequence <- "MKTIIALSYIFCLVFADYKDDDDK"
sprintf("There are %d sites in the reference sequence", nchar(ref_sequence))
#> [1] "There are 24 sites in the reference sequence"

4 Simulation via simulate_variant_evolution()

Parameter Description Default
ref_sequences data.frame which must include a Date column; OR a single sequence string.
n_ref_months number of months for the reference phase 3
start_date simulation start (character or Date)
end_date simulation end (character or Date)
variants_config integer vector, e.g., c(2,6): Variant 1 gets 2 mutations; Variant 2 gets 6.
variant_intervals gap (in months) between the emergence of consecutive variants
n_new_mutations de-novo sequences produced at a variant’s first appearance 1
mutation_rate monthly growth multiplier per variant (scalar or per-variant vector) 2
mutation_rate_variability fractional spread ± around mutation_rate; 0 = deterministic growth 0.25
deleterious_rate probability that any new mutation is deleterious 0
n_deleterious_limit maximum sequences allowed to carry a deleterious mutation 1
n_sequences_total total sequences generated over the mutation phase (spread evenly per month) 140
ref_variability if TRUE, introduce variability at sites 1 and 2 of the reference pool FALSE
n_ref_sequences when ref_sequences is a single string, number of reference rows to create 100
prob_deletion_event per-month probability of injecting a block of sequences with a random deletion 0
n_rows_to_delete sequences affected when prob_deletion_event fires 0
seed optional random seed for reproducibility NULL

4.1 Simulate Data

In this simulation, simulate_variant_evolution() models the stepwise emergence and competition of four viral variants over a two-year period from 2020-01-01 through 2021-12-01. We begin with a 4-month reference phase (n_ref_months = 4) in which 200 total reference sequences are sampled (50 per month, n_ref_sequences = 200), and variability is introduced at sites 1 and 2 (ref_variability = TRUE). The mutation phase then generates 1,000 sequences (n_sequences_total = 1000) at 50 per month. Four variants emerge at months 5, 9, 14 and 19, carrying 2, 3, 4 and 2 mutations respectively (variants_config = c(2,3,4,2)), with 2 de-novo sequences at each emergence (n_new_mutations = 2) and growth multipliers of 1.25 per variant (mutation_rate). Although deleterious_rate = 0 for variant-originated deleterious mutations, each month has a 20% chance (prob_deletion_event = 0.2) of injecting a block of 3 sequences with a random deletion at one site (n_rows_to_delete = 3). Finally, seed = 2025 ensures reproducibility. The returned list includes Simulation_Output (with per-sequence Delet flags), Variant_Details, Simulation_Dates, and Delet_Records for downstream analysis.

Variants compete in a pairwise fashion, always pitting the newest active variant against its immediate predecessor (or the original reference when only one variant has emerged). During months before any variant appears, all sequences are drawn from the reference pool (the wild-type row plus up to two reference-variability rows when ref_variability = TRUE). Once Variant 1 emerges, each month’s batch is composed of its de-novo copies plus sampled reference sequences; when Variant 2 appears it competes directly with Variant 1, and so on — Variant 3 competes against Variant 2, Variant 4 against Variant 3. The fill pool (sequences that pad the monthly quota after the two competing variants are allocated) is composed of the reference rows plus any older non-competing variants. Sampling weights are assigned so that all reference rows carry weight 1, while Variant i rows carry weight 2^(i+1), ensuring that more recently emerged variants have higher selection probability in subsequent months. Growth is stochastic: each variant’s monthly count is drawn from Uniform(mult*(1 - variability), mult*(1 + variability)) so that mutation_rate_variability > 0 introduces realistic fluctuation.

sim_result <- simulate_variant_evolution(
  ref_sequences             = ref_sequence,            # single sequence string
  n_ref_months              = 4,                       # 4-month reference phase
  start_date                = "2020-01-01",
  end_date                  = "2021-12-01",
  variants_config           = c(2, 3, 4, 2),           # 4 variants: 2,3,4,2 mutations
  variant_intervals         = c(4, 5, 5),              # V2 at month 9; V3 at 14; V4 at 19
  n_new_mutations           = 2,                       # 2 de-novo sequences at emergence
  mutation_rate             = c(1.25, 1.25, 1.25, 1.25), # per-variant growth multipliers
  mutation_rate_variability = 0,                       # deterministic growth (no spread)
  deleterious_rate          = 0,                       # no within-variant deleterious mutations
  n_deleterious_limit       = 0,
  n_sequences_total         = 1000,                    # 1000/(24-4) = 50 per period
  ref_variability           = TRUE,                    # variability at sites 1 and 2
  n_ref_sequences           = 200,                     # 200/4 = 50 reference rows per month
  prob_deletion_event       = 0.2,                     # 20% per-month deletion probability
  n_rows_to_delete          = 3,                       # 3 sequences affected if event fires
  seed                      = 2025
)

4.2 Check the Pool

The pool returned is the fill pool from the last multi-variant competition period. Column 25 = Variant, column 26 = Phase, column 30 = sampling weight.

sim_result$Pool[, c(25, 26, 30)]

4.3 Emergence Periods and Sites

print("Variant emergence period:")
#> [1] "Variant emergence period:"
sim_result$Variant_Details[[1]]$em
#> [1] 5
print("Variant 1 sites:")
#> [1] "Variant 1 sites:"
sim_result$Variant_Details[[1]]$pos %>% sort()
#> [1] 12 15
# Month 9 translates into partition 5.
print("Variant 2 emergence period:")
#> [1] "Variant 2 emergence period:"
sim_result$Variant_Details[[2]]$em
#> [1] 9
print("Variant 2 sites:")
#> [1] "Variant 2 sites:"
sim_result$Variant_Details[[2]]$pos %>% sort()
#> [1]  6 13 23
# Month 14 translates into partition 7.
print("Variant 3 emergence period:")
#> [1] "Variant 3 emergence period:"
sim_result$Variant_Details[[3]]$em
#> [1] 14
print("Variant 3 sites:")
#> [1] "Variant 3 sites:"
sim_result$Variant_Details[[3]]$pos %>% sort()
#> [1]  5 14 16 24
# Month 19 translates into partition 10.
print("Variant 4 emergence period:")
#> [1] "Variant 4 emergence period:"
sim_result$Variant_Details[[4]]$em
#> [1] 19
print("Variant 4 sites:")
#> [1] "Variant 4 sites:"
sim_result$Variant_Details[[4]]$pos %>% sort()
#> [1]  7 22

4.4 Deleterious Periods and Sites

sprintf("The number of deleterious periods is %d",
        length(sim_result$Delet_Records))
#> [1] "The number of deleterious periods is 1"

for (i in seq_along(sim_result$Delet_Records)) {
  cat(sprintf("Deleterious Period %d: month %d, site %d\n",
              i,
              sim_result$Delet_Records[[i]]$period,
              sim_result$Delet_Records[[i]]$site))
}
#> Deleterious Period 1: month 9, site 11

4.5 Inspecting the Simulation Output

The sections below walk through selected time points to verify that the simulation is behaving as expected. Each month’s batch contains 50 sequences (reference + variant). Key columns to watch are the mutated sites (which will show a non-reference amino acid for variant sequences), Variant (which strain each row belongs to), Phase (R = reference wild-type, RV = reference with added site-1/2 variability, M1/M2… = mutation phase), and Delet (flags sequences affected by a random deletion event).

4.6 Reference Phase 1 (month 1)

During the reference phase only wild-type sequences are present. Phase RV denotes reference sequences with low-level variability artificially introduced at sites 1 and 2 (ref_variability = TRUE) to mimic the background noise typical of real surveillance data.

sim_result$Simulation_Output[1:10, ]

4.7 Reference Phase 4 (month 4)

The final reference month before any variant emerges. All sequences should be Reference strain, with Phase R or RV. This establishes the baseline amino acid distribution against which downstream Hellinger distances will be measured.

sim_result$Simulation_Output[sim_result$Simulation_Output$Period == 4, ][1:10, ]

4.8 Variant 1 Phase 1 (month 5): sites 12, 15

Variant 1 emerges with 2 de-novo sequences (n_new_mutations = 2). The mutated sites carry a different amino acid to the reference. The remaining ~48 sequences are sampled from the reference pool. Notice that the variant sequences are a very small minority at this point — this is the detection challenge: identifying emergence signal against a large reference background. Reference rows can be either R or RV.

sim_result$Simulation_Output[
  sim_result$Simulation_Output$Period == 5,
  c(sim_result$Variant_Details[[1]]$pos %>% sort(), 25:29)
][1:10, ]

4.9 Variant 1 Phase 2 (month 6): sites 12, 15

After one month of growth at rate 1.25, the number of Variant 1 sequences increases to 3. The variant is still a small fraction of the population, but its mutated sites are accumulating signal in the entropy profile.

sim_result$Simulation_Output[
  sim_result$Simulation_Output$Period == 6,
  c(sim_result$Variant_Details[[1]]$pos %>% sort(), 25:29)
][1:10, ]

4.10 Variant 1 Phase 4 (month 8): sites 12, 15

After three months of compounding growth (ceiling(2 * 1.25^3)), Variant 1 now contributes 5 sequences. Its mutated sites are now consistently different from the reference, and site entropy at those positions is rising — setting the stage for detection in the next partition.

sim_result$Simulation_Output[
  sim_result$Simulation_Output$Period == 8,
  c(sim_result$Variant_Details[[1]]$pos %>% sort(), 25:29)
][1:10, ]

4.11 Variant 2 Phase 1 (month 9): sites 6, 13, 23

Variant 2 emerges with 2 de-novo sequences. Variants 1 and 2 are now in direct pairwise competition. The remainder of the 50 sequences is sampled from the fill pool (reference rows only at this stage, since there are no older non-competing variants yet). Note that a random deletion event also fires this month at site 11 — those affected rows have Delet = "Yes" and carry a randomly chosen amino acid substitution at that site.

sim_result$Simulation_Output[sim_result$Simulation_Output$Period == 9, ]

4.12 Variant 2 Phase 5 (month 13): sites 6, 13, 23

By month 13, both Variant 1 and Variant 2 have been growing for several months and together dominate the monthly batch. The reference pool contributes only the remaining fill sequences. The population is now clearly multi-strain, with entropy elevated at the mutation sites of both variants.

sim_result$Simulation_Output[sim_result$Simulation_Output$Period == 13, ]

4.13 Variant 3 Phase 1 (month 14): sites 5, 14, 16, 24

Variant 3 emerges and begins competing against Variant 2. Variant 1 is now relegated to the fill pool with an exponentially increasing sampling weight (2^(1+1) = 4), meaning it still appears but at lower frequency than the two active competitors.

sim_result$Simulation_Output[sim_result$Simulation_Output$Period == 14, ]

5 Sites Detection and Change Point Detection

5.1 Encoding

toy_example <- sim_result$Simulation_Output
n_col       <- nchar(ref_sequence)

colnames(toy_example)[seq_len(n_col)] <- as.character(seq_len(n_col))

# Encode character sequences to integers (A=1 … V=20, ambiguous 21–25, unknown=0)
char_mat   <- as.matrix(toy_example[, seq_len(n_col)])
AL_Cat_mix <- as.data.frame(encode_aa_sequence(char_mat))
AL_Cat_mix[] <- lapply(AL_Cat_mix, as.integer)

# Standardise Date column to the first day of each month
AL_Cat_mix$Date <- as.Date(format(as.Date(toy_example$Date), "%Y-%m-01"))

5.1.1 View Post-Processed Matrix

AL_Cat_mix[1:10, ]

5.2 Partitioning

24 total months with a 4-month reference phase and 2-month non-overlapping windows give 12 partitions.

How partitioning works: Each partition contains all sequences observed within its time window. For each partition, partition_time_windows() computes the Shannon entropy at every site and then calls cluster_sites_by_entropy() to group sites into entropy classes via a Gaussian Mixture Model (GMM). The resulting DataFrame has three columns:

  • sites – the site index (position in the sequence)
  • entropies – the Shannon entropy at that site for this partition
  • class – the raw GMM cluster label assigned by Mclust

Important: At this stage class carries raw Mclust labels ordered by ascending component mean — class 1 corresponds to the lowest-entropy group and the highest class integer corresponds to the highest-entropy group. Function relabel_entropy_classes() inverts this convention so that class 1 consistently denotes the highest-entropy group, making cross-partition comparison of class labels unambiguous. Sites with zero entropy (invariant across all sequences in the window) are excluded from GMM clustering and will not appear in the DataFrame.

What to watch for: As variants emerge and grow, their mutation sites should accumulate increasing entropy across successive partitions, eventually clustering into the highest-entropy group. The sentinel class 999 indicates all remaining sites have identical entropy (one undifferentiated group).

part_data <- partition_time_windows(
  data          = AL_Cat_mix,
  n_sites       = n_col,
  window_length = 2,
  window_type   = 3,       # non-overlapping / jumping
  start_date    = "2020-01-01",
  end_date      = "2022-01-01"
)

5.2.1 Dates Included in the First Partition

part_data$Partitions[[1]]$Date %>% unique() %>% sort()
#> [1] "2020-01-01" "2020-02-01"

5.2.2 Dates Included in the Last Partition

part_data$Partitions[[part_data$N_partitions]]$Date %>% unique() %>% sort()
#> [1] "2021-11-01" "2021-12-01"

5.2.3 Sites and Classes in Partition 1 (Jan–Feb 2020, reference only)

During the reference phase all sites are invariant or near-invariant – no variant signal is present. Most sites are excluded (zero entropy) or form a single low-entropy group.

part_data$Clusters[[1]]$DataFrame

5.2.4 Sites and Classes in Partition 2 (Mar–Apr 2020, reference only)

part_data$Clusters[[2]]$DataFrame

5.2.5 Sites and Classes in Partition 3 (May–Jun 2020) vs. Variant 1 sites 12, 15

Variant 1 emerged in month 5 (May 2020). After two months of growth its mutation sites should begin to show elevated entropy relative to other sites. Watch for the Variant 1 mutation sites appearing with higher class labels.

part_data$Clusters[[3]]$DataFrame

5.2.6 Sites and Classes in Partition 4 (Jul–Aug 2020)

part_data$Clusters[[4]]$DataFrame

5.2.7 Sites and Classes in Partition 5 (Sep–Oct 2020) vs. Variant 1 12, 15 and Variant 2 6, 13, 23

Variant 2 emerged in month 9 (Sep 2020). Both Variant 1 and Variant 2 mutation sites should now be detectable as high-entropy clusters.

part_data$Clusters[[5]]$DataFrame

5.2.8 Sites and Classes in Partition 6 (Nov–Dec 2020) vs. Variant 1 12, 15 and Variant 2 6, 13, 23

part_data$Clusters[[6]]$DataFrame

5.2.9 Sites and Classes in Partition 7 (Jan–Feb 2021) vs. Variants 1–3

Variant 3 emerged in month 14 (Jan 2021). Its mutation sites (5, 14, 16, 24) should begin appearing as a new elevated-entropy cluster alongside the existing variant sites.

part_data$Clusters[[7]]$DataFrame

5.2.10 Sites and Classes in Partition 8 (Mar–Apr 2021) vs. Variants 1–3

part_data$Clusters[[8]]$DataFrame

5.2.11 Sites and Classes in Partition 9 (May–Jun 2021) vs. Variants 1–3

part_data$Clusters[[9]]$DataFrame

5.2.12 Sites and Classes in Partition 10 (Jul–Aug 2021) vs. Variants 1–4

Variant 4 emerged in month 19 (Jul 2021). Its mutation sites (7, 22) add to the high-entropy pool. With four active variants, the entropy landscape is now complex and multi-modal.

part_data$Clusters[[10]]$DataFrame

5.2.13 Sites and Classes in Partition 11 (Sep–Oct 2021) vs. Variants 1–4

part_data$Clusters[[11]]$DataFrame

5.2.14 Sites and Classes in Partition 12 (Nov–Dec 2021) vs. Variants 1–4

By the final partition all four variants have been circulating for several months. The highest-entropy cluster should contain a union of mutation sites from all four variants.

part_data$Clusters[[12]]$DataFrame

5.3 Hellinger Distance (HD)

The Hellinger distance quantifies the shift in amino acid distribution at each site between a reference time point (partition T1) and each subsequent partition. A distance of 0 means the distribution is identical to the reference; a distance approaching 1 (when normalized) means the distribution has changed maximally. Sites where a new variant has become prevalent will show large, sustained Hellinger distances after the variant’s emergence partition.

This is the key signal used by the change point detector: a sudden, sustained increase in Hellinger distance at a set of sites is the signature of a new variant entering the population.

5.3.1 Calculate HD

all_sites <- seq_len(n_col)

hellinger_mat <- calculate_hellinger_matrix(
  partitions = part_data$Partitions,
  sites      = all_sites,
  aa_levels  = 25L   # 25-character alphabet: 20 standard AAs + B, Z, X, *, -
)

5.3.2 Examine HD Structure: 24 sites x 11 partition differences

The matrix has one row per site and one column per partition comparison (T2 vs T1, T3 vs T1, …, T12 vs T1). Rows corresponding to Variant 1 mutation sites (12, 15) should show a step-change in Hellinger distance starting around column T3 (the partition covering Variant 1’s first full window of growth).

dim(hellinger_mat)
#> [1] 24 11
hellinger_mat
#>    T2         T3         T4        T5        T6        T7         T8         T9        T10       T11       T12
#> 1   0 0.10552099 0.02029592 0.3413535 0.1638659 0.1213214 0.05721778 0.03928696 0.00000000 0.1055210 0.1471665
#> 2   0 0.04610506 0.05945211 0.3433336 0.2362383 0.2097724 0.13660090 0.02415490 0.08537268 0.0000000 0.1011473
#> 3   0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 4   0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 5   0 0.00000000 0.00000000 0.0000000 0.0000000 0.1417780 0.26696413 0.35190012 0.47155182 0.6125294 0.8022057
#> 6   0 0.00000000 0.00000000 0.2250358 0.3035154 0.4086193 0.53962881 0.70002365 0.65207241 0.5287106 0.4346527
#> 7   0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.22503584 0.3035154 0.4086193
#> 8   0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 9   0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 10  0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 11  0 0.00000000 0.00000000 0.1738633 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 12  0 0.22503584 0.30351540 0.4086193 0.5396288 0.6810072 0.64229994 0.52871065 0.44721360 0.4217977 0.2468361
#> 13  0 0.00000000 0.00000000 0.2250358 0.3035154 0.4086193 0.53962881 0.70002365 0.65207241 0.5287106 0.4346527
#> 14  0 0.00000000 0.00000000 0.0000000 0.0000000 0.1417780 0.26696413 0.35190012 0.47155182 0.6125294 0.8022057
#> 15  0 0.22503584 0.30351540 0.4086193 0.5396288 0.6810072 0.64229994 0.52871065 0.44721360 0.4217977 0.2468361
#> 16  0 0.00000000 0.00000000 0.0000000 0.0000000 0.1417780 0.26696413 0.35190012 0.47155182 0.6125294 0.8022057
#> 17  0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 18  0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 19  0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 20  0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 21  0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 22  0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.22503584 0.3035154 0.4086193
#> 23  0 0.00000000 0.00000000 0.2250358 0.3035154 0.4086193 0.53962881 0.70002365 0.65207241 0.5287106 0.4346527
#> 24  0 0.00000000 0.00000000 0.0000000 0.0000000 0.1417780 0.26696413 0.35190012 0.47155182 0.6125294 0.8022057

5.3.3 Transpose for Change Point Detection

The change point algorithms expect a time x features matrix – one row per time point, one column per site. We transpose the Hellinger matrix to obtain this format. Each row is now a multivariate observation: the full vector of Hellinger distances across all 24 sites at that partition comparison. A change point corresponds to a partition where this vector suddenly shifts.

dat_mat_t <- t(hellinger_mat)
dat_mat_t
#>              1          2 3 4         5         6         7 8 9 10        11        12        13        14        15
#> T2  0.00000000 0.00000000 0 0 0.0000000 0.0000000 0.0000000 0 0  0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> T3  0.10552099 0.04610506 0 0 0.0000000 0.0000000 0.0000000 0 0  0 0.0000000 0.2250358 0.0000000 0.0000000 0.2250358
#> T4  0.02029592 0.05945211 0 0 0.0000000 0.0000000 0.0000000 0 0  0 0.0000000 0.3035154 0.0000000 0.0000000 0.3035154
#> T5  0.34135354 0.34333357 0 0 0.0000000 0.2250358 0.0000000 0 0  0 0.1738633 0.4086193 0.2250358 0.0000000 0.4086193
#> T6  0.16386589 0.23623829 0 0 0.0000000 0.3035154 0.0000000 0 0  0 0.0000000 0.5396288 0.3035154 0.0000000 0.5396288
#> T7  0.12132139 0.20977239 0 0 0.1417780 0.4086193 0.0000000 0 0  0 0.0000000 0.6810072 0.4086193 0.1417780 0.6810072
#> T8  0.05721778 0.13660090 0 0 0.2669641 0.5396288 0.0000000 0 0  0 0.0000000 0.6422999 0.5396288 0.2669641 0.6422999
#> T9  0.03928696 0.02415490 0 0 0.3519001 0.7000237 0.0000000 0 0  0 0.0000000 0.5287106 0.7000237 0.3519001 0.5287106
#> T10 0.00000000 0.08537268 0 0 0.4715518 0.6520724 0.2250358 0 0  0 0.0000000 0.4472136 0.6520724 0.4715518 0.4472136
#> T11 0.10552099 0.00000000 0 0 0.6125294 0.5287106 0.3035154 0 0  0 0.0000000 0.4217977 0.5287106 0.6125294 0.4217977
#> T12 0.14716651 0.10114733 0 0 0.8022057 0.4346527 0.4086193 0 0  0 0.0000000 0.2468361 0.4346527 0.8022057 0.2468361
#>            16 17 18 19 20 21        22        23        24
#> T2  0.0000000  0  0  0  0  0 0.0000000 0.0000000 0.0000000
#> T3  0.0000000  0  0  0  0  0 0.0000000 0.0000000 0.0000000
#> T4  0.0000000  0  0  0  0  0 0.0000000 0.0000000 0.0000000
#> T5  0.0000000  0  0  0  0  0 0.0000000 0.2250358 0.0000000
#> T6  0.0000000  0  0  0  0  0 0.0000000 0.3035154 0.0000000
#> T7  0.1417780  0  0  0  0  0 0.0000000 0.4086193 0.1417780
#> T8  0.2669641  0  0  0  0  0 0.0000000 0.5396288 0.2669641
#> T9  0.3519001  0  0  0  0  0 0.0000000 0.7000237 0.3519001
#> T10 0.4715518  0  0  0  0  0 0.2250358 0.6520724 0.4715518
#> T11 0.6125294  0  0  0  0  0 0.3035154 0.5287106 0.6125294
#> T12 0.8022057  0  0  0  0  0 0.4086193 0.4346527 0.8022057

5.4 Change Point Detection

We apply three complementary Energy Change Point (ECP) methods to dat_mat_t. All three operate on the same transposed Hellinger matrix but differ in their algorithmic approach. A change point here corresponds to a partition index at which the multivariate Hellinger distance vector shifts significantly – i.e., a partition where new mutation sites suddenly become highly variable, indicating a new variant has entered the population.

Ground truth for comparison: The four variants emerged at months 5, 9, 14 and 19, which map to partition indices 3, 5, 7 and 10 respectively (using 2-month non-overlapping windows). The correct change points in the Hellinger series are therefore at indices 2, 4, 6 and 9 (one step behind, since Hellinger at T_k measures the shift relative to T1, so the change appears one index before the partition where the variant is first detectable).

5.4.1 Method 1: Dynamic Programming via ks.cp3o applied to all availabe data

ks.cp3o uses dynamic programming to find the globally optimal segmentation up to K change points. It is deterministic and exact but requires specifying K (maximum number of change points) in advance. minsize = 2 ensures each segment contains at least 2 time points. This method is best used when the approximate number of variants is known.

Parameter Description Default
Z A T x d matrix: T time points, d features (sites).
K Maximum number of change points to detect. 1
minsize Minimum segment length (number of time points per segment). 30
verbose Print status updates during optimisation. FALSE
cp_m1 <- ks.cp3o(
  Z       = dat_mat_t,
  K       = 11,        # upper bound: at most 11 CPs in 11 partition differences
  minsize = 2,         # each segment must span at least 2 partitions
  verbose = FALSE
)

print("Optimal change point locations (cp_m1$estimates):")
#> [1] "Optimal change point locations (cp_m1$estimates):"
cp_m1$estimates
#> [1]  4  6  8 10

print("Full solution matrix across k = 1 to K change points (cp_m1$cpLoc):")
#> [1] "Full solution matrix across k = 1 to K change points (cp_m1$cpLoc):"
cp_m1$cpLoc
#> [[1]]
#> [1] 4
#> 
#> [[2]]
#> [1] 4 7
#> 
#> [[3]]
#> [1]  4  7 10
#> 
#> [[4]]
#> [1]  4  6  8 10
#> 
#> [[5]]
#> [1]  1  4  6  8 10

5.4.2 Method 2: Expanding Window via detect_changepoints_ecp

This method applies ks.cp3o repeatedly with an expanding window: starting from the first 3 partition differences and adding one time point at a time. It simulates a real-time surveillance scenario where data accumulates sequentially and you want to detect a change point as early as possible, without knowing how many variants will ultimately emerge. dynamic_k = TRUE sets K adaptively to nrow(window) - 2 at each step.

lower     <- 1
upper     <- 3
timesteps <- nrow(dat_mat_t) - upper

ECP_res <- detect_changepoints_ecp(
  data_matrix    = dat_mat_t,
  min_window     = lower,
  max_window     = upper,
  n_timesteps    = timesteps,
  rolling_window = FALSE,
  dynamic_k      = TRUE,
  minsize        = 2,
  verbose        = FALSE
)

# Union of all change points detected across all window sizes.
# Indicates which true change points the algorithm recovers
# at any stage of data accumulation (overall sensitivity).
ecp_indices <- ECP_res$ECP_est_list %>%
  unlist() %>%
  .[. > 1] %>%
  unique() %>%
  sort()

print("Change points detected at any window size (union):")
#> [1] "Change points detected at any window size (union):"
ecp_indices
#> [1]  3  4  5  6  7  8 10

# Early detection timeline: for each expanding window size, which true
# change points are detectable with the data available up to that point?
# Ground truth: variants emerged at partition indices 3, 5, 7, 10,
# corresponding to Hellinger change point indices 2, 4, 6, 9.
true_cps <- c(2, 4, 6, 9)

detection_timeline <- data.frame(
  window_size  = seq(upper, nrow(dat_mat_t)),
  detected_cps = sapply(ECP_res$ECP_est_list, function(cps) {
    detected <- intersect(true_cps, cps[cps > 1])
    if (length(detected) == 0L) "none" else paste(sort(detected), collapse = ", ")
  })
)

print("Early detection timeline — true change points recovered per window size:")
#> [1] "Early detection timeline — true change points recovered per window size:"
detection_timeline

5.4.3 Method 3: Agglomerative Hierarchical ECP via e.agglo

e.agglo applies agglomerative hierarchical clustering directly to the multivariate time series, merging adjacent segments iteratively until a global goodness-of-fit criterion is maximised. Unlike ks.cp3o, it does not require the number of change points K to be specified in advance, making it well suited to surveillance settings where the number of emerging variants is unknown. Setting penalty = function(cps) 0 imposes no regularisation penalty on model complexity, allowing the segmentation to be driven entirely by the energy distance signal in the data.

The cluster output assigns each time point to a segment, providing an intuitive segmentation of the surveillance timeline that can be directly compared to known variant emergence periods.

Parameter Description Default
X A T x d matrix: T time points, d features (sites).
member Initial segment membership vector (one integer per time point). 1:nrow(X)
alpha Moment index for the energy distance metric (1 = first moment). 1
penalty Penalty function applied to the goodness-of-fit statistic.
cp_m3 <- e.agglo(
  X       = dat_mat_t,
  member  = seq_len(nrow(dat_mat_t)),  # naive: treat each time point as its own segment
  alpha   = 1,
  penalty = function(cps) { 0 }        # no penalty: let the data drive the segmentation
)

print("Change points estimated by method 3 (naive, no penalty):")
#> [1] "Change points estimated by method 3 (naive, no penalty):"
cp_m3$estimates
#> [1]  1  4  9 12

print("Segment membership per time point:")
#> [1] "Segment membership per time point:"
cp_m3$cluster
#>  [1] 1 1 1 2 2 2 2 2 3 3 3

5.4.4 Comparing Methods Against Ground Truth

Overall, no single method recovers all four true change points cleanly, which is expected given the simulation design: stochastic deletion events and the exponential growth competition model create a complex, heterogeneous signal. Method 3 is notable for detecting the latest true change point (9) which the other methods miss, while Method 1 provides the cleanest intermediate detection (4 and 6) and identifies latest change point (9) one period later (10).

Method 2 detects indices 4 and 6 but only once a sufficient number of time points have accumulated in the expanding window — a direct consequence of sample size. This lag is not a failure of the algorithm but an inherent property of prospective surveillance: reliable change point detection requires sufficient sample mass to warrant a shift in the multivariate distribution. A formal analysis of the relationship between sample size and detection reliability, including scenario comparisons across varying sequence counts and variant growth rates, is provided in analysis/Sample_Size_Simulation_Study/.

6 Conclusion

This vignette demonstrates the complete ViralEntropR detection pipeline on a fully controlled, traceable example. By constructing a synthetic evolutionary timeline with known variant emergence points, mutation sites, and stochastic perturbations, we establish a clear ground truth against which all pipeline components can be evaluated directly. The simulation output confirms that entropy-based site selection captures genuine evolutionary signal: mutation sites show measurably elevated entropy from the partition following their variant’s emergence, progressively transitioning into higher GMM entropy classes as the variant grows in prevalence. This transition is not instantaneous — newly emerged variants contribute only a small number of sequences relative to the reference background, and their mutation sites require one to several partitions of growth before the entropy shift becomes large enough to enter the highest-entropy class. This detection lag is an inherent property of any population-level surveillance system and reflects the time required for a novel variant to accumulate sufficient frequency to perturb the site-level amino acid distribution. Once the shift is established, it propagates as a measurable, sustained increase in Hellinger distance that is recoverable by all three change point algorithms, with each method offering a different trade-off between sensitivity and specificity as shown above.

The simulate_variant_evolution() function is designed to be a general-purpose benchmarking tool for bioinformatics researchers. The full parameter space, including number of variants, mutation counts per variant, emergence timing, growth rates, stochastic deletion events, reference variability, and random seeds allows the systematic generation of arbitrarily large scenario libraries. Researchers can use this infrastructure to stress-test detection algorithms under varying levels of signal strength, variant competition, background noise, and sampling density, and to compare the sensitivity and specificity of alternative change point methods before applying them to real surveillance data. The traceable ground truth embedded in every simulation output makes formal evaluation of detection accuracy straightforward, as demonstrated in the method comparison above.

References

Cover, Thomas M., and Joy A. Thomas. 2006. Elements of Information Theory. 2nd ed. Wiley.
Gray, Robert M. 2011. Entropy and Information Theory. Springer.
Haynes, Kaylea, Paul Fearnhead, and Idris A. Eckley. 2017. “A Computationally Efficient Nonparametric Approach for Changepoint Detection.” Statistics and Computing 27 (5): 1293–1305. https://doi.org/10.1007/s11222-016-9687-5.
Matteson, David S., and Nicholas A. James. 2014. “A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data.” Journal of the American Statistical Association 109 (505): 334–45. https://doi.org/10.1080/01621459.2013.849605.
Székely, Gábor J., Maria L. Rizzo, and Nail K. Bakirov. 2007. “Measuring and Testing Dependence by Correlation of Distances.” Annals of Statistics 35 (6): 2769–94. https://doi.org/10.1214/009053607000000505.
Vaart, Aad W. van der. 1998. Asymptotic Statistics. Cambridge University Press.