This vignette provides a comprehensive, reproducible, step-by-step
walkthrough of the ViralEntropR pipeline:
For each partition, we calculate site-wise Shannon entropies, cluster these entropy profiles using a Gaussian Mixture Model (GMM), and identify which sites fall into the highest-entropy class. We then compute Hellinger distances between partitions using the first partition as the reference. If there are 12 partitions there are 11 comparisons; if a variant emerges in partition 3, the correct change point is index 2 (difference between partition 3 and partition 1).
simulate_variant_evolution()| Parameter | Description | Default |
|---|---|---|
ref_sequences |
data.frame which must include a Date column; OR a
single sequence string. |
— |
n_ref_months |
number of months for the reference phase | 3 |
start_date |
simulation start (character or Date) | — |
end_date |
simulation end (character or Date) | — |
variants_config |
integer vector, e.g., c(2,6): Variant 1 gets 2
mutations; Variant 2 gets 6. |
— |
variant_intervals |
gap (in months) between the emergence of consecutive variants | — |
n_new_mutations |
de-novo sequences produced at a variant’s first appearance | 1 |
mutation_rate |
monthly growth multiplier per variant (scalar or per-variant vector) | 2 |
mutation_rate_variability |
fractional spread ± around mutation_rate; 0 =
deterministic growth |
0.25 |
deleterious_rate |
probability that any new mutation is deleterious | 0 |
n_deleterious_limit |
maximum sequences allowed to carry a deleterious mutation | 1 |
n_sequences_total |
total sequences generated over the mutation phase (spread evenly per month) | 140 |
ref_variability |
if TRUE, introduce variability at sites 1 and 2 of the
reference pool |
FALSE |
n_ref_sequences |
when ref_sequences is a single string, number of
reference rows to create |
100 |
prob_deletion_event |
per-month probability of injecting a block of sequences with a random deletion | 0 |
n_rows_to_delete |
sequences affected when prob_deletion_event fires |
0 |
seed |
optional random seed for reproducibility | NULL |
In this simulation, simulate_variant_evolution() models
the stepwise emergence and competition of four viral variants over a
two-year period from 2020-01-01 through
2021-12-01. We begin with a 4-month reference phase
(n_ref_months = 4) in which 200 total reference sequences
are sampled (50 per month, n_ref_sequences = 200), and
variability is introduced at sites 1 and 2
(ref_variability = TRUE). The mutation phase then generates
1,000 sequences (n_sequences_total = 1000) at 50 per month.
Four variants emerge at months 5, 9, 14 and 19, carrying 2, 3, 4 and 2
mutations respectively (variants_config = c(2,3,4,2)), with
2 de-novo sequences at each emergence (n_new_mutations = 2)
and growth multipliers of 1.25 per variant (mutation_rate).
Although deleterious_rate = 0 for variant-originated
deleterious mutations, each month has a 20% chance
(prob_deletion_event = 0.2) of injecting a block of 3
sequences with a random deletion at one site
(n_rows_to_delete = 3). Finally, seed = 2025
ensures reproducibility. The returned list includes
Simulation_Output (with per-sequence Delet
flags), Variant_Details, Simulation_Dates, and
Delet_Records for downstream analysis.
Variants compete in a pairwise fashion, always
pitting the newest active variant against its immediate predecessor (or
the original reference when only one variant has emerged). During months
before any variant appears, all sequences are drawn from the reference
pool (the wild-type row plus up to two reference-variability rows when
ref_variability = TRUE). Once Variant 1 emerges, each
month’s batch is composed of its de-novo copies plus sampled reference
sequences; when Variant 2 appears it competes directly with Variant 1,
and so on — Variant 3 competes against Variant 2, Variant 4 against
Variant 3. The fill pool (sequences that pad the
monthly quota after the two competing variants are allocated) is
composed of the reference rows plus any older non-competing variants.
Sampling weights are assigned so that all reference rows carry weight 1,
while Variant i rows carry weight 2^(i+1),
ensuring that more recently emerged variants have higher selection
probability in subsequent months. Growth is stochastic:
each variant’s monthly count is drawn from
Uniform(mult*(1 - variability), mult*(1 + variability)) so
that mutation_rate_variability > 0 introduces realistic
fluctuation.
sim_result <- simulate_variant_evolution(
ref_sequences = ref_sequence, # single sequence string
n_ref_months = 4, # 4-month reference phase
start_date = "2020-01-01",
end_date = "2021-12-01",
variants_config = c(2, 3, 4, 2), # 4 variants: 2,3,4,2 mutations
variant_intervals = c(4, 5, 5), # V2 at month 9; V3 at 14; V4 at 19
n_new_mutations = 2, # 2 de-novo sequences at emergence
mutation_rate = c(1.25, 1.25, 1.25, 1.25), # per-variant growth multipliers
mutation_rate_variability = 0, # deterministic growth (no spread)
deleterious_rate = 0, # no within-variant deleterious mutations
n_deleterious_limit = 0,
n_sequences_total = 1000, # 1000/(24-4) = 50 per period
ref_variability = TRUE, # variability at sites 1 and 2
n_ref_sequences = 200, # 200/4 = 50 reference rows per month
prob_deletion_event = 0.2, # 20% per-month deletion probability
n_rows_to_delete = 3, # 3 sequences affected if event fires
seed = 2025
)The pool returned is the fill pool from the last multi-variant competition period. Column 25 = Variant, column 26 = Phase, column 30 = sampling weight.
print("Variant emergence period:")
#> [1] "Variant emergence period:"
sim_result$Variant_Details[[1]]$em
#> [1] 5print("Variant 1 sites:")
#> [1] "Variant 1 sites:"
sim_result$Variant_Details[[1]]$pos %>% sort()
#> [1] 12 15# Month 9 translates into partition 5.
print("Variant 2 emergence period:")
#> [1] "Variant 2 emergence period:"
sim_result$Variant_Details[[2]]$em
#> [1] 9print("Variant 2 sites:")
#> [1] "Variant 2 sites:"
sim_result$Variant_Details[[2]]$pos %>% sort()
#> [1] 6 13 23# Month 14 translates into partition 7.
print("Variant 3 emergence period:")
#> [1] "Variant 3 emergence period:"
sim_result$Variant_Details[[3]]$em
#> [1] 14print("Variant 3 sites:")
#> [1] "Variant 3 sites:"
sim_result$Variant_Details[[3]]$pos %>% sort()
#> [1] 5 14 16 24sprintf("The number of deleterious periods is %d",
length(sim_result$Delet_Records))
#> [1] "The number of deleterious periods is 1"
for (i in seq_along(sim_result$Delet_Records)) {
cat(sprintf("Deleterious Period %d: month %d, site %d\n",
i,
sim_result$Delet_Records[[i]]$period,
sim_result$Delet_Records[[i]]$site))
}
#> Deleterious Period 1: month 9, site 11The sections below walk through selected time points to verify that
the simulation is behaving as expected. Each month’s batch contains 50
sequences (reference + variant). Key columns to watch are the
mutated sites (which will show a non-reference amino
acid for variant sequences), Variant (which strain each
row belongs to), Phase (R = reference
wild-type, RV = reference with added site-1/2 variability,
M1/M2… = mutation phase), and
Delet (flags sequences affected by a random deletion
event).
During the reference phase only wild-type sequences are present.
Phase RV denotes reference sequences with low-level
variability artificially introduced at sites 1 and 2
(ref_variability = TRUE) to mimic the background noise
typical of real surveillance data.
The final reference month before any variant emerges. All sequences
should be Reference strain, with Phase R or
RV. This establishes the baseline amino acid distribution
against which downstream Hellinger distances will be measured.
Variant 1 Phase 1 (month 5): sites 12, 15Variant 1 emerges with 2 de-novo sequences
(n_new_mutations = 2). The mutated sites carry a different
amino acid to the reference. The remaining ~48 sequences are sampled
from the reference pool. Notice that the variant sequences are a very
small minority at this point — this is the detection challenge:
identifying emergence signal against a large reference background.
Reference rows can be either R or
RV.
sim_result$Simulation_Output[
sim_result$Simulation_Output$Period == 5,
c(sim_result$Variant_Details[[1]]$pos %>% sort(), 25:29)
][1:10, ]Variant 1 Phase 2 (month 6): sites 12, 15After one month of growth at rate 1.25, the number of Variant 1 sequences increases to 3. The variant is still a small fraction of the population, but its mutated sites are accumulating signal in the entropy profile.
sim_result$Simulation_Output[
sim_result$Simulation_Output$Period == 6,
c(sim_result$Variant_Details[[1]]$pos %>% sort(), 25:29)
][1:10, ]Variant 1 Phase 4 (month 8): sites 12, 15After three months of compounding growth
(ceiling(2 * 1.25^3)), Variant 1 now contributes 5
sequences. Its mutated sites are now consistently different from the
reference, and site entropy at those positions is rising — setting the
stage for detection in the next partition.
sim_result$Simulation_Output[
sim_result$Simulation_Output$Period == 8,
c(sim_result$Variant_Details[[1]]$pos %>% sort(), 25:29)
][1:10, ]Variant 2 Phase 1 (month 9): sites 6, 13, 23Variant 2 emerges with 2 de-novo sequences. Variants 1 and 2 are now
in direct pairwise competition. The remainder of the 50 sequences is
sampled from the fill pool (reference rows only at this stage, since
there are no older non-competing variants yet). Note that a random
deletion event also fires this month at site 11 — those affected rows
have Delet = "Yes" and carry a randomly chosen amino acid
substitution at that site.
Variant 2 Phase 5 (month 13): sites 6, 13, 23By month 13, both Variant 1 and Variant 2 have been growing for several months and together dominate the monthly batch. The reference pool contributes only the remaining fill sequences. The population is now clearly multi-strain, with entropy elevated at the mutation sites of both variants.
Variant 3 Phase 1 (month 14): sites 5, 14, 16, 24Variant 3 emerges and begins competing against Variant 2. Variant 1
is now relegated to the fill pool with an exponentially increasing
sampling weight (2^(1+1) = 4), meaning it still appears but
at lower frequency than the two active competitors.
toy_example <- sim_result$Simulation_Output
n_col <- nchar(ref_sequence)
colnames(toy_example)[seq_len(n_col)] <- as.character(seq_len(n_col))
# Encode character sequences to integers (A=1 … V=20, ambiguous 21–25, unknown=0)
char_mat <- as.matrix(toy_example[, seq_len(n_col)])
AL_Cat_mix <- as.data.frame(encode_aa_sequence(char_mat))
AL_Cat_mix[] <- lapply(AL_Cat_mix, as.integer)
# Standardise Date column to the first day of each month
AL_Cat_mix$Date <- as.Date(format(as.Date(toy_example$Date), "%Y-%m-01"))24 total months with a 4-month reference phase and 2-month non-overlapping windows give 12 partitions.
How partitioning works: Each partition contains all sequences observed within its time window. For each partition,
partition_time_windows()computes the Shannon entropy at every site and then callscluster_sites_by_entropy()to group sites into entropy classes via a Gaussian Mixture Model (GMM). The resultingDataFramehas three columns:
sites– the site index (position in the sequence)entropies– the Shannon entropy at that site for this partitionclass– the raw GMM cluster label assigned by MclustImportant: At this stage
classcarries raw Mclust labels ordered by ascending component mean — class 1 corresponds to the lowest-entropy group and the highest class integer corresponds to the highest-entropy group. Functionrelabel_entropy_classes()inverts this convention so that class 1 consistently denotes the highest-entropy group, making cross-partition comparison of class labels unambiguous. Sites with zero entropy (invariant across all sequences in the window) are excluded from GMM clustering and will not appear in the DataFrame.What to watch for: As variants emerge and grow, their mutation sites should accumulate increasing entropy across successive partitions, eventually clustering into the highest-entropy group. The sentinel class
999indicates all remaining sites have identical entropy (one undifferentiated group).
part_data <- partition_time_windows(
data = AL_Cat_mix,
n_sites = n_col,
window_length = 2,
window_type = 3, # non-overlapping / jumping
start_date = "2020-01-01",
end_date = "2022-01-01"
)During the reference phase all sites are invariant or near-invariant – no variant signal is present. Most sites are excluded (zero entropy) or form a single low-entropy group.
Variant 1 emerged in month 5 (May 2020). After two months of growth its mutation sites should begin to show elevated entropy relative to other sites. Watch for the Variant 1 mutation sites appearing with higher class labels.
Variant 2 emerged in month 9 (Sep 2020). Both Variant 1 and Variant 2 mutation sites should now be detectable as high-entropy clusters.
Variant 3 emerged in month 14 (Jan 2021). Its mutation sites (5, 14, 16, 24) should begin appearing as a new elevated-entropy cluster alongside the existing variant sites.
Variant 4 emerged in month 19 (Jul 2021). Its mutation sites (7, 22) add to the high-entropy pool. With four active variants, the entropy landscape is now complex and multi-modal.
The Hellinger distance quantifies the shift in amino acid distribution at each site between a reference time point (partition T1) and each subsequent partition. A distance of 0 means the distribution is identical to the reference; a distance approaching 1 (when normalized) means the distribution has changed maximally. Sites where a new variant has become prevalent will show large, sustained Hellinger distances after the variant’s emergence partition.
This is the key signal used by the change point detector: a sudden, sustained increase in Hellinger distance at a set of sites is the signature of a new variant entering the population.
The matrix has one row per site and one column per partition comparison (T2 vs T1, T3 vs T1, …, T12 vs T1). Rows corresponding to Variant 1 mutation sites (12, 15) should show a step-change in Hellinger distance starting around column T3 (the partition covering Variant 1’s first full window of growth).
dim(hellinger_mat)
#> [1] 24 11
hellinger_mat
#> T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12
#> 1 0 0.10552099 0.02029592 0.3413535 0.1638659 0.1213214 0.05721778 0.03928696 0.00000000 0.1055210 0.1471665
#> 2 0 0.04610506 0.05945211 0.3433336 0.2362383 0.2097724 0.13660090 0.02415490 0.08537268 0.0000000 0.1011473
#> 3 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 4 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 5 0 0.00000000 0.00000000 0.0000000 0.0000000 0.1417780 0.26696413 0.35190012 0.47155182 0.6125294 0.8022057
#> 6 0 0.00000000 0.00000000 0.2250358 0.3035154 0.4086193 0.53962881 0.70002365 0.65207241 0.5287106 0.4346527
#> 7 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.22503584 0.3035154 0.4086193
#> 8 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 9 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 10 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 11 0 0.00000000 0.00000000 0.1738633 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 12 0 0.22503584 0.30351540 0.4086193 0.5396288 0.6810072 0.64229994 0.52871065 0.44721360 0.4217977 0.2468361
#> 13 0 0.00000000 0.00000000 0.2250358 0.3035154 0.4086193 0.53962881 0.70002365 0.65207241 0.5287106 0.4346527
#> 14 0 0.00000000 0.00000000 0.0000000 0.0000000 0.1417780 0.26696413 0.35190012 0.47155182 0.6125294 0.8022057
#> 15 0 0.22503584 0.30351540 0.4086193 0.5396288 0.6810072 0.64229994 0.52871065 0.44721360 0.4217977 0.2468361
#> 16 0 0.00000000 0.00000000 0.0000000 0.0000000 0.1417780 0.26696413 0.35190012 0.47155182 0.6125294 0.8022057
#> 17 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 18 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 19 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 20 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 21 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
#> 22 0 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.22503584 0.3035154 0.4086193
#> 23 0 0.00000000 0.00000000 0.2250358 0.3035154 0.4086193 0.53962881 0.70002365 0.65207241 0.5287106 0.4346527
#> 24 0 0.00000000 0.00000000 0.0000000 0.0000000 0.1417780 0.26696413 0.35190012 0.47155182 0.6125294 0.8022057The change point algorithms expect a time x features matrix – one row per time point, one column per site. We transpose the Hellinger matrix to obtain this format. Each row is now a multivariate observation: the full vector of Hellinger distances across all 24 sites at that partition comparison. A change point corresponds to a partition where this vector suddenly shifts.
dat_mat_t <- t(hellinger_mat)
dat_mat_t
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#> T2 0.00000000 0.00000000 0 0 0.0000000 0.0000000 0.0000000 0 0 0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> T3 0.10552099 0.04610506 0 0 0.0000000 0.0000000 0.0000000 0 0 0 0.0000000 0.2250358 0.0000000 0.0000000 0.2250358
#> T4 0.02029592 0.05945211 0 0 0.0000000 0.0000000 0.0000000 0 0 0 0.0000000 0.3035154 0.0000000 0.0000000 0.3035154
#> T5 0.34135354 0.34333357 0 0 0.0000000 0.2250358 0.0000000 0 0 0 0.1738633 0.4086193 0.2250358 0.0000000 0.4086193
#> T6 0.16386589 0.23623829 0 0 0.0000000 0.3035154 0.0000000 0 0 0 0.0000000 0.5396288 0.3035154 0.0000000 0.5396288
#> T7 0.12132139 0.20977239 0 0 0.1417780 0.4086193 0.0000000 0 0 0 0.0000000 0.6810072 0.4086193 0.1417780 0.6810072
#> T8 0.05721778 0.13660090 0 0 0.2669641 0.5396288 0.0000000 0 0 0 0.0000000 0.6422999 0.5396288 0.2669641 0.6422999
#> T9 0.03928696 0.02415490 0 0 0.3519001 0.7000237 0.0000000 0 0 0 0.0000000 0.5287106 0.7000237 0.3519001 0.5287106
#> T10 0.00000000 0.08537268 0 0 0.4715518 0.6520724 0.2250358 0 0 0 0.0000000 0.4472136 0.6520724 0.4715518 0.4472136
#> T11 0.10552099 0.00000000 0 0 0.6125294 0.5287106 0.3035154 0 0 0 0.0000000 0.4217977 0.5287106 0.6125294 0.4217977
#> T12 0.14716651 0.10114733 0 0 0.8022057 0.4346527 0.4086193 0 0 0 0.0000000 0.2468361 0.4346527 0.8022057 0.2468361
#> 16 17 18 19 20 21 22 23 24
#> T2 0.0000000 0 0 0 0 0 0.0000000 0.0000000 0.0000000
#> T3 0.0000000 0 0 0 0 0 0.0000000 0.0000000 0.0000000
#> T4 0.0000000 0 0 0 0 0 0.0000000 0.0000000 0.0000000
#> T5 0.0000000 0 0 0 0 0 0.0000000 0.2250358 0.0000000
#> T6 0.0000000 0 0 0 0 0 0.0000000 0.3035154 0.0000000
#> T7 0.1417780 0 0 0 0 0 0.0000000 0.4086193 0.1417780
#> T8 0.2669641 0 0 0 0 0 0.0000000 0.5396288 0.2669641
#> T9 0.3519001 0 0 0 0 0 0.0000000 0.7000237 0.3519001
#> T10 0.4715518 0 0 0 0 0 0.2250358 0.6520724 0.4715518
#> T11 0.6125294 0 0 0 0 0 0.3035154 0.5287106 0.6125294
#> T12 0.8022057 0 0 0 0 0 0.4086193 0.4346527 0.8022057We apply three complementary Energy Change Point (ECP) methods to
dat_mat_t. All three operate on the same transposed
Hellinger matrix but differ in their algorithmic approach. A
change point here corresponds to a partition index at
which the multivariate Hellinger distance vector shifts significantly –
i.e., a partition where new mutation sites suddenly become highly
variable, indicating a new variant has entered the population.
Ground truth for comparison: The four variants emerged at months 5, 9, 14 and 19, which map to partition indices 3, 5, 7 and 10 respectively (using 2-month non-overlapping windows). The correct change points in the Hellinger series are therefore at indices 2, 4, 6 and 9 (one step behind, since Hellinger at T_k measures the shift relative to T1, so the change appears one index before the partition where the variant is first detectable).
ks.cp3o applied to all availabe dataks.cp3o uses dynamic programming to find the globally
optimal segmentation up to K change points. It is
deterministic and exact but requires specifying K (maximum
number of change points) in advance. minsize = 2 ensures
each segment contains at least 2 time points. This method is best used
when the approximate number of variants is known.
| Parameter | Description | Default |
|---|---|---|
Z |
A T x d matrix: T time points,
d features (sites). |
– |
K |
Maximum number of change points to detect. | 1 |
minsize |
Minimum segment length (number of time points per segment). | 30 |
verbose |
Print status updates during optimisation. | FALSE |
cp_m1 <- ks.cp3o(
Z = dat_mat_t,
K = 11, # upper bound: at most 11 CPs in 11 partition differences
minsize = 2, # each segment must span at least 2 partitions
verbose = FALSE
)
print("Optimal change point locations (cp_m1$estimates):")
#> [1] "Optimal change point locations (cp_m1$estimates):"
cp_m1$estimates
#> [1] 4 6 8 10
print("Full solution matrix across k = 1 to K change points (cp_m1$cpLoc):")
#> [1] "Full solution matrix across k = 1 to K change points (cp_m1$cpLoc):"
cp_m1$cpLoc
#> [[1]]
#> [1] 4
#>
#> [[2]]
#> [1] 4 7
#>
#> [[3]]
#> [1] 4 7 10
#>
#> [[4]]
#> [1] 4 6 8 10
#>
#> [[5]]
#> [1] 1 4 6 8 10detect_changepoints_ecpThis method applies ks.cp3o repeatedly with an expanding
window: starting from the first 3 partition differences and adding one
time point at a time. It simulates a real-time surveillance scenario
where data accumulates sequentially and you want to detect a change
point as early as possible, without knowing how many variants will
ultimately emerge. dynamic_k = TRUE sets K
adaptively to nrow(window) - 2 at each step.
lower <- 1
upper <- 3
timesteps <- nrow(dat_mat_t) - upper
ECP_res <- detect_changepoints_ecp(
data_matrix = dat_mat_t,
min_window = lower,
max_window = upper,
n_timesteps = timesteps,
rolling_window = FALSE,
dynamic_k = TRUE,
minsize = 2,
verbose = FALSE
)
# Union of all change points detected across all window sizes.
# Indicates which true change points the algorithm recovers
# at any stage of data accumulation (overall sensitivity).
ecp_indices <- ECP_res$ECP_est_list %>%
unlist() %>%
.[. > 1] %>%
unique() %>%
sort()
print("Change points detected at any window size (union):")
#> [1] "Change points detected at any window size (union):"
ecp_indices
#> [1] 3 4 5 6 7 8 10
# Early detection timeline: for each expanding window size, which true
# change points are detectable with the data available up to that point?
# Ground truth: variants emerged at partition indices 3, 5, 7, 10,
# corresponding to Hellinger change point indices 2, 4, 6, 9.
true_cps <- c(2, 4, 6, 9)
detection_timeline <- data.frame(
window_size = seq(upper, nrow(dat_mat_t)),
detected_cps = sapply(ECP_res$ECP_est_list, function(cps) {
detected <- intersect(true_cps, cps[cps > 1])
if (length(detected) == 0L) "none" else paste(sort(detected), collapse = ", ")
})
)
print("Early detection timeline — true change points recovered per window size:")
#> [1] "Early detection timeline — true change points recovered per window size:"
detection_timelinee.aggloe.agglo applies agglomerative hierarchical clustering
directly to the multivariate time series, merging adjacent segments
iteratively until a global goodness-of-fit criterion is maximised.
Unlike ks.cp3o, it does not require the number of change
points K to be specified in advance, making it well suited
to surveillance settings where the number of emerging variants is
unknown. Setting penalty = function(cps) 0 imposes no
regularisation penalty on model complexity, allowing the segmentation to
be driven entirely by the energy distance signal in the data.
The cluster output assigns each time point to a segment,
providing an intuitive segmentation of the surveillance timeline that
can be directly compared to known variant emergence periods.
| Parameter | Description | Default |
|---|---|---|
X |
A T x d matrix: T time points,
d features (sites). |
– |
member |
Initial segment membership vector (one integer per time point). | 1:nrow(X) |
alpha |
Moment index for the energy distance metric (1 = first moment). | 1 |
penalty |
Penalty function applied to the goodness-of-fit statistic. | – |
cp_m3 <- e.agglo(
X = dat_mat_t,
member = seq_len(nrow(dat_mat_t)), # naive: treat each time point as its own segment
alpha = 1,
penalty = function(cps) { 0 } # no penalty: let the data drive the segmentation
)
print("Change points estimated by method 3 (naive, no penalty):")
#> [1] "Change points estimated by method 3 (naive, no penalty):"
cp_m3$estimates
#> [1] 1 4 9 12
print("Segment membership per time point:")
#> [1] "Segment membership per time point:"
cp_m3$cluster
#> [1] 1 1 1 2 2 2 2 2 3 3 3Overall, no single method recovers all four true change points cleanly, which is expected given the simulation design: stochastic deletion events and the exponential growth competition model create a complex, heterogeneous signal. Method 3 is notable for detecting the latest true change point (9) which the other methods miss, while Method 1 provides the cleanest intermediate detection (4 and 6) and identifies latest change point (9) one period later (10).
Method 2 detects indices 4 and 6 but only once a sufficient number of
time points have accumulated in the expanding window — a direct
consequence of sample size. This lag is not a failure of the algorithm
but an inherent property of prospective surveillance: reliable change
point detection requires sufficient sample mass to warrant a shift in
the multivariate distribution. A formal analysis of the relationship
between sample size and detection reliability, including scenario
comparisons across varying sequence counts and variant growth rates, is
provided in analysis/Sample_Size_Simulation_Study/.
This vignette demonstrates the complete ViralEntropR
detection pipeline on a fully controlled, traceable example. By
constructing a synthetic evolutionary timeline with known variant
emergence points, mutation sites, and stochastic perturbations, we
establish a clear ground truth against which all pipeline components can
be evaluated directly. The simulation output confirms that entropy-based
site selection captures genuine evolutionary signal: mutation sites show
measurably elevated entropy from the partition following their variant’s
emergence, progressively transitioning into higher GMM entropy classes
as the variant grows in prevalence. This transition is not instantaneous
— newly emerged variants contribute only a small number of sequences
relative to the reference background, and their mutation sites require
one to several partitions of growth before the entropy shift becomes
large enough to enter the highest-entropy class. This detection lag is
an inherent property of any population-level surveillance system and
reflects the time required for a novel variant to accumulate sufficient
frequency to perturb the site-level amino acid distribution. Once the
shift is established, it propagates as a measurable, sustained increase
in Hellinger distance that is recoverable by all three change point
algorithms, with each method offering a different trade-off between
sensitivity and specificity as shown above.
The simulate_variant_evolution() function is designed to
be a general-purpose benchmarking tool for bioinformatics researchers.
The full parameter space, including number of variants, mutation counts
per variant, emergence timing, growth rates, stochastic deletion events,
reference variability, and random seeds allows the systematic generation
of arbitrarily large scenario libraries. Researchers can use this
infrastructure to stress-test detection algorithms under varying levels
of signal strength, variant competition, background noise, and sampling
density, and to compare the sensitivity and specificity of alternative
change point methods before applying them to real surveillance data. The
traceable ground truth embedded in every simulation output makes formal
evaluation of detection accuracy straightforward, as demonstrated in the
method comparison above.