This vignette demonstrates the complete ViralEntropR
preprocessing pipeline for real-world SARS-CoV-2 amino acid sequence
data downloaded from the NCBI SARS-CoV-2 Data Hub. Starting from a raw
FASTA file, the pipeline produces a clean, integer-encoded data frame
with a Date column ready for
partition_time_windows() and all downstream entropy and
Hellinger distance analyses.
The pipeline consists of five steps:
Reproducibility note: The full 137,132-sequence dataset (~181.5 MB) is archived on Zenodo (DOI:
10.5281/zenodo.19040165).
The complete NCBI SARS-CoV-2 Spike protein dataset (137,132
sequences, ~181.5 MB) is archived on Zenodo under DOI 10.5281/zenodo.19040165.
Download sequences.fasta manually from the Zenodo record
and update the path in the code chunk below to match your local
copy.
fasta <- Biostrings::readAAStringSet(fasta_path)
sprintf("Total sequences loaded: %d", length(fasta))
#> [1] "Total sequences loaded: 137132"
sprintf("Unique sequences: %d", length(unique(fasta)))
#> [1] "Unique sequences: 17856"Inspect a sample of sequence headers to understand the metadata layout before extraction. NCBI Virus exports headers in a pipe-delimited format:
Accession | Country | Collection_Date |
# --- Overview of all 12 variants -------------------------------------------
data.frame(
WHO_Label = unlist(sarscov2_variants$WHO_Label),
Pango_Lineage = unlist(sarscov2_variants$Pango_Lineage),
GISAID_Clade = sarscov2_variants$GISAID_Clade,
Nextstrain = sarscov2_variants$Nextstrain_Clade,
First_Detected = sarscov2_variants$Date_First_Detected,
First_US = sarscov2_variants$Date_First_Detected_US
)# sarscov2_sample.fasta.gz contains 100 random sequences from the full NCBI dataset.
# Useful for testing pipeline code without downloading the full ~181.5 MB file.
path_sample <- system.file("extdata", "sarscov2_sample.fasta.gz",
package = "ViralEntropR")
fasta_sample <- Biostrings::readAAStringSet(path_sample)
sprintf("Sample sequences: %d", length(fasta_sample))
#> [1] "Sample sequences: 100"
# check NAMES structure
sample(fasta_sample@ranges@NAMES,1)
#> [1] "QWK63798.1 |USA|2021-05-17"
# Countries represented in the sample
sort(table(extract_fasta_countries(fasta_sample, position = 2)$countries),
decreasing = TRUE)
#>
#> USA Australia India New Zealand Bangladesh Chile Germany Saudi Arabia
#> 84 7 3 2 1 1 1 1All sequences in this dataset are 1,273 amino acids in length. No length filtering is required because the NCBI download was configured to retrieve only complete Spike protein sequences, which naturally match the canonical reference.
Why 1,273 amino acids? The SARS-CoV-2 Spike glycoprotein reference sequence (YP_009724390) is 1,273 amino acids long. This length corresponds to the full Spike precursor protein, comprising the signal peptide, the S1 subunit (receptor binding domain, N-terminal domain, and furin cleavage site), and the S2 subunit (fusion peptide, heptad repeats, transmembrane domain, and cytoplasmic tail). This fixed-length representation enforces positional homology by construction, allowing site indices to be interpreted consistently across sequences.
We first attempt full-precision date extraction
(yyyy-mm-dd) to identify sequences that only carry
year-month (yyyy-mm) or year-only (yyyy) dates
– both common in NCBI submissions.
# Pass 1: attempt full yyyy-mm-dd extraction
dates_ymd <- extract_fasta_dates(
fasta,
custom_pattern = "(?<=\\|)[0-9]{4}-(0?[1-9]|1[0-2])-(0?[1-9]|[12][0-9]|3[01]|00)"
)
sprintf("Sequences with full yyyy-mm-dd dates: %d",
sum(!is.na(dates_ymd$raw_date_strings)))
#> [1] "Sequences with full yyyy-mm-dd dates: 133864"
sprintf("Sequences missing full date: %d",
sum(is.na(dates_ymd$raw_date_strings)))
#> [1] "Sequences missing full date: 3268"# Inspect headers of sequences that lack a full yyyy-mm-dd date
if (length(dates_ymd$missing_id) > 0) {
cat("Sample headers with partial/missing dates:\n")
cat(paste(head(names(fasta)[dates_ymd$missing_id], 5), collapse = "\n"))
}
#> Sample headers with partial/missing dates:
#> YP_009724390.1 |China|2019-12
#> UBX21225.1 |USA|2020-09
#> UBX21237.1 |USA|2020-09
#> UBX21249.1 |USA|2020-09
#> UBX21261.1 |USA|2020-09For monthly partitioning, yyyy-mm precision is
sufficient and more inclusive than requiring the full day. We re-extract
dates in yyyy-mm format. Any sequence that only has a year
(yyyy) with no month will still fail and must be
removed.
# Pass 2: extract yyyy-mm (covers both yyyy-mm-dd and yyyy-mm submissions)
dates_result <- extract_fasta_dates(
fasta,
custom_pattern = "(?<=\\|)[0-9]{4}-(0[1-9]|1[0-2])",
date_format = "%Y-%m"
)
sprintf("Sequences with parseable yyyy-mm dates: %d",
sum(!is.na(dates_result$raw_date_strings)))
#> [1] "Sequences with parseable yyyy-mm dates: 136705"
sprintf("Sequences still missing a date: %d",
sum(is.na(dates_result$raw_date_strings)))
#> [1] "Sequences still missing a date: 427"# Sequences with year-only dates -- these cannot be placed in a monthly window
if (length(dates_result$missing_id) > 0) {
cat("Headers with year-only dates (to be removed):\n")
cat(paste(head(names(fasta)[dates_result$missing_id], 5), collapse = "\n"))
}
#> Headers with year-only dates (to be removed):
#> UAU30010.1 |USA|2021
#> QZR93642.1 ||
#> QZR93654.1 ||
#> QZR93666.1 ||
#> QZR93678.1 ||# Remove sequences with year-only dates
if (length(dates_result$missing_id) > 0) {
fasta <- fasta[-dates_result$missing_id]
dates_result <- extract_fasta_dates(
fasta,
custom_pattern = "(?<=\\|)[0-9]{4}-(0[1-9]|1[0-2])",
date_format = "%Y-%m"
)
}
cat("Date extraction status after removal:", dates_result$message, "\n")
#> Date extraction status after removal: All date strings have been successfully extracted
sprintf("Sequences remaining: %d", length(fasta))
#> [1] "Sequences remaining: 136705"Visualising the date distribution helps verify that the data spans the expected surveillance window and that no anomalous date clusters are present.
dates_df_before <- data.frame(
Date = as.Date(dates_result$corrected_dates)
) %>%
mutate(YearMonth = format(Date, "%Y-%m"))
ggplot(dates_df_before, aes(x = YearMonth)) +
geom_bar(stat = "count", fill = "steelblue") +
theme_bw() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 8),
plot.title = element_text(hjust = 0.5)
) +
labs(
x = "Collection Month",
y = "Number of Sequences",
title = "Sequence Count by Collection Month (before ambiguous residue filtering)"
)# Extract country field (position 2 = between first and second pipe)
countries_result <- extract_fasta_countries(fasta, position = 2)
sprintf("Sequences with parseable country: %d",
sum(!is.na(countries_result$countries)))
#> [1] "Sequences with parseable country: 136702"
sprintf("Sequences with missing country: %d",
sum(is.na(countries_result$countries)))
#> [1] "Sequences with missing country: 3"# Inspect headers with missing countries before removing
if (length(countries_result$missing_id) > 0) {
cat("Sample headers with missing country field:\n")
cat(paste(head(names(fasta)[countries_result$missing_id], 5), collapse = "\n"))
# Remove sequences with missing countries
fasta <- fasta[-countries_result$missing_id]
countries_result <- extract_fasta_countries(fasta, position = 2)
}
#> Sample headers with missing country field:
#> BCM16162.1 ||2020-03
#> BCA87361.1 ||2020-02-10
#> BCA87371.1 ||2020-02-10
cat("Country extraction status after removal:", countries_result$message, "\n")
#> Country extraction status after removal: All countries have been extracted
sprintf("Sequences remaining: %d", length(fasta))
#> [1] "Sequences remaining: 136702"# Re-extract dates after removing sequences with missing countries
dates_result <- extract_fasta_dates(
fasta,
custom_pattern = "(?<=\\|)[0-9]{4}-(0[1-9]|1[0-2])",
date_format = "%Y-%m"
)
# Confirm no remaining missing values in either field
sprintf("Missing dates remaining: %d", sum(is.na(dates_result$corrected_dates)))
#> [1] "Missing dates remaining: 0"
sprintf("Missing countries remaining: %d", sum(is.na(countries_result$countries)))
#> [1] "Missing countries remaining: 0"# Country distribution -- sequences per country of collection
countries_result$countries %>%
table() %>%
sort(decreasing = TRUE)
#> .
#> USA Australia India Egypt Mexico Japan New Zealand Bangladesh
#> 119385 9571 1132 896 891 609 583 438
#> Chile Pakistan Hong Kong China Bahrain Poland Serbia Saudi Arabia
#> 290 229 203 167 147 147 146 144
#> Israel Ghana Peru Greece France Italy Germany Spain
#> 119 116 104 97 89 89 81 61
#> West Bank Sierra Leone Tunisia Turkey Brazil Djibouti Taiwan Iraq
#> 58 56 56 51 50 50 48 47
#> Argentina South Korea Kenya Myanmar Philippines Russia Czech Republic Jordan
#> 34 31 30 26 24 24 23 22
#> Portugal Venezuela Georgia Thailand Netherlands Austria Uruguay Uzbekistan
#> 22 22 21 20 18 15 14 14
#> Denmark United Kingdom Benin Iran Morocco Uganda Finland Libya
#> 13 13 12 12 12 12 11 11
#> Guatemala Jamaica Malaysia Timor-Leste Ethiopia Gambia Canada Kazakhstan
#> 10 8 8 8 6 6 5 5
#> Sri Lanka Belize Ecuador Lebanon Malta Colombia Guinea Mali
#> 5 4 4 4 4 2 2 2
#> Nigeria Viet Nam Belarus Belgium Cambodia Nepal South Africa Sweden
#> 2 2 1 1 1 1 1 1
#> Switzerland Togo Zambia
#> 1 1 1fasta_to_char_matrix() converts the
AAStringSet object to a plain R character matrix (rows =
sequences, columns = sites). This fully vectorized implementation is
substantially faster than row-by-row strsplit approaches on
large datasets.
char_mat <- fasta_to_char_matrix(fasta)
sprintf("Character matrix dimensions: %d sequences x %d sites",
nrow(char_mat), ncol(char_mat))
#> [1] "Character matrix dimensions: 136702 sequences x 1273 sites"
# Inspect a small region
char_mat[1:5, 1:10]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] "M" "F" "V" "F" "L" "V" "L" "L" "P" "L"
#> [2,] "M" "F" "V" "F" "L" "V" "L" "L" "P" "L"
#> [3,] "M" "F" "V" "F" "L" "V" "L" "L" "P" "L"
#> [4,] "M" "F" "V" "F" "L" "V" "L" "L" "P" "L"
#> [5,] "M" "F" "V" "F" "L" "V" "L" "L" "P" "L"Ambiguous amino acid codes (B, X, Z) and unrecognised characters
(encoded as 0) can distort entropy and Hellinger distance calculations
by inflating apparent diversity at a site.
filter_ambiguous_sequences() removes any sequence
containing at least one such residue.
option = 2 specifies a character matrix input (as
returned by fasta_to_char_matrix()).
filtered <- filter_ambiguous_sequences(char_mat, option = 2)
cat(filtered$OriginalDim, "\n")
#> Number of original sequences is 136702
cat(filtered$NewDim, "\n")
#> Number of filtered sequences is 125130
cat(filtered$NumberAmbiguous, "\n")
#> Number of sequences containing at least one of B, X, Z or J characters is 11572
cat(filtered$RangeAmbiguous, "\n")
#> Number of ambiguous protein characters per sequence varies between 1 and 287# Extract the clean matrix and align metadata to retained rows
clean_char_mat <- filtered$FilteredMatrix
deleted_rows <- filtered$DeletedSeqId
# Remove deleted rows from date and country vectors
corrected_dates <- dates_result$corrected_dates[-deleted_rows]
countries <- countries_result$countries[-deleted_rows]
sprintf("Clean matrix: %d sequences x %d sites",
nrow(clean_char_mat), ncol(clean_char_mat))
#> [1] "Clean matrix: 125130 sequences x 1273 sites"encode_aa_sequence() maps each character to its integer
code using the standard 25-character ViralEntropR alphabet.
After ambiguous residue filtering only codes 1–20 (the 20 standard amino
acids) should remain.
| Codes | Characters | Meaning |
|---|---|---|
| 1–20 | A R N D C Q E G H I L K M F P S T W Y V | Standard amino acids |
| 21 | B | Asp/Asn ambiguity |
| 22 | Z | Glu/Gln ambiguity |
| 23 | X | Any amino acid (unknown) |
| 24 | * | Stop codon |
| 25 | - | Gap |
| 0 | (other) | Unrecognised character |
We combine the integer matrix with the aligned date and country
metadata into a single data frame, standardise dates to the first of
each month for monthly partitioning, and sort by date. This object is
passed directly to partition_time_windows().
n_sites <- ncol(int_mat)
# Build the data frame
AL_df <- as.data.frame(int_mat)
colnames(AL_df) <- as.character(seq_len(n_sites))
AL_df[] <- lapply(AL_df, as.integer)
# Standardise to first of each month for consistent monthly partitioning
AL_df$Date <- as.Date(format(corrected_dates, "%Y-%m-01"))
AL_df$Country <- countries
# Sort by date -- required by partition_time_windows()
AL_df <- AL_df[order(AL_df$Date), ]
rownames(AL_df) <- NULL
sprintf("Final data frame: %d sequences x %d site columns + Date + Country",
nrow(AL_df), n_sites)
#> [1] "Final data frame: 125130 sequences x 1273 site columns + Date + Country"# Country distribution in the clean final dataset
AL_df$Country %>% table() %>% sort(decreasing = TRUE)
#> .
#> USA Australia India Mexico Egypt Japan New Zealand Bangladesh
#> 109536 8473 1007 869 797 563 551 424
#> Chile Pakistan Hong Kong China Serbia Poland Saudi Arabia Bahrain
#> 286 218 187 164 144 140 133 128
#> Israel Greece Peru France Italy Germany Spain West Bank
#> 118 97 91 89 82 75 58 54
#> Sierra Leone Tunisia Djibouti Brazil Taiwan Iraq Turkey Argentina
#> 52 52 50 48 48 43 43 33
#> South Korea Ghana Myanmar Philippines Russia Jordan Venezuela Georgia
#> 31 26 26 24 24 22 22 21
#> Czech Republic Thailand Portugal Netherlands Kenya Austria Uruguay Benin
#> 20 20 19 17 16 15 14 12
#> Iran Morocco Finland Libya Guatemala Uzbekistan Jamaica United Kingdom
#> 12 12 11 11 10 9 8 8
#> Gambia Kazakhstan Malaysia Canada Denmark Ecuador Malta Sri Lanka
#> 6 5 5 4 4 4 4 4
#> Belize Ethiopia Lebanon Colombia Guinea Mali Nigeria Timor-Leste
#> 3 3 3 2 2 2 2 2
#> Viet Nam Belarus Belgium Cambodia Nepal South Africa Sweden Switzerland
#> 2 1 1 1 1 1 1 1
#> Togo Uganda Zambia
#> 1 1 1Compare the final date distribution against the earlier plot to confirm that the filtering steps have not introduced systematic temporal bias.
ggplot(AL_df, aes(x = format(Date, "%Y-%m"))) +
geom_bar(stat = "count", fill = "steelblue") +
theme_bw() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 8),
plot.title = element_text(hjust = 0.5)
) +
labs(
x = "Collection Month",
y = "Number of Sequences",
title = "Sequence Count by Collection Month (clean dataset, ready for analysis)"
)The data frame AL_df is now ready for the full
ViralEntropR pipeline. The site columns contain
integer-encoded amino acids (1–20), Date contains
first-of-month dates for monthly partitioning, and Country
is available for geographic stratification.
# Example: partition into non-overlapping 2-month windows
part_data <- partition_time_windows(
data = AL_df,
n_sites = n_sites,
window_length = 2,
window_type = 3, # non-overlapping / jumping windows
verbose = FALSE
)
sprintf("Number of 2-month partitions created: %d", part_data$N_partitions)
#> [1] "Number of 2-month partitions created: 10"
cat("\nPartition date labels:\n")
#>
#> Partition date labels:
cat(paste(part_data$Dates_Labels, collapse = "\n"))
#> Dec-2019 - Jan-2020
#> Feb-2020 - Mar-2020
#> Apr-2020 - May-2020
#> Jun-2020 - Jul-2020
#> Aug-2020 - Sep-2020
#> Oct-2020 - Nov-2020
#> Dec-2020 - Jan-2021
#> Feb-2021 - Mar-2021
#> Apr-2021 - May-2021
#> Jun-2021 - Jul-2021# Entropy-based site clustering for the first partition
cat("\nSite clustering for partition 1 (reference period):\n")
#>
#> Site clustering for partition 1 (reference period):
part_data$Clusters[[1]]$DataFrameAs the output shows, entropy-based analysis of early pandemic sequences already flags several sites that subsequently proved critical across multiple VOCs and VOIs:
Site 95 (T95I) – located in the N-terminal domain (NTD) of the Spike protein. The T95I substitution alters the topology of the NTD neutralization “supersite”, reducing antibody recognition, and is associated with increased viral load. It is a defining mutation of the Iota (B.1.526) variant and also appears in Delta and Kappa (Magazine et al. 2022).
Site 655 (H655Y) – located immediately upstream of the furin cleavage site at the S1/S2 boundary. The H655Y substitution enhances spike protein cleavage efficiency and viral replication, and has been described as a “gateway mutation” that promotes fitness in emerging lineages. It is a defining feature of Gamma (P.1) and Omicron (Escalera et al. 2022).
Site 677 (Q677H/P) – situated only three residues upstream of the polybasic furin cleavage site (RRAR). Mutations at this position are thought to promote conformational changes that enhance accessibility of the S1/S2 junction to proteolytic cleavage, increasing viral entry efficiency. Q677H independently evolved in at least six distinct US lineages in late 2020, a striking example of convergent evolution under positive selection (Zeng et al. 2021).
The early detection of these sites by entropy clustering – before the
variants carrying them were formally designated – illustrates the
prospective surveillance potential of the ViralEntropR
pipeline.
# Entropy-based site clustering for the last partition
cat("\nSite clustering for partition 10 (final partition):\n")
#>
#> Site clustering for partition 10 (final partition):
part_data$Clusters[[10]]$DataFrameFrom here, pass part_data$Partitions to
calculate_hellinger_matrix() to compute Hellinger distances
across time windows, then transpose and apply
detect_changepoints_ecp() or e.agglo() for
change point detection. See the companion vignette “Entropy
Clustering, Hellinger Distance, and Change Point Analysis for Emerging
Viral Variant Detection: A Simulation Study” for a complete worked
example.
Download sequences.fasta manually from Zenodo (https://zenodo.org/records/19040165)
|
v
readAAStringSet() # Step 1: Load FASTA into R
|
v
Confirm width() == 1273L # Step 2: All sequences match canonical Spike length
# (YP_009724390 -- no filtering required)
|
v
extract_fasta_dates() # Step 3: Pass 1 -- attempt yyyy-mm-dd
-> re-extract with yyyy-mm # Pass 2 -- broader pattern, remove year-only
|
v
extract_fasta_countries() # Step 4: Extract country field from header
-> remove missing # Iterative: remove, re-extract, confirm
|
v
fasta_to_char_matrix() # Step 5: AAStringSet -> character matrix (rows x sites)
|
v
filter_ambiguous_sequences() # Remove sequences with B, X, Z, unknown
-> align date/country vectors # Drop same rows from metadata
|
v
encode_aa_sequence() # Character -> integer (25-AA alphabet)
|
v
Assemble data frame # Add Date (yyyy-mm-01) + Country, sort
|
v
partition_time_windows() # Downstream: partition into time windows
| # -> entropy clustering per partition
v
calculate_hellinger_matrix() # Hellinger distances (sites x partitions)
|
v
detect_changepoints_ecp() # Change point detection