1 Introduction

This vignette demonstrates the complete ViralEntropR preprocessing pipeline for real-world SARS-CoV-2 amino acid sequence data downloaded from the NCBI SARS-CoV-2 Data Hub. Starting from a raw FASTA file, the pipeline produces a clean, integer-encoded data frame with a Date column ready for partition_time_windows() and all downstream entropy and Hellinger distance analyses.

The pipeline consists of five steps:

Download and load the full FASTA dataset.
Confirm the canonical Spike protein length (1,273 aa).
Extract and validate collection dates and countries from FASTA headers.
Remove sequences with missing or unparseable metadata.
Convert, filter, encode, and assemble the final analysis-ready data frame.

Reproducibility note: The full 137,132-sequence dataset (~181.5 MB) is archived on Zenodo (DOI: 10.5281/zenodo.19040165).

2 Setup

library(ViralEntropR)
library(Biostrings)
library(dplyr)
library(ggplot2)

3 Step 1: Load the FASTA Dataset

3.1 Download Data

The complete NCBI SARS-CoV-2 Spike protein dataset (137,132 sequences, ~181.5 MB) is archived on Zenodo under DOI 10.5281/zenodo.19040165. Download sequences.fasta manually from the Zenodo record and update the path in the code chunk below to match your local copy.

3.2 Update Path

fasta_path <- "path/to/sequences.fasta"   # update this

3.3 Read FASTA file

fasta <- Biostrings::readAAStringSet(fasta_path)
sprintf("Total sequences loaded:  %d", length(fasta))
#> [1] "Total sequences loaded:  137132"
sprintf("Unique sequences:        %d", length(unique(fasta)))
#> [1] "Unique sequences:        17856"

Inspect a sample of sequence headers to understand the metadata layout before extraction. NCBI Virus exports headers in a pipe-delimited format:

Accession | Country | Collection_Date |

# Print 5 randomly sampled headers to see the field layout
set.seed(42)
cat(paste(sample(names(fasta), 5), collapse = "\n\n"))
#> QTW64456.1 |USA|2021-03-28
#> 
#> QUA73175.1 |USA|2021-04-04
#> 
#> QRG16329.1 |USA|2020-06-28
#> 
#> QTF85499.1 |USA|2021-03-08
#> 
#> QUV31133.1 |USA|2021-04-20

3.4 Curated metadata for 12 SARS-CoV-2 VOCs and VOIs


# --- Overview of all 12 variants -------------------------------------------

data.frame(
  WHO_Label      = unlist(sarscov2_variants$WHO_Label),
  Pango_Lineage  = unlist(sarscov2_variants$Pango_Lineage),
  GISAID_Clade   = sarscov2_variants$GISAID_Clade,
  Nextstrain     = sarscov2_variants$Nextstrain_Clade,
  First_Detected = sarscov2_variants$Date_First_Detected,
  First_US       = sarscov2_variants$Date_First_Detected_US
)

# To learn more
?sarscov2_variants

3.5 Bundled FASTA sample

# sarscov2_sample.fasta.gz contains 100 random sequences from the full NCBI dataset.
# Useful for testing pipeline code without downloading the full ~181.5 MB file.

path_sample  <- system.file("extdata", "sarscov2_sample.fasta.gz",
                     package = "ViralEntropR")
fasta_sample <- Biostrings::readAAStringSet(path_sample)

sprintf("Sample sequences: %d", length(fasta_sample))
#> [1] "Sample sequences: 100"

# check NAMES structure
sample(fasta_sample@ranges@NAMES,1)
#> [1] "QWK63798.1 |USA|2021-05-17"

# Countries represented in the sample
sort(table(extract_fasta_countries(fasta_sample, position = 2)$countries),
     decreasing = TRUE)
#> 
#>          USA    Australia        India  New Zealand   Bangladesh        Chile      Germany Saudi Arabia 
#>           84            7            3            2            1            1            1            1

# To learn more
?sarscov2_sample

4 Step 2: Canonical Spike Protein Length

All sequences in this dataset are 1,273 amino acids in length. No length filtering is required because the NCBI download was configured to retrieve only complete Spike protein sequences, which naturally match the canonical reference.

Why 1,273 amino acids? The SARS-CoV-2 Spike glycoprotein reference sequence (YP_009724390) is 1,273 amino acids long. This length corresponds to the full Spike precursor protein, comprising the signal peptide, the S1 subunit (receptor binding domain, N-terminal domain, and furin cleavage site), and the S2 subunit (fusion peptide, heptad repeats, transmembrane domain, and cytoplasmic tail). This fixed-length representation enforces positional homology by construction, allowing site indices to be interpreted consistently across sequences.

sprintf("All sequences are length 1,273: %s",
        all(width(fasta) == 1273L))
#> [1] "All sequences are length 1,273: TRUE"

5 Step 3: Extract and Validate Collection Dates

5.1 Initial Date Extraction (yyyy-mm-dd)

We first attempt full-precision date extraction (yyyy-mm-dd) to identify sequences that only carry year-month (yyyy-mm) or year-only (yyyy) dates – both common in NCBI submissions.

# Pass 1: attempt full yyyy-mm-dd extraction
dates_ymd <- extract_fasta_dates(
  fasta,
  custom_pattern = "(?<=\\|)[0-9]{4}-(0?[1-9]|1[0-2])-(0?[1-9]|[12][0-9]|3[01]|00)"
)

sprintf("Sequences with full yyyy-mm-dd dates: %d",
        sum(!is.na(dates_ymd$raw_date_strings)))
#> [1] "Sequences with full yyyy-mm-dd dates: 133864"
sprintf("Sequences missing full date:          %d",
        sum(is.na(dates_ymd$raw_date_strings)))
#> [1] "Sequences missing full date:          3268"

# Inspect headers of sequences that lack a full yyyy-mm-dd date
if (length(dates_ymd$missing_id) > 0) {
  cat("Sample headers with partial/missing dates:\n")
  cat(paste(head(names(fasta)[dates_ymd$missing_id], 5), collapse = "\n"))
}
#> Sample headers with partial/missing dates:
#> YP_009724390.1 |China|2019-12
#> UBX21225.1 |USA|2020-09
#> UBX21237.1 |USA|2020-09
#> UBX21249.1 |USA|2020-09
#> UBX21261.1 |USA|2020-09

5.2 Second Pass: yyyy-mm Extraction

For monthly partitioning, yyyy-mm precision is sufficient and more inclusive than requiring the full day. We re-extract dates in yyyy-mm format. Any sequence that only has a year (yyyy) with no month will still fail and must be removed.

# Pass 2: extract yyyy-mm (covers both yyyy-mm-dd and yyyy-mm submissions)
dates_result <- extract_fasta_dates(
  fasta,
  custom_pattern = "(?<=\\|)[0-9]{4}-(0[1-9]|1[0-2])",
  date_format    = "%Y-%m"
)

sprintf("Sequences with parseable yyyy-mm dates: %d", 
        sum(!is.na(dates_result$raw_date_strings)))
#> [1] "Sequences with parseable yyyy-mm dates: 136705"
sprintf("Sequences still missing a date:         %d",
        sum(is.na(dates_result$raw_date_strings)))
#> [1] "Sequences still missing a date:         427"

# Sequences with year-only dates -- these cannot be placed in a monthly window
if (length(dates_result$missing_id) > 0) {
  cat("Headers with year-only dates (to be removed):\n")
  cat(paste(head(names(fasta)[dates_result$missing_id], 5), collapse = "\n"))
}
#> Headers with year-only dates (to be removed):
#> UAU30010.1 |USA|2021
#> QZR93642.1 ||
#> QZR93654.1 ||
#> QZR93666.1 ||
#> QZR93678.1 ||

# Remove sequences with year-only dates
if (length(dates_result$missing_id) > 0) {
  fasta        <- fasta[-dates_result$missing_id]
  dates_result <- extract_fasta_dates(
    fasta,
    custom_pattern = "(?<=\\|)[0-9]{4}-(0[1-9]|1[0-2])",
    date_format    = "%Y-%m"
  )
}

cat("Date extraction status after removal:", dates_result$message, "\n")
#> Date extraction status after removal: All date strings have been successfully extracted
sprintf("Sequences remaining: %d", length(fasta))
#> [1] "Sequences remaining: 136705"

5.3 Date Distribution (Before Ambiguous Residue Filtering)

Visualising the date distribution helps verify that the data spans the expected surveillance window and that no anomalous date clusters are present.

dates_df_before <- data.frame(
  Date = as.Date(dates_result$corrected_dates)
) %>%
  mutate(YearMonth = format(Date, "%Y-%m"))

ggplot(dates_df_before, aes(x = YearMonth)) +
  geom_bar(stat = "count", fill = "steelblue") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 8),
    plot.title  = element_text(hjust = 0.5)
  ) +
  labs(
    x     = "Collection Month",
    y     = "Number of Sequences",
    title = "Sequence Count by Collection Month (before ambiguous residue filtering)"
  )

6 Step 4: Extract and Validate Countries

# Extract country field (position 2 = between first and second pipe)
countries_result <- extract_fasta_countries(fasta, position = 2)

sprintf("Sequences with parseable country: %d", 
        sum(!is.na(countries_result$countries)))
#> [1] "Sequences with parseable country: 136702"
sprintf("Sequences with missing country:   %d",
        sum(is.na(countries_result$countries)))
#> [1] "Sequences with missing country:   3"

# Inspect headers with missing countries before removing
if (length(countries_result$missing_id) > 0) {
  cat("Sample headers with missing country field:\n")
  cat(paste(head(names(fasta)[countries_result$missing_id], 5), collapse = "\n"))
  
  # Remove sequences with missing countries
  fasta            <- fasta[-countries_result$missing_id]
  countries_result <- extract_fasta_countries(fasta, position = 2)
}
#> Sample headers with missing country field:
#> BCM16162.1 ||2020-03
#> BCA87361.1 ||2020-02-10
#> BCA87371.1 ||2020-02-10

cat("Country extraction status after removal:", countries_result$message, "\n")
#> Country extraction status after removal: All countries have been extracted
sprintf("Sequences remaining: %d", length(fasta))
#> [1] "Sequences remaining: 136702"

# Re-extract dates after removing sequences with missing countries
dates_result <- extract_fasta_dates(
  fasta,
  custom_pattern = "(?<=\\|)[0-9]{4}-(0[1-9]|1[0-2])",
  date_format    = "%Y-%m"
)

# Confirm no remaining missing values in either field
sprintf("Missing dates remaining:     %d", sum(is.na(dates_result$corrected_dates)))
#> [1] "Missing dates remaining:     0"
sprintf("Missing countries remaining: %d", sum(is.na(countries_result$countries)))
#> [1] "Missing countries remaining: 0"

# Country distribution -- sequences per country of collection
countries_result$countries %>%
  table() %>%
  sort(decreasing = TRUE) 
#> .
#>            USA      Australia          India          Egypt         Mexico          Japan    New Zealand     Bangladesh 
#>         119385           9571           1132            896            891            609            583            438 
#>          Chile       Pakistan      Hong Kong          China        Bahrain         Poland         Serbia   Saudi Arabia 
#>            290            229            203            167            147            147            146            144 
#>         Israel          Ghana           Peru         Greece         France          Italy        Germany          Spain 
#>            119            116            104             97             89             89             81             61 
#>      West Bank   Sierra Leone        Tunisia         Turkey         Brazil       Djibouti         Taiwan           Iraq 
#>             58             56             56             51             50             50             48             47 
#>      Argentina    South Korea          Kenya        Myanmar    Philippines         Russia Czech Republic         Jordan 
#>             34             31             30             26             24             24             23             22 
#>       Portugal      Venezuela        Georgia       Thailand    Netherlands        Austria        Uruguay     Uzbekistan 
#>             22             22             21             20             18             15             14             14 
#>        Denmark United Kingdom          Benin           Iran        Morocco         Uganda        Finland          Libya 
#>             13             13             12             12             12             12             11             11 
#>      Guatemala        Jamaica       Malaysia    Timor-Leste       Ethiopia         Gambia         Canada     Kazakhstan 
#>             10              8              8              8              6              6              5              5 
#>      Sri Lanka         Belize        Ecuador        Lebanon          Malta       Colombia         Guinea           Mali 
#>              5              4              4              4              4              2              2              2 
#>        Nigeria       Viet Nam        Belarus        Belgium       Cambodia          Nepal   South Africa         Sweden 
#>              2              2              1              1              1              1              1              1 
#>    Switzerland           Togo         Zambia 
#>              1              1              1

7 Step 5: Convert, Filter, and Encode the Sequence Matrix

7.1 Convert FASTA to Character Matrix

fasta_to_char_matrix() converts the AAStringSet object to a plain R character matrix (rows = sequences, columns = sites). This fully vectorized implementation is substantially faster than row-by-row strsplit approaches on large datasets.

char_mat <- fasta_to_char_matrix(fasta)

sprintf("Character matrix dimensions: %d sequences x %d sites",
        nrow(char_mat), ncol(char_mat))
#> [1] "Character matrix dimensions: 136702 sequences x 1273 sites"

# Inspect a small region
char_mat[1:5, 1:10]
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] "M"  "F"  "V"  "F"  "L"  "V"  "L"  "L"  "P"  "L"  
#> [2,] "M"  "F"  "V"  "F"  "L"  "V"  "L"  "L"  "P"  "L"  
#> [3,] "M"  "F"  "V"  "F"  "L"  "V"  "L"  "L"  "P"  "L"  
#> [4,] "M"  "F"  "V"  "F"  "L"  "V"  "L"  "L"  "P"  "L"  
#> [5,] "M"  "F"  "V"  "F"  "L"  "V"  "L"  "L"  "P"  "L"

7.2 Filter Ambiguous Residues

Ambiguous amino acid codes (B, X, Z) and unrecognised characters (encoded as 0) can distort entropy and Hellinger distance calculations by inflating apparent diversity at a site. filter_ambiguous_sequences() removes any sequence containing at least one such residue.

option = 2 specifies a character matrix input (as returned by fasta_to_char_matrix()).

filtered <- filter_ambiguous_sequences(char_mat, option = 2)

cat(filtered$OriginalDim,     "\n")
#> Number of original sequences is 136702
cat(filtered$NewDim,          "\n")
#> Number of filtered sequences is 125130
cat(filtered$NumberAmbiguous, "\n")
#> Number of sequences containing at least one of B, X, Z or J characters is 11572
cat(filtered$RangeAmbiguous,  "\n")
#> Number of ambiguous protein characters per sequence varies between 1 and 287

# Extract the clean matrix and align metadata to retained rows
clean_char_mat  <- filtered$FilteredMatrix
deleted_rows    <- filtered$DeletedSeqId

# Remove deleted rows from date and country vectors
corrected_dates <- dates_result$corrected_dates[-deleted_rows]
countries       <- countries_result$countries[-deleted_rows]

sprintf("Clean matrix: %d sequences x %d sites",
        nrow(clean_char_mat), ncol(clean_char_mat))
#> [1] "Clean matrix: 125130 sequences x 1273 sites"

7.3 Encode Amino Acids to Integers

encode_aa_sequence() maps each character to its integer code using the standard 25-character ViralEntropR alphabet. After ambiguous residue filtering only codes 1–20 (the 20 standard amino acids) should remain.

Codes	Characters	Meaning
1–20	A R N D C Q E G H I L K M F P S T W Y V	Standard amino acids
21	B	Asp/Asn ambiguity
22	Z	Glu/Gln ambiguity
23	X	Any amino acid (unknown)
24	*	Stop codon
25	-	Gap
0	(other)	Unrecognised character

int_mat <- encode_aa_sequence(clean_char_mat)

# Verify only codes 1-20 remain
code_range <- range(int_mat)
sprintf("Encoded value range: %d to %d  (expected: 1 to 20 after filtering)",
        code_range[1], code_range[2])
#> [1] "Encoded value range: 1 to 20  (expected: 1 to 20 after filtering)"

8 Step 6: Assemble the Final Data Frame

We combine the integer matrix with the aligned date and country metadata into a single data frame, standardise dates to the first of each month for monthly partitioning, and sort by date. This object is passed directly to partition_time_windows().

n_sites <- ncol(int_mat)

# Build the data frame
AL_df           <- as.data.frame(int_mat)
colnames(AL_df) <- as.character(seq_len(n_sites))
AL_df[]         <- lapply(AL_df, as.integer)

# Standardise to first of each month for consistent monthly partitioning
AL_df$Date    <- as.Date(format(corrected_dates, "%Y-%m-01"))
AL_df$Country <- countries

# Sort by date -- required by partition_time_windows()
AL_df         <- AL_df[order(AL_df$Date), ]
rownames(AL_df) <- NULL

sprintf("Final data frame: %d sequences x %d site columns + Date + Country",
        nrow(AL_df), n_sites)
#> [1] "Final data frame: 125130 sequences x 1273 site columns + Date + Country"

# Preview first few rows -- site columns and metadata
AL_df[1:10, c(1:5, n_sites + 1, n_sites + 2)]

# Country distribution in the clean final dataset
AL_df$Country %>% table() %>% sort(decreasing = TRUE)
#> .
#>            USA      Australia          India         Mexico          Egypt          Japan    New Zealand     Bangladesh 
#>         109536           8473           1007            869            797            563            551            424 
#>          Chile       Pakistan      Hong Kong          China         Serbia         Poland   Saudi Arabia        Bahrain 
#>            286            218            187            164            144            140            133            128 
#>         Israel         Greece           Peru         France          Italy        Germany          Spain      West Bank 
#>            118             97             91             89             82             75             58             54 
#>   Sierra Leone        Tunisia       Djibouti         Brazil         Taiwan           Iraq         Turkey      Argentina 
#>             52             52             50             48             48             43             43             33 
#>    South Korea          Ghana        Myanmar    Philippines         Russia         Jordan      Venezuela        Georgia 
#>             31             26             26             24             24             22             22             21 
#> Czech Republic       Thailand       Portugal    Netherlands          Kenya        Austria        Uruguay          Benin 
#>             20             20             19             17             16             15             14             12 
#>           Iran        Morocco        Finland          Libya      Guatemala     Uzbekistan        Jamaica United Kingdom 
#>             12             12             11             11             10              9              8              8 
#>         Gambia     Kazakhstan       Malaysia         Canada        Denmark        Ecuador          Malta      Sri Lanka 
#>              6              5              5              4              4              4              4              4 
#>         Belize       Ethiopia        Lebanon       Colombia         Guinea           Mali        Nigeria    Timor-Leste 
#>              3              3              3              2              2              2              2              2 
#>       Viet Nam        Belarus        Belgium       Cambodia          Nepal   South Africa         Sweden    Switzerland 
#>              2              1              1              1              1              1              1              1 
#>           Togo         Uganda         Zambia 
#>              1              1              1

8.1 Date Distribution (After Filtering)

Compare the final date distribution against the earlier plot to confirm that the filtering steps have not introduced systematic temporal bias.

ggplot(AL_df, aes(x = format(Date, "%Y-%m"))) +
  geom_bar(stat = "count", fill = "steelblue") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 8),
    plot.title  = element_text(hjust = 0.5)
  ) +
  labs(
    x     = "Collection Month",
    y     = "Number of Sequences",
    title = "Sequence Count by Collection Month (clean dataset, ready for analysis)"
  )

sprintf("Date range: %s to %s",
        format(min(AL_df$Date), "%Y-%m"),
        format(max(AL_df$Date), "%Y-%m"))
#> [1] "Date range: 2019-12 to 2021-09"

sprintf("Total sequences in final dataset: %d", nrow(AL_df))
#> [1] "Total sequences in final dataset: 125130"

8.2 Filter `US` Sequences and Save

AL_df_US <- AL_df[!is.na(AL_df$Country) & AL_df$Country == "USA", ]
sprintf("Sequences after country filter (%s): %d", "USA", nrow(AL_df_US))

saveRDS(AL_df_US, file.path("data-raw", "NCBI_US_unaligned_feature_matrix_1273aa.rds"))

9 Downstream Analysis

The data frame AL_df is now ready for the full ViralEntropR pipeline. The site columns contain integer-encoded amino acids (1–20), Date contains first-of-month dates for monthly partitioning, and Country is available for geographic stratification.

# Example: partition into non-overlapping 2-month windows
part_data <- partition_time_windows(
  data          = AL_df,
  n_sites       = n_sites,
  window_length = 2,
  window_type   = 3,      # non-overlapping / jumping windows
  verbose       = FALSE
)

sprintf("Number of 2-month partitions created: %d", part_data$N_partitions)
#> [1] "Number of 2-month partitions created: 10"
cat("\nPartition date labels:\n")
#> 
#> Partition date labels:
cat(paste(part_data$Dates_Labels, collapse = "\n"))
#> Dec-2019 - Jan-2020
#> Feb-2020 - Mar-2020
#> Apr-2020 - May-2020
#> Jun-2020 - Jul-2020
#> Aug-2020 - Sep-2020
#> Oct-2020 - Nov-2020
#> Dec-2020 - Jan-2021
#> Feb-2021 - Mar-2021
#> Apr-2021 - May-2021
#> Jun-2021 - Jul-2021

# Entropy-based site clustering for the first partition
cat("\nSite clustering for partition 1 (reference period):\n")
#> 
#> Site clustering for partition 1 (reference period):
part_data$Clusters[[1]]$DataFrame

As the output shows, entropy-based analysis of early pandemic sequences already flags several sites that subsequently proved critical across multiple VOCs and VOIs:

Site 95 (T95I) – located in the N-terminal domain (NTD) of the Spike protein. The T95I substitution alters the topology of the NTD neutralization “supersite”, reducing antibody recognition, and is associated with increased viral load. It is a defining mutation of the Iota (B.1.526) variant and also appears in Delta and Kappa (Magazine et al. 2022).
Site 655 (H655Y) – located immediately upstream of the furin cleavage site at the S1/S2 boundary. The H655Y substitution enhances spike protein cleavage efficiency and viral replication, and has been described as a “gateway mutation” that promotes fitness in emerging lineages. It is a defining feature of Gamma (P.1) and Omicron (Escalera et al. 2022).
Site 677 (Q677H/P) – situated only three residues upstream of the polybasic furin cleavage site (RRAR). Mutations at this position are thought to promote conformational changes that enhance accessibility of the S1/S2 junction to proteolytic cleavage, increasing viral entry efficiency. Q677H independently evolved in at least six distinct US lineages in late 2020, a striking example of convergent evolution under positive selection (Zeng et al. 2021).

The early detection of these sites by entropy clustering – before the variants carrying them were formally designated – illustrates the prospective surveillance potential of the ViralEntropR pipeline.

# Entropy-based site clustering for the last partition
cat("\nSite clustering for partition 10 (final partition):\n")
#> 
#> Site clustering for partition 10 (final partition):
part_data$Clusters[[10]]$DataFrame

From here, pass part_data$Partitions to calculate_hellinger_matrix() to compute Hellinger distances across time windows, then transpose and apply detect_changepoints_ecp() or e.agglo() for change point detection. See the companion vignette “Entropy Clustering, Hellinger Distance, and Change Point Analysis for Emerging Viral Variant Detection: A Simulation Study” for a complete worked example.

10 Pipeline Summary

Download sequences.fasta manually from Zenodo (https://zenodo.org/records/19040165)
  |
  v
readAAStringSet()                 # Step 1: Load FASTA into R
  |
  v
Confirm width() == 1273L          # Step 2: All sequences match canonical Spike length
                                  # (YP_009724390 -- no filtering required)
  |
  v
extract_fasta_dates()             # Step 3: Pass 1 -- attempt yyyy-mm-dd
  -> re-extract with yyyy-mm      #          Pass 2 -- broader pattern, remove year-only
  |
  v
extract_fasta_countries()         # Step 4: Extract country field from header
  -> remove missing               #          Iterative: remove, re-extract, confirm
  |
  v
fasta_to_char_matrix()            # Step 5: AAStringSet -> character matrix (rows x sites)
  |
  v
filter_ambiguous_sequences()      #          Remove sequences with B, X, Z, unknown
  -> align date/country vectors   #          Drop same rows from metadata
  |
  v
encode_aa_sequence()              #          Character -> integer (25-AA alphabet)
  |
  v
Assemble data frame               #          Add Date (yyyy-mm-01) + Country, sort
  |
  v
partition_time_windows()          # Downstream: partition into time windows
  |                               #             -> entropy clustering per partition
  v
calculate_hellinger_matrix()      #             Hellinger distances (sites x partitions)
  |
  v
detect_changepoints_ecp()         #             Change point detection

References

Escalera, Anke, Ana S. Gonzalez-Reiche, Saad Aslam, Isabel Mena, Maria Laporte, Richard L. Pearl, Andrea Fossati, et al. 2022. “Mutations in SARS-CoV-2 Variants of Concern Link to Increased Spike Cleavage and Virus Transmission.” Cell Host & Microbe 30 (3): 373–387.e7. https://doi.org/10.1016/j.chom.2022.01.006.

Magazine, Nidhi, Tongai Zhang, Yuze Wu, Madeline C. McGee, Giovanni Veggiani, and Wenyan Huang. 2022. “Mutations and Evolution of the SARS-CoV-2 Spike Protein.” Viruses 14 (3): 640. https://doi.org/10.3390/v14030640.

Sayers, Eric W., Jeffrey Beck, J. Rodney Brister, Evan E. Bolton, Kathi Canese, Donald C. Comeau, Kristin Funk, et al. 2022. “Database Resources of the National Center for Biotechnology Information.” Nucleic Acids Research 50 (D1): D20–26. https://doi.org/10.1093/nar/gkab1112.

Tyuryaev, V., H. Jankowski, and J. M. Heffernan. 2026. “SARS-CoV-2 Surface Glycoprotein Sequences, NCBI Data Hub, October 2021 (ViralEntropR Archive).” Zenodo. https://doi.org/10.5281/zenodo.19040165.

Zeng, Hongjie, Adam Miller, Jian Qin, Michael Scott, Onyekachi Amadi, Adam Bauer, Anthony L. Valesano, et al. 2021. “Emergence of a SARS-CoV-2 Variant with Mutations in Spike Glycoprotein at Position 677.” mBio 12 (6): e02764–21. https://doi.org/10.1128/mBio.02764-21.

NCBI SARS-CoV-2 Spike Protein Sequence Preprocessing: From Raw FASTA to an Analysis-Ready Integer-Encoded Matrix

Vadim Tyuryaev, Jane Heffernan, Hanna Jankowski

2026-05-08

1 Introduction

2 Setup

3 Step 1: Load the FASTA Dataset

3.1 Download Data

3.2 Update Path

3.3 Read FASTA file

3.4 Curated metadata for 12 SARS-CoV-2 VOCs and VOIs

3.5 Bundled FASTA sample

4 Step 2: Canonical Spike Protein Length

5 Step 3: Extract and Validate Collection Dates

5.1 Initial Date Extraction (yyyy-mm-dd)

5.2 Second Pass: yyyy-mm Extraction

5.3 Date Distribution (Before Ambiguous Residue Filtering)

6 Step 4: Extract and Validate Countries

7 Step 5: Convert, Filter, and Encode the Sequence Matrix

7.1 Convert FASTA to Character Matrix

7.2 Filter Ambiguous Residues

7.3 Encode Amino Acids to Integers

8 Step 6: Assemble the Final Data Frame

8.1 Date Distribution (After Filtering)

8.2 Filter `US` Sequences and Save

9 Downstream Analysis

10 Pipeline Summary

References

NCBI SARS-CoV-2 Spike Protein Sequence Preprocessing: From Raw FASTA to an Analysis-Ready Integer-Encoded Matrix

Vadim Tyuryaev, Jane Heffernan, Hanna Jankowski

2026-05-08

1 Introduction

2 Setup

3 Step 1: Load the FASTA Dataset

3.1 Download Data

3.2 Update Path

3.3 Read FASTA file

3.4 Curated metadata for 12 SARS-CoV-2 VOCs and VOIs

3.5 Bundled FASTA sample

4 Step 2: Canonical Spike Protein Length

5 Step 3: Extract and Validate Collection Dates

5.1 Initial Date Extraction (yyyy-mm-dd)

5.2 Second Pass: yyyy-mm Extraction

5.3 Date Distribution (Before Ambiguous Residue Filtering)

6 Step 4: Extract and Validate Countries

7 Step 5: Convert, Filter, and Encode the Sequence Matrix

7.1 Convert FASTA to Character Matrix

7.2 Filter Ambiguous Residues

7.3 Encode Amino Acids to Integers

8 Step 6: Assemble the Final Data Frame

8.1 Date Distribution (After Filtering)

8.2 Filter US Sequences and Save

9 Downstream Analysis

10 Pipeline Summary

References

8.2 Filter `US` Sequences and Save