MicroSEC: Sequence artifact filtering pipeline for FFPE samples

Masachika Ikegami

2024-08-25

Introduction

This pipeline is designed for filtering mutations found in formalin-fixed and paraffin-embedded (FFPE) samples. The MicroSEC filter utilizes a statistical analysis, and the results for mutations with less than 10 supporting reads are not reliable. Four files are necessary for the analysis: mutation information file, BAM file, and mutation supporting read ID information file.

File 1: mutation information file (mandatory)

This tsv or tsv.gz file should contain at least these columns: Sample Mut_type Chr Pos Ref Alt SimpleRepeat_TRF Neighborhood_sequence
sample_name 1-snv chr1 108130741 C T N CTACCTGGAGAATGGGCCCATGTGTCCAGGTAGCAGTAAGC
SimpleRepeat_TRF: Whether the mutation locates at a simple repeat sequence or not (“Y” or “N”).
Neighborhood_sequence: [5’-20 bases] + [Alt sequence] + [3’-20 bases]
Sample, Mut_type, Chr, Pos, Ref, and Alt should be set exactly.
If you do not know the SimpleRepeat_TRF, Mut_type, or Neighborhood_sequence, enter “-”. Automatically detected.

File 2: BAM file (mandatory)

File 3: sample information tsv file

(mandatory, if multiple samples are processed in a batch)
Seven to ten columns are necessary (without column names).
Optional columns can be deleted if they are not applicable.
[sample name] [mutation information tsv file] [BAM file] [read length] [adapter sequence read 1] [optional: adapter sequence read 2] [sample type: Human or Mouse] [panel name] [optional: reference genome fastq file] [optional: simple repeat region bed file]
PC9 ./source/CCLE.tsv ./source/Cell_line/PC9.bam 127 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Human TOP
A375 ./source/CCLE.tsv.gz ./source/Cell_line/A375.bam 127 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Hg38 TOP ./reference/hg38.fa ./reference/simpleRepeat.bed.gz

File 4: Reference genome: Human (hg38 or hg19) or Mouse (mm10)

(optional, but mandatory with cram files)

File 5: simple repeat region bed file

(optional, but mandatory to detect simple repeat derived artifacts)

This pipeline contains 8 filtering processes.
Filter 1 : Shorter-supporting lengths distribute too unevenly to occur (1-1 and 1-2).
Filter 1-1: P-values are less than the threshold_p(default: 10^(-6)).
Filter 1-2: The shorter-supporting lengths distributed over less than 75% of the read length.
Filter 2 : Hairpin-structure induced error detection (2-1 and 2-2).
Filter 2-1: Palindromic sequences exist within 150 bases. Filter 2-2: >=50% mutation-supporting reads contains a reverse complementary sequence of the opposite strand consisting >= 15 bases.
Filter 3 : 3’-/5’-supporting lengths are too unevenly distributed to occur (3-1 and 3-3).
Filter 3-1: P-values are less than the threshold_p(default: 10^(-6)).
Filter 3-2: The distributions of 3’-/5’-supporting lengths are within 75% of the read length.
Filter 4 : >=15% mutations were called by chimeric reads comprising two distant regions.
Filter 5 : >=50% mutations were called by soft-clipped reads.
Filter 6 : Mutations locating at simple repeat sequences.
Filter 7 : Mutations locating at a >=15 homopolymer.
Filter 8 : >=10% low quality bases (Quality score <18) in the mutation supporting reads.

Supporting lengths are adjusted considering small repeat sequences around the mutations.

How to use

Rscript MicroSEC.R [working/output directory] [sample information tsv file] [progress bar Y/N]

Example

$ Rscript MicroSEC.R ~ \
  ~/source/Sample_list_test.txt N
$ Rscript MicroSEC.R ~ \
   ~/source/sample_info_test.tsv.gz Y

If you want to know the progress visually, [progress bar Y/N] should be Y.
Results are saved in a tsv file.

github url: https://github.com/MANO-B/MicroSEC

Setting

wd <- "~"
knitr::opts_chunk$set(collapse = TRUE,
                      fig.width = 12,
                      fig.height = 8,
                      echo = TRUE,
                      warning = FALSE,
                      message = TRUE,
                      comment = "#>")
options(rmarkdown.html_vignette.check_title = FALSE,
        show.error.messages = FALSE,
        warn = -1)

progress_bar <- "N"

Necessary packages

library(MicroSEC)

Analysis

# initialize
msec <- NULL
homology_search <- NULL
mut_depth <- NULL

# test data
sample_name <- "sample"
read_length <- 150
adapter_1 <- "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
adapter_2 <- "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
organism <- "hg38"

# load mutation information
df_mutation <- fun_load_mutation(
   system.file("extdata", "mutation_list.tsv", package = "MicroSEC"),
   "sample",
   BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
   24)
df_bam <- fun_load_bam(
   system.file("extdata", "sample.bam", package = "MicroSEC"))

# another example data
# data(exampleMutation)
# data(exampleBam)
# df_mutation <- exampleMutation
# df_bam <- exampleBam

# load genomic sequence
ref_genome <- fun_load_genome(organism)
chr_no <- fun_load_chr_no(organism)

# analysis
result <- fun_read_check(df_mutation = df_mutation,
                         df_bam = df_bam,
                         ref_genome = ref_genome,
                         sample_name = sample_name,
                         read_length = read_length,
                         adapter_1 = adapter_1,
                         adapter_2 = adapter_2,
                         short_homology_search_length = 4,
                         min_homology_search = 40,
                         progress_bar = progress_bar)
msec_read_checked <- result[[1]]
homology_searched <- result[[2]]
mut_depth_checked <- result[[3]]

# search homologous sequences
msec_homology_searched = fun_homology(msec = msec_read_checked,
                    df_distant = homology_searched,
                    min_homology_search = 40,
                    ref_genome = ref_genome,
                    chr_no = chr_no,
                    progress_bar = progress_bar)

# statistical analysis
msec_summarized <- fun_summary(msec_homology_searched)
msec_analyzed <- fun_analysis(msec = msec_summarized,
                    mut_depth = mut_depth_checked,
                    short_homology_search_length = 4,
                    min_homology_search = 40,
                    threshold_p = 10 ^ (-6),
                    threshold_hairpin_ratio = 0.50,
                    threshold_short_length = 0.75,
                    threshold_distant_homology = 0.15,
                    threshold_soft_clip_ratio = 0.50,
                    threshold_low_quality_rate = 0.1,
                    homopolymer_length = 15)

# save the results as a tsv.gz file.
#fun_save(msec_analyzed, "~/MicroSEC_test.tsv.gz")

Results

msec_analyzed
#>    Sample Mut_type   Chr       Pos Ref Alt SimpleRepeat_TRF
#> 1  sample    2-snv chr11  17719997  CC  GG                N
#> 2  sample    2-snv chr11  69295601  CT  AA                N
#> 3  sample    2-snv chr15  98707826  TC  CG                N
#> 4  sample    1-ins  chr2  47783391   C  CA                N
#> 5  sample    2-del  chr2  60920434 CAG   C                N
#> 6  sample    1-snv  chr3  72446880   G   C                N
#> 7  sample    2-snv  chr3 179198865 CAC AAG                N
#> 8  sample    2-del  chr3 181742887 TTA   T                Y
#> 9  sample    1-snv  chr5 180618885   G   A                N
#> 10 sample    2-snv  chr6  29725282  CC  GG                N
#> 11 sample    2-snv  chr6  88143884  GA  TC                N
#> 12 sample    2-snv  chr7  64068966  CG  AA                N
#> 13 sample    2-snv  chr8  23681315  GC  CA                N
#>                          Neighborhood_sequence read_length total_read
#> 1   CCCCGCGGCGGTGCACCCGGGGCCGGGCGCACGTGAGGACGA         150          9
#> 2   CTTGTCTGTAGTCTGTCCCCAATGGGGACAGGGACTCGTTGC         150         28
#> 3   CAACTACGCCCTGGTCATCTCGGAGATGACCAATCTCAAGGA         150         17
#> 4   GGATGCGGCCTGGAGCGAGGCATGGGCCTGGGCCCAGGCCCT         150         13
#> 5    TGGTCTTGAACTCCTGACATCGTGATCCACCCACCTTGGCC         150         34
#> 6    CTCGCACACCAGGCGGTGGCCGCGGCGCCCCGGGAAGGGCC         150         16
#> 7  CAGGTGAACTGTGGGGCATCAAGTTGATGCCCCCAAGAATCCT         150         23
#> 8    GGGGGAAAGGACAGAGCAGTTTATATATATATATATATATA         150          6
#> 9    AGCCTGGCGAGCTCCACCATAGCGCGGAAGCGTCCGCGCTG         150         19
#> 10  ATGTGCAGCACGAGGGGCTGGGCCAGCCCCTCATCCTGAGAT         150          9
#> 11  CCTCGGCAGACGTGTCTGTGTCCACAGACATGGTTACCTTGG         150         23
#> 12  GCCTGGACTCTGCTCAGCAGAATTTGTATAGGGATGTGATGT         150         11
#> 13  CCAGGGAGGCCCGGGAGAAGCACTCCTCTTTCAGGGCCGGCA         150         34
#>    soft_clipped_read flag_hairpin pre_support_length post_support_length
#> 1                  3            8                143                  10
#> 2                 13           28                139                 140
#> 3                  7           17                139                 140
#> 4                 11           12                131                  20
#> 5                 16            0                 65                 144
#> 6                 16            0                138                   5
#> 7                  7           23                139                 141
#> 8                  1            0                100                 111
#> 9                 12            0                145                  19
#> 10                 4            9                139                 140
#> 11                 8           21                144                  10
#> 12                 7            0                 47                 131
#> 13                34           32                135                   9
#>    short_support_length pre_farthest post_farthest
#> 1                    10          205            10
#> 2                     9          242           282
#> 3                    10          338           187
#> 4                    20          211            20
#> 5                    65           65           258
#> 6                     5          190             5
#> 7                    10          268           383
#> 8                    72          114           111
#> 9                    19          158            19
#> 10                   10          205           155
#> 11                   10          393            10
#> 12                   47           47           243
#> 13                    9          301             9
#>    low_quality_base_rate_under_q18 low_quality_pre low_quality_post
#> 1                      0.015555556     0.011111111      0.000000000
#> 2                      0.006428571     0.010714286      0.003571429
#> 3                      0.009803922     0.005882353      0.005882353
#> 4                      0.010769231     0.015384615      0.000000000
#> 5                      0.003725490     0.000000000      0.000000000
#> 6                      0.009583333     0.018750000      0.012500000
#> 7                      0.005217391     0.000000000      0.004347826
#> 8                      0.091111111     0.000000000      0.166666667
#> 9                      0.009824561     0.021052632      0.000000000
#> 10                     0.004444444     0.000000000      0.000000000
#> 11                     0.021449275     0.004347826      0.000000000
#> 12                     0.020000000     0.009090909      0.027272727
#> 13                     0.006862745     0.000000000      0.011764706
#>    distant_homology_rate soft_clipped_rate prob_filter_1 prob_filter_3_pre
#> 1             0.00000000         0.3333333  5.080526e-05      1.173347e-03
#> 2             0.00000000         0.4642857  6.775881e-22      5.890075e-01
#> 3             0.00000000         0.4117647  1.362376e-19      4.838936e-01
#> 4             0.07692308         0.8461538  3.198724e-11      1.943905e-06
#> 5             1.00000000         0.4705882  9.553664e-03      1.214764e-04
#> 6             0.00000000         1.0000000  3.370855e-15      7.600607e-06
#> 7             0.00000000         0.3043478  2.439900e-13      1.000000e+00
#> 8             0.33333333         0.1666667  1.438974e-04      5.013675e-01
#> 9             0.00000000         0.6315789  2.128166e-08      2.319433e-12
#> 10            0.00000000         0.4444444  5.263540e-10      6.122053e-01
#> 11            0.00000000         0.3478261  1.667818e-18      7.358585e-10
#> 12            0.27272727         0.6363636  9.045576e-03      2.661244e-03
#> 13            0.23529412         1.0000000  2.853725e-24      2.396907e-12
#>    prob_filter_3_post filter_1_mutation_intra_hairpin_loop
#> 1        2.136914e-05                                FALSE
#> 2        8.042059e-01                                 TRUE
#> 3        6.816295e-01                                 TRUE
#> 4        8.192000e-10                                 TRUE
#> 5        1.042664e-04                                FALSE
#> 6        3.552714e-15                                 TRUE
#> 7        1.000000e+00                                 TRUE
#> 8        1.562500e-02                                FALSE
#> 9        3.876247e-12                                 TRUE
#> 10       7.240429e-01                                 TRUE
#> 11       3.030645e-20                                 TRUE
#> 12       3.880853e-03                                FALSE
#> 13       2.295862e-28                                 TRUE
#>    filter_2_hairpin_structure filter_3_microhomology_induced_mutation
#> 1                        TRUE                                   FALSE
#> 2                        TRUE                                   FALSE
#> 3                        TRUE                                   FALSE
#> 4                        TRUE                                    TRUE
#> 5                       FALSE                                   FALSE
#> 6                       FALSE                                    TRUE
#> 7                        TRUE                                   FALSE
#> 8                       FALSE                                   FALSE
#> 9                       FALSE                                    TRUE
#> 10                       TRUE                                   FALSE
#> 11                       TRUE                                    TRUE
#> 12                      FALSE                                   FALSE
#> 13                       TRUE                                    TRUE
#>    filter_4_highly_homologous_region filter_5_soft_clipped_reads
#> 1                              FALSE                       FALSE
#> 2                              FALSE                       FALSE
#> 3                              FALSE                       FALSE
#> 4                              FALSE                        TRUE
#> 5                               TRUE                       FALSE
#> 6                              FALSE                        TRUE
#> 7                              FALSE                       FALSE
#> 8                              FALSE                       FALSE
#> 9                              FALSE                        TRUE
#> 10                             FALSE                       FALSE
#> 11                             FALSE                       FALSE
#> 12                              TRUE                        TRUE
#> 13                              TRUE                        TRUE
#>    filter_6_simple_repeat filter_7_mutation_at_homopolymer filter_8_low_quality
#> 1                   FALSE                            FALSE                FALSE
#> 2                   FALSE                            FALSE                FALSE
#> 3                   FALSE                            FALSE                FALSE
#> 4                   FALSE                            FALSE                FALSE
#> 5                   FALSE                            FALSE                FALSE
#> 6                   FALSE                            FALSE                FALSE
#> 7                   FALSE                            FALSE                FALSE
#> 8                    TRUE                            FALSE                 TRUE
#> 9                   FALSE                            FALSE                FALSE
#> 10                  FALSE                            FALSE                FALSE
#> 11                  FALSE                            FALSE                FALSE
#> 12                  FALSE                            FALSE                FALSE
#> 13                  FALSE                            FALSE                FALSE
#>        msec_filter_123    msec_filter_1234     msec_filter_all
#> 1  Artifact suspicious Artifact suspicious Artifact suspicious
#> 2  Artifact suspicious Artifact suspicious Artifact suspicious
#> 3  Artifact suspicious Artifact suspicious Artifact suspicious
#> 4  Artifact suspicious Artifact suspicious Artifact suspicious
#> 5                      Artifact suspicious Artifact suspicious
#> 6  Artifact suspicious Artifact suspicious Artifact suspicious
#> 7  Artifact suspicious Artifact suspicious Artifact suspicious
#> 8                                          Artifact suspicious
#> 9  Artifact suspicious Artifact suspicious Artifact suspicious
#> 10 Artifact suspicious Artifact suspicious Artifact suspicious
#> 11 Artifact suspicious Artifact suspicious Artifact suspicious
#> 12                     Artifact suspicious Artifact suspicious
#> 13 Artifact suspicious Artifact suspicious Artifact suspicious
#>                                                                       comment
#> 1                                                                            
#> 2                                                                            
#> 3                                                                            
#> 4                                                                            
#> 5                                                                            
#> 6                                                                            
#> 7                                                                            
#> 8   too repetitive to analyze homology, filter 4: sequence is too repetitive,
#> 9                                                                            
#> 10                                                                           
#> 11                                                                           
#> 12                                                                           
#> 13

Information about the current R session

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Asia/Tokyo
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] MicroSEC_2.1.6
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9                        utf8_1.2.4                       
#>  [3] generics_0.1.3                    SparseArray_1.2.4                
#>  [5] bitops_1.0-7                      lattice_0.22-6                   
#>  [7] stringi_1.8.3                     digest_0.6.35                    
#>  [9] magrittr_2.0.3                    grid_4.3.2                       
#> [11] evaluate_0.23                     fastmap_1.1.1                    
#> [13] Matrix_1.6-5                      jsonlite_1.8.8                   
#> [15] restfulr_0.0.15                   GenomeInfoDb_1.38.8              
#> [17] fansi_1.0.6                       BSgenome_1.70.2                  
#> [19] XML_3.99-0.16.1                   Biostrings_2.70.3                
#> [21] codetools_0.2-20                  jquerylib_0.1.4                  
#> [23] abind_1.4-5                       cli_3.6.2                        
#> [25] rlang_1.1.3                       crayon_1.5.2                     
#> [27] XVector_0.42.0                    Biobase_2.62.0                   
#> [29] withr_3.0.0                       DelayedArray_0.28.0              
#> [31] cachem_1.0.8                      yaml_2.3.8                       
#> [33] S4Arrays_1.2.1                    tools_4.3.2                      
#> [35] parallel_4.3.2                    BiocParallel_1.36.0              
#> [37] dplyr_1.1.4                       GenomeInfoDbData_1.2.11          
#> [39] Rsamtools_2.18.0                  SummarizedExperiment_1.32.0      
#> [41] BiocGenerics_0.48.1               vctrs_0.6.5                      
#> [43] R6_2.5.1                          matrixStats_1.3.0                
#> [45] BiocIO_1.12.0                     stats4_4.3.2                     
#> [47] lifecycle_1.0.4                   BSgenome.Hsapiens.UCSC.hg38_1.4.5
#> [49] rtracklayer_1.62.0                zlibbioc_1.48.2                  
#> [51] stringr_1.5.1                     S4Vectors_0.40.2                 
#> [53] IRanges_2.36.0                    pkgconfig_2.0.3                  
#> [55] pillar_1.9.0                      bslib_0.8.0                      
#> [57] glue_1.7.0                        GenomicAlignments_1.38.2         
#> [59] xfun_0.43                         tibble_3.2.1                     
#> [61] GenomicRanges_1.54.1              tidyselect_1.2.1                 
#> [63] MatrixGenerics_1.14.0             rstudioapi_0.16.0                
#> [65] knitr_1.46                        rjson_0.2.21                     
#> [67] htmltools_0.5.8.1                 rmarkdown_2.26                   
#> [69] compiler_4.3.2                    RCurl_1.98-1.14