Cassis: Detection of genomic rearrangement breakpoints


Using Cassis with LCBs produced by Mauve

1 - Input data

     Mauve can work with multiple genomes, whereas Cassis works with a single pair of genomes. Thus, to perform a comparative study, we used Mauve to obtain the list of locally collinear blocks (LCBs) shared by the genomes of Homo sapiens and Mus musculus.

     To do this, we adopted the same parameters used by Mauve's authors in the study of the human, mouse and rat genomes.

     We produced two FASTA files (one for the human genome and the other for the mouse genome) containing the sequences of all chromosomes (except chromosome Y) and we executed the following command line:

mauveAligner --output=output.mauve --no-recursion --seed-size=31 --weight=90 Human.fasta Human.sml Mouse.fasta Mouse.sml

     Mauve concatenates the sequences contained in a FASTA file into a single sequence. It therefore analysed two sequences of 3,036,303,846 bp (human genome) and 2,638,992,663 bp (mouse genome). The parameter --seed-size defines the minimum anchor length and the parameter --weight determines the minimum weight for the LCBs. The parameter --no-recursion turns off the full alignment of the LCBs. Finally, the files Human.sml and Mouse.sml correspond to the Sorted Mer Lists which are used by the algorithm (if they do not exist, the program creates them).

     In the output produced by Mauve (file output.mauve), we found the definition of 2022 intervals. From this set of intervals, we identified 674 LCBs which have fragments shared by both genomes. From this set of LCBs, we discarded two LCBs that each span two different chromosomes of the same genome (a consequence of the concatenation of the chromosomes): one LCB contains the end of chromosome 10 and the beginning of chromosome 11, and the other contains the end of chromosome 14 and the beginning of chromosome 15 (both in the human genome). Table 1 summarises the minimum, maximum, mean and median length of the 672 remaining LCBs and of the 2 discarded ones.
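     Because the chromosomes are concatenated, an interval reported by Mauve has to be mapped back to chromosome coordinates before one can tell whether it crosses a chromosome boundary. The following is a minimal sketch of that check, not the procedure actually used here. It assumes positive, 1-based positions in the concatenated sequence and no spacer between chromosomes; the function names and the chromosome lengths in the example are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(chrom_lengths):
    """chrom_lengths: list of (name, length) in FASTA order.

    Returns (names, starts), where starts[i] is the 1-based position of the
    first base of chromosome i in the concatenated sequence.
    """
    names = [name for name, _ in chrom_lengths]
    lengths = [length for _, length in chrom_lengths]
    starts = [1] + [1 + total for total in accumulate(lengths)][:-1]
    return names, starts


def locate(position, names, starts):
    """Map a concatenated position to (chromosome name, 1-based position)."""
    i = bisect_right(starts, position) - 1
    return names[i], position - starts[i] + 1


def spans_boundary(start, end, names, starts):
    """True if the interval [start, end] touches more than one chromosome."""
    return locate(start, names, starts)[0] != locate(end, names, starts)[0]


# Hypothetical toy genome with three "chromosomes" (lengths are made up).
names, starts = build_offsets([("chrA", 100), ("chrB", 50), ("chrC", 30)])
print(locate(101, names, starts))
print(spans_boundary(90, 110, names, starts))
```

     An LCB for which spans_boundary returns True in either genome would be a candidate for discarding, as was done for the two LCBs mentioned above.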

Table 1: LCBs found by Mauve in the Homo sapiens and Mus musculus genomes
                        Homo sapiens [length in bp]                     Mus musculus [length in bp]
Description      N      Min         Max         Mean        Median      Min         Max         Mean        Median
"Good" LCBs      672    98          58,978,356  3,964,687   717,051     98          48,176,114  3,588,420   545,930
Discarded        2      17,532,868  71,148,726  44,340,797  44,340,797  15,457,004  43,737,829  29,597,416  29,597,416

     For comparison, Table 2 shows the minimum, maximum, mean and median length of the synteny blocks extracted from the Compara database of Ensembl.

Table 2: Homo sapiens and Mus musculus synteny blocks extracted from Ensembl's Compara database
                         Homo sapiens [length in bp]                     Mus musculus [length in bp]
Description       N      Min         Max         Mean        Median      Min         Max         Mean        Median
Compara database  345    108,946     58,118,452  7,844,447   3,697,691   102,552     52,624,330  7,077,399   3,172,511

     Mauve produces almost twice as many blocks as Compara (672 versus 345), and they are on average about half as long (mean length of 3,964,687 bp versus 7,844,447 bp in the human genome). Moreover, some blocks from Mauve are very small: 111 blocks (16.5%) are shorter than 5 kbp.
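     The summary statistics in Tables 1 and 2 and the proportions quoted above can be computed along the following lines. This is a sketch only: the helper names are hypothetical, and the only real data used are the two discarded human LCBs, whose lengths are the Min and Max values of Table 1 because N = 2.

```python
from statistics import mean, median


def summarise(lengths):
    """Return N, minimum, maximum, mean and median of a list of lengths (bp)."""
    return {
        "n": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


def fraction_below(lengths, threshold):
    """Fraction of blocks strictly shorter than threshold (bp)."""
    return sum(1 for x in lengths if x < threshold) / len(lengths)


# The two discarded human LCBs from Table 1.
discarded_human = [17_532_868, 71_148_726]
print(summarise(discarded_human))

# Ratios quoted in the text, recomputed from the counts.
print(round(672 / 345, 2))        # Mauve blocks per Compara block
print(round(100 * 111 / 672, 1))  # percentage of Mauve blocks under 5 kbp
```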


2 - Results

     Cassis was used to process the list of 672 LCBs defined by Mauve. Our package identified a total of 649 breakpoints and all of them were processed by the segmentation step. Table 3 shows the minimum, maximum, mean and median breakpoint length before and after the segmentation process. It also shows the number of breakpoints which received status 1 or 0 (respectively, passed or failed the statistical test that evaluates the significance of the segmentation).

Table 3: Minimum, maximum, mean and median breakpoint length before and after the segmentation
                        Before segmentation [length in bp]              After segmentation [length in bp]
Description      N      Min         Max         Mean        Median      Min         Max         Mean        Median
All breakpoints  649    2           29,940,399  311,898     59,543      6           11,577,359  177,551     51,471
Status = 1       582    49          29,940,399  342,103     68,791      6           11,577,359  193,671     59,525
Status = 0       67     2           699,084     49,516      12,765      78          697,008     37,527      14,388
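     As a quick consistency check on Table 3, the mean over all breakpoints should equal the count-weighted mean of the two status groups. The sketch below uses only the counts and (rounded) means from the table; the function name is hypothetical.

```python
def pooled_mean(groups):
    """groups: list of (count, mean) pairs; returns the mean of their union."""
    total_n = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_n


# (N, mean length in bp) for status 1 and status 0, taken from Table 3.
before = [(582, 342_103), (67, 49_516)]
after = [(582, 193_671), (67, 37_527)]

print(sum(n for n, _ in before))   # total number of breakpoints
print(round(pooled_mean(before)))  # compare with the "All breakpoints" mean, before
print(round(pooled_mean(after)))   # compare with the "All breakpoints" mean, after
```

     Both pooled values agree with the "All breakpoints" row (311,898 bp and 177,551 bp) after rounding to the nearest bp.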

     Of the 582 breakpoints which were classified with status 1, a total of 118 had their length increased after the segmentation. Table 4 shows the minimum, maximum, mean and median length of these breakpoints, before and after the segmentation process.

Table 4: Minimum, maximum, mean and median length of the 118 breakpoints which had their length increased after the segmentation
                                         Before segmentation [length in bp]              After segmentation [length in bp]
Description                       N      Min         Max         Mean        Median      Min         Max         Mean        Median
Length (after) ≥ Length (before)  118    49          1,280,914   49,899      11,688      1,323       1,298,477   100,421     78,253

     The remaining 464 breakpoints classified with status 1 had their length decreased after the segmentation process. Table 5 shows the minimum, maximum, mean and median length of these breakpoints, before and after the segmentation process, together with the relative length reduction.

Table 5: Minimum, maximum, mean and median length of the 464 breakpoints which had their length decreased after the segmentation
                                         Before segmentation [length in bp]              After segmentation [length in bp]               Length reduction [%]
Description                       N      Min         Max         Mean        Median      Min         Max         Mean        Median      Min         Max         Mean        Median
Length (after) < Length (before)  464    2,143       29,940,399  416,414     98,389      6           11,577,359  217,385     48,285      0.01        99.98       41.22       35.02
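     The split between Tables 4 and 5 and the "Length reduction [%]" column can be illustrated as follows. This is a sketch under an assumption: the reduction is taken to be 100 x (before - after) / before, which is not stated explicitly here. The function names and the example lengths are hypothetical.

```python
def length_reduction_percent(before, after):
    """Relative reduction in breakpoint length, in percent (assumed definition)."""
    return 100.0 * (before - after) / before


def split_by_change(pairs):
    """Split (before, after) pairs following the labels of Tables 4 and 5.

    Returns (not_reduced, reduced): pairs with after >= before, and pairs
    with after < before.
    """
    not_reduced = [(b, a) for b, a in pairs if a >= b]
    reduced = [(b, a) for b, a in pairs if a < b]
    return not_reduced, reduced


# Hypothetical breakpoint lengths in bp (before, after); not real data.
pairs = [(1_000, 400), (500, 750), (20_000, 19_000)]
not_reduced, reduced = split_by_change(pairs)
print(len(not_reduced), len(reduced))
print([length_reduction_percent(b, a) for b, a in reduced])
```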

     Figure 1 shows the histogram of the differences in breakpoint length, before and after the segmentation step, for 579 of the 582 breakpoints which were classified with status 1 (the 3 remaining differences, larger than 2 Mbp, are not shown).

Figure 1: Histogram of the breakpoint size differences between "before" and "after" the refinement step of Cassis applied to Mauve blocks (LCB).
(3 values larger than 2 Mbp have been removed for the purpose of readability)



3 - Download