Cassis: Detection of genomic rearrangement breakpoints


Comparing Cassis with other methods for detecting breakpoints

     Cassis was used to process two data sets obtained from Ensembl for the genomes of Homo sapiens and Mus musculus: a list of one-to-one orthologous genes and a list of orthologous synteny blocks. Cassis was also used to process a list of locally collinear blocks (LCBs) produced by Mauve for the same two genomes.

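     To facilitate the discussion, we adopted the following names for the different data sets:

     G: the one-to-one orthologous genes obtained from Ensembl;
     B: the orthologous synteny blocks obtained from Ensembl (Compara);
     M: the LCBs produced by Mauve.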

     Table 1 shows the minimum, maximum, mean and median breakpoint lengths obtained by Cassis, before and after the segmentation step, while processing the sets G, B and M.

Table 1: Minimum, maximum, mean and median breakpoint length before and after the segmentation obtained for each one of the different testing sets.
All lengths are in bp; "Before" and "After" refer to the segmentation step.

| Description | N | Before: Min | Before: Max | Before: Mean | Before: Median | After: Min | After: Max | After: Mean | After: Median |
|---|---|---|---|---|---|---|---|---|---|
| G - Genes | 369 | 1 | 32,752,838 | 771,401 | 244,568 | 21 | 5,133,352 | 201,251 | 52,986 |
| B - Synteny blocks | 234 | 1 | 15,172,525 | 463,422 | 114,679 | 21 | 4,238,335 | 163,619 | 65,865 |
| M - Mauve's LCBs | 649 | 2 | 29,940,399 | 311,898 | 59,543 | 6 | 11,577,359 | 177,551 | 51,471 |
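
     As an illustration of how summary rows like those in Table 1 can be derived, the following minimal sketch computes the minimum, maximum, mean and median of a list of breakpoint lengths. The function name `summarize_lengths` and the sample values are hypothetical and are not part of Cassis or of the data sets above.

```python
from statistics import mean, median


def summarize_lengths(lengths):
    """Return (min, max, mean, median) for a list of breakpoint lengths in bp."""
    if not lengths:
        raise ValueError("need at least one breakpoint length")
    return min(lengths), max(lengths), mean(lengths), median(lengths)


# Hypothetical breakpoint lengths (bp), for illustration only.
example_lengths = [21, 1500, 52000, 310000]
print(summarize_lengths(example_lengths))
```

     A few very long breakpoints pull the mean well above the median, which is why both statistics are reported.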

     If we look at the breakpoint lengths before the segmentation step, set G has a larger mean length than the other two sets. This is explained by the fact that, when Cassis uses a list of genes to define the breakpoints, it anchors their boundaries on the orthologous genes. In sets B and M, the limits of the breakpoints are instead derived from sequence similarity, which is not anchored on the extremities of the genes.

     As a consequence, Table 1 shows that, before the segmentation, the breakpoints of set G are substantially longer than those of the other two sets. After the segmentation step, however, the mean and especially the median breakpoint lengths in set G are much closer to those obtained in sets B and M.

     Since sets B and M have smaller breakpoints than set G, it was to be expected that the number of breakpoints which Cassis cannot narrow would be larger in these sets than in set G (see Table 2).

Table 2: Minimum, maximum, mean and median length of the breakpoints which had their length increased after the segmentation on each one of the testing sets.
All lengths are in bp; "Before" and "After" refer to the segmentation step.

| Description | N | Before: Min | Before: Max | Before: Mean | Before: Median | After: Min | After: Max | After: Mean | After: Median |
|---|---|---|---|---|---|---|---|---|---|
| G - Genes | 25 | 1 | 5,115,573 | 282,827 | 34,067 | 1,539 | 5,133,352 | 311,471 | 51,645 |
| B - Synteny blocks | 83 | 1 | 543,101 | 70,602 | 24,923 | 21 | 543,249 | 77,229 | 33,298 |
| M - Mauve's LCBs | 118 | 49 | 1,280,914 | 49,899 | 11,688 | 1,323 | 1,298,477 | 100,421 | 78,253 |

     Out of the 83 breakpoints from Compara blocks (set B) which were not reduced by Cassis, only 28 had their length extended by more than 1 kbp by the Cassis refinement step. Thus, Cassis substantially worsens the resolution for only a minority of the Compara breakpoints (see the histogram of the breakpoint size differences).

     Concerning the Mauve breakpoints, 113 of the 118 breakpoints had their length extended by more than 1 kbp (see the histogram of the breakpoint size differences). However, almost half of them (52) are defined on at least one side by an LCB shorter than 10 kbp, suggesting that these breakpoints may not be reliable.
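
     The counts discussed above (breakpoints extended by more than 1 kbp) can be obtained with a simple filter over the lengths before and after segmentation. The sketch below assumes that each breakpoint is available as a (before, after) pair of lengths in bp; the function name `count_extended` and the sample pairs are hypothetical.

```python
def count_extended(pairs, threshold_bp=1000):
    """Count breakpoints whose length grew by more than threshold_bp.

    pairs: iterable of (length_before_bp, length_after_bp) tuples.
    """
    return sum(1 for before, after in pairs if after - before > threshold_bp)


# Hypothetical (before, after) lengths in bp, for illustration only.
example_pairs = [(100, 150), (5000, 7500), (20000, 20900), (300, 4300)]
print(count_extended(example_pairs))
```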

     On the other hand, since the synteny blocks (or LCBs) are built from sequence similarity, it was also to be expected that Cassis would be unable to refine a large number of the breakpoints in sets B and M. However, Table 3 shows that most of the breakpoints in these sets were refined by Cassis: 124 of 234 in set B and 464 of 649 in set M.

Table 3: Minimum, maximum, mean and median length of the breakpoints which had their length decreased after the segmentation on each one of the testing sets.
All lengths are in bp and length reductions in %; "Before" and "After" refer to the segmentation step.

| Description | N | Before: Min | Before: Max | Before: Mean | Before: Median | After: Min | After: Max | After: Mean | After: Median | Reduction [%]: Min | Reduction [%]: Max | Reduction [%]: Mean | Reduction [%]: Median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G - Genes | 340 | 303 | 32,752,838 | 811,739 | 256,577 | 21 | 4,860,908 | 195,085 | 54,883 | 0.31 | 99.99 | 62.26 | 71.20 |
| B - Synteny blocks | 124 | 848 | 15,172,525 | 792,470 | 251,879 | 174 | 4,238,335 | 223,386 | 73,567 | 0.25 | 99.86 | 57.30 | 61.32 |
| M - Mauve's LCBs | 464 | 2,143 | 29,940,399 | 416,414 | 98,389 | 6 | 11,577,359 | 217,385 | 48,285 | 0.01 | 99.98 | 41.22 | 35.02 |
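
     The length reduction columns of Table 3 are not defined explicitly on this page. A plausible reading, which we state here as an assumption, is that the reduction is computed per breakpoint as the relative decrease in length. The sketch below implements that reading; the function name `length_reduction_percent` is hypothetical.

```python
def length_reduction_percent(before_bp, after_bp):
    """Relative decrease in breakpoint length, in percent.

    Assumed definition: 100 * (before - after) / before.
    """
    if before_bp <= 0:
        raise ValueError("length before segmentation must be positive")
    return 100.0 * (before_bp - after_bp) / before_bp


# Hypothetical breakpoint narrowed from 1,000 bp to 250 bp.
print(length_reduction_percent(1000, 250))
```

     Under this assumption, the mean reduction is an average of per-breakpoint percentages and need not match the value computed from the mean lengths. For set G, the mean lengths alone would give a reduction of about 76%, whereas the reported mean reduction is 62.26%.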

     By definition, the blocks produced by Mauve do not overlap each other. This is a useful property for a list of synteny blocks used as input for Cassis. By contrast, of the 345 synteny blocks extracted from the Compara database, 88 overlap in the mouse genome. Based on the 345 blocks, Cassis identified 292 breakpoints, but 58 of them could not be processed by the segmentation step because their lengths are smaller than the allowed limit (evidence of large overlaps).
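
     Overlaps of this kind can be checked before a list of blocks is given to Cassis. The following sketch counts the blocks that overlap at least one other block on the same chromosome of one genome. It assumes that blocks are given as (chromosome, start, end) tuples with inclusive coordinates; this layout and the function name `count_overlapping_blocks` are hypothetical and do not reflect the Cassis input format.

```python
from collections import defaultdict


def count_overlapping_blocks(blocks):
    """Count blocks overlapping at least one other block on the same chromosome.

    blocks: iterable of (chrom, start, end) with inclusive coordinates.
    """
    by_chrom = defaultdict(list)
    for idx, (chrom, start, end) in enumerate(blocks):
        by_chrom[chrom].append((start, end, idx))

    overlapping = set()
    for intervals in by_chrom.values():
        intervals.sort()  # sweep blocks by increasing start position
        max_end = None    # furthest end among the blocks seen so far
        max_idx = None    # index of the block reaching that end
        for start, end, idx in intervals:
            if max_end is not None and start <= max_end:
                # The current block starts inside an earlier block.
                overlapping.add(idx)
                overlapping.add(max_idx)
            if max_end is None or end > max_end:
                max_end, max_idx = end, idx
    return len(overlapping)


# Hypothetical blocks, for illustration only.
example_blocks = [
    ("chr1", 1, 100),
    ("chr1", 50, 150),
    ("chr1", 200, 300),
    ("chr2", 1, 100),
    ("chr2", 100, 120),
]
print(count_overlapping_blocks(example_blocks))
```

     Sorting by start position and tracking the furthest end seen so far avoids comparing every pair of blocks.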


4 - Download