Identifying satellites and periodic repanditions in biological sequences

Gene Myers and Marie-France Sagot
Journal of Computational Biology, 5(3):539-554, 1998

We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30-40 base pairs) approximate tandem repeats where copies may differ up to epsilon = 15-20% from a consensus model of the repeating unit (implying individual units may vary by 2 epsilon from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 10^4 when epsilon = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. It therefore has the advantage over previous work of being able to report a consensus model, say m, of the repeated unit as well as the span of the satellite. The first phase was designed for efficiency and takes only O(n) time where n is the length of the sequence. The second phase was designed for sensitivity and takes time O(n * N(e,k)) in the worst case where k is the length of the repeating unit m, e = int(epsilon * k) is the number of differences allowed between each repeat unit and the model m, and N(e,k) is the maximum number of words that are not more than e differences from another word of length k. That is, N(e,k) is the maximum size of an e-neighborhood of a string of length k. Experiments reveal the second phase to be considerably faster in practice than the worst-case complexity bound suggests. Finally, the present algorithm is easily adapted to finding tandem repeats in protein sequences, as well as extended to identifying mixed direct-inverse tandem repeats.

key words: DNA satellites, tandem repeats, approximate match, consensus model, direct/inverse repeats

Paper in postscript format
Back to the Publications page