Identifying satellites and
periodic repanditions in biological sequences
Gene Myers and Marie-France Sagot
Journal of Computational Biology, 5(3):539-554, 1998
We present in this paper an algorithm for identifying satellites in
DNA sequences. Satellites (simple, micro, or mini) are repeats in
number between 30 and as many as 1,000,000 whose lengths vary between
2 and hundreds of base pairs and that appear, with some mutations, in
tandem along the sequence. We concentrate here on short to
moderately long (up to 30-40 base pairs) approximate tandem repeats
where copies may differ up to epsilon = 15-20% from a consensus
model of the repeating unit (implying individual units may vary by 2
epsilon from each other). The algorithm is composed of two
parts. The first one consists of a filter that basically eliminates
all regions whose probability of containing a satellite is less than
one in 10^4 when epsilon = 10%. The second part realizes an
exhaustive exploration of the space of all possible models for the
repeating units present in the sequence. It therefore has the
advantage over previous work of being able to report a consensus
model, say m, of the repeated unit as well as the span of the
satellite. The first phase was designed for efficiency and takes only
O(n) time where n is the length of the sequence. The
second phase was designed for sensitivity and takes time O(n *
N(e,k)) in the worst case where k is the length of the
repeating unit m, e = int(epsilon * k) is the number of
differences allowed between each repeat unit and the model m,
and N(e,k) is the maximum number of words that are not more
than e differences from another word of length k. That
is, N(e,k) is the maximum size of an e-neighborhood of a
string of length k. Experiments reveal the second phase to be
considerably faster in practice than the worst-case complexity bound
suggests. Finally, the present algorithm is easily adapted to finding
tandem repeats in protein sequences, as well as extended to
identifying mixed direct-inverse tandem repeats.
key words: DNA satellites, tandem repeats, approximate match,
consensus model, direct/inverse repeats
Paper in postscript format
Back to the Publications page