Spelling approximate repeated or
common motifs using a suffix tree
Marie-France Sagot
in Latin'98, Lecture Notes in Computer
Science,
vol. 1380, pages 111-127, Springer Verlag, 1998
We present in this paper two algorithms. The first one extracts
repeated motifs from a sequence defined over an alphabet
Sigma. For instance, Sigma may be equal to {A, C, G, T}
and the sequence represent an encoding of a DNA macromolecule. The
motifs searched correspond to words over the same alphabet which occur
a minimum number q of times in the sequence with at most
e mismatches each time (q is called the quorum
constraint). The second algorithm extracts common motifs from a set
of N >= 2 sequences. In this case, the motifs must occur, again
with at most e mismatches, in 1 <= q <= N distinct
sequences of the set. In both cases, the words representing the
motifs may never be present exactly in the sequences. We therefore
speak of the motifs, repeated in a sequence or common to a set of
them, as being "external" objects and denote them by the expression
"valid models" if they verify the quorum constraint q. The
approach we introduce here for finding all valid models corresponding
to either repeated or common motifs starts by building a suffix tree
of the sequence(s) and then, after some further preprocessing, uses
this tree to simply "spell" the models. Assuming an alphabet of fixed
size, the total time needed is O(n N^2 V(e,k)) using O(n N^2
/ w) space, where n is the (average) length of the
sequence(s), k is the length of the models sought or is the
length of the longest possible valid models, w is the size of a
word machine and V(e,k) is the number of words of length
k at a Hamming distance at most e from another
k-length word. V(e,k) may be majored by k^e
|Sigma|^e. This improves on an algorithm by Waterman (Waterman,
1984). It is also a better time bound than our previous approach
(Sagot, 1995) for the common motifs problem whenever N < k
|Sigma|, and a better space bound when N/w < k. It is a
better time and space bound in absolute for the repeated motifs
problem. The complexities obtained in this second case are O(n V(e,
k)) and O(n) respectively. Finally, we suggest how to
extend these algorithms to deal with gaps.
Paper in postscript format
Back to the Publications page