Spelling approximate repeated or common motifs using a suffix tree


Marie-France Sagot
in Latin'98, Lecture Notes in Computer Science,
vol. 1380, pages 111-127, Springer Verlag, 1998

We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet Sigma. For instance, Sigma may be equal to {A, C, G, T} and the sequence represent an encoding of a DNA macromolecule. The motifs searched correspond to words over the same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time (q is called the quorum constraint). The second algorithm extracts common motifs from a set of N >= 2 sequences. In this case, the motifs must occur, again with at most e mismatches, in 1 <= q <= N distinct sequences of the set. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore speak of the motifs, repeated in a sequence or common to a set of them, as being "external" objects and denote them by the expression "valid models" if they verify the quorum constraint q. The approach we introduce here for finding all valid models corresponding to either repeated or common motifs starts by building a suffix tree of the sequence(s) and then, after some further preprocessing, uses this tree to simply "spell" the models. Assuming an alphabet of fixed size, the total time needed is O(n N^2 V(e,k)) using O(n N^2 / w) space, where n is the (average) length of the sequence(s), k is the length of the models sought or is the length of the longest possible valid models, w is the size of a word machine and V(e,k) is the number of words of length k at a Hamming distance at most e from another k-length word. V(e,k) may be majored by k^e |Sigma|^e. This improves on an algorithm by Waterman (Waterman, 1984). It is also a better time bound than our previous approach (Sagot, 1995) for the common motifs problem whenever N < k |Sigma|, and a better space bound when N/w < k. It is a better time and space bound in absolute for the repeated motifs problem. The complexities obtained in this second case are O(n V(e, k)) and O(n) respectively. Finally, we suggest how to extend these algorithms to deal with gaps.

Paper in postscript format
Back to the Publications page