### Spelling approximate repeated or
common motifs using a suffix tree

####
**Marie-France Sagot**

in Latin'98, Lecture Notes in Computer
Science,

vol. 1380, pages 111-127, Springer Verlag, 1998

We present in this paper two algorithms. The first one extracts
repeated motifs from a sequence defined over an alphabet
*Sigma*. For instance, *Sigma* may be equal to {A, C, G, T}
and the sequence represent an encoding of a DNA macromolecule. The
motifs searched correspond to words over the same alphabet which occur
a minimum number *q* of times in the sequence with at most
*e* mismatches each time (*q* is called the quorum
constraint). The second algorithm extracts common motifs from a set
of *N >= 2* sequences. In this case, the motifs must occur, again
with at most *e* mismatches, in *1 <= q <= N* distinct
sequences of the set. In both cases, the words representing the
motifs may never be present exactly in the sequences. We therefore
speak of the motifs, repeated in a sequence or common to a set of
them, as being "external" objects and denote them by the expression
"valid models" if they verify the quorum constraint *q*. The
approach we introduce here for finding all valid models corresponding
to either repeated or common motifs starts by building a suffix tree
of the sequence(s) and then, after some further preprocessing, uses
this tree to simply "spell" the models. Assuming an alphabet of fixed
size, the total time needed is *O(n N^2 V(e,k))* using *O(n N^2
/ w)* space, where *n* is the (average) length of the
sequence(s), *k* is the length of the models sought or is the
length of the longest possible valid models, *w* is the size of a
word machine and *V(e,k)* is the number of words of length
*k* at a Hamming distance at most *e* from another
*k*-length word. *V(e,k)* may be majored by *k^e
|Sigma|^e*. This improves on an algorithm by Waterman (Waterman,
1984). It is also a better time bound than our previous approach
(Sagot, 1995) for the common motifs problem whenever *N < k
|Sigma|*, and a better space bound when *N/w < k*. It is a
better time and space bound in absolute for the repeated motifs
problem. The complexities obtained in this second case are *O(n V(e,
k))* and *O(n)* respectively. Finally, we suggest how to
extend these algorithms to deal with gaps.

Paper in postscript format

Back to the *Publications* page