### A distance-based block searching
algorithm

####
**Marie-France Sagot, Alain Viari and Henri Soldano**

in Third
International Symposium on Intelligent Systems for Molecular Biology,

Cambridge, United Kingdom, pages 322-331, AAAI Press,
1995

We present in this paper an algorithm for the multiple comparison of a
set of protein sequences. Our approach is that of peptide matching and
consists in looking for all the words that occur approximatively in at
least *q* of the sequences in the set, where *q* is a
parameter. Words are compared by using a reference object called a
model, that is itself a word over the alphabet of the amino acids, and
the comparison between a model and a word is based on *w*-length
words instead of single symbols. This idea is similar to the one used
in the Blast program in the case of pairwise comparisons. Two
*w*-length words are considered to be related if an alignment
without gaps of the two using a similarity matrix has a score greater
than a certain threshold value *t*. In our case, we say that a
*k*-length word *u* is an occurrence of a model *m* of
the same length if every *w*-length subword of *u* is
related to the corresponding subword of *m* in the sense given
above. If a model *m* has occurrences in at least *q* of the
sequences of the set, *m* is said to occur in the set. In
percentage terms, the value of *q* may correspond to something as
small as 5% of the sequences (search for recurrent words in a set of
non homologuous proteins) or as high as 70-100% (establishment of a
list of all similar words as a first step in a multiple alignment
program). The algorithm presented here is an efficient and exact way
of looking for all the models, of a fixed length *k* or of the
greatest possible length *k_max*, that occur in a set of
sequences. It can work with any kind of scoring matrix and an
extension of the algorithm allows for the introduction of gaps between
a model and its occurrences.

Paper in postscript format

Back to the *Publications* page