Multiple comparison - A peptide
matching approach
Marie-France Sagot, Alain Viari and Henri Soldano
Theoretical Computer Science
180:115-137, 1997
We present in this paper a peptide matching approach to the multiple
comparison of a set of protein sequences. This approach consists in
looking for all the words that are common to q of these sequences,
where q is a parameter.
The comparison between words is done by using as reference an object
called a model. In the case of proteins, a model is a product of
subsets of the alphabet Sigma of the amino acids. These subsets
belong to a cover of Sigma, that is, their union covers all of
Sigma. A word is said to be an instance of a model if it
belongs to the model.
A further flexibility is introduced in the comparison by allowing for
up to e errors in the comparison between a word and a
model. These errors may concern gaps or substitutions not allowed by
the cover. A word is said to be this time an occurrence of a model if
the Levenshtein distance between it and an instance of the model is
inferior or equal to e. This corresponds to what we call a
Set-Levenshtein distance between the occurrences and the model
itself. Two words are said to be similar if there is at least one
model of which both are occurrences. In the special case where e =
0, the occurrences of a model are simply its instances. If a model
M has occurrences in at least q of the sequences of the
set, M is said to occur in the set.
The algorithm presented here is an efficient and exact way of looking
for all the models, of a fixed length k or of the greatest
possible length k_{max}, that occur in a set of sequences. It
is linear in the total length n of the sequences and
proportional to (e+2) * (2e+1) * k^(e+1) * p^(e+1) * g^k where
k << n is a small value in all practical
situations, p is the number of sets in the cover and g
is related to the latter's degree of non transitivity.
Models are closely related to what is called a consensus in the
biocomputing area, and covers are a good way of representing complex
relationships between the amino acids.
key words: multiple comparison, cover, model, Levenshtein
distance, Set-Levenshtein distance
Paper in postscript format
Back to the Publications page