A double combinatorial approach to
discovering patterns in biological
sequences
Marie-France Sagot and Alain Viari
in Combinatorial Pattern
Matching'96, Laguna Beach, California, USA
Lecture Notes in
Computer Science, vol. 1075, pages 186-208, Springer Verlag,
1996
We present in this paper an algorithm for finding degenerated common
features by multiple comparison of a set of biological sequences
(nucleic acids or proteins). The features that are of interest to us
are words in the sequences. The algorithm uses the concept of a model
we introduced earlier for locating these features. A model can be
seen as a generalization of a consensus pattern as defined by Waterman
(Waterman, 1984). It is an object against which the words in the
sequences are compared and which serves as an identifier for the
groups of similar ones. The algorithm given here innovates in relation
to our previous work in that the models are defined over what we call
a weighted combinatorial cover. This is a collection of sets among all
possible subsets of the alphabet Sigma of nucleotides or amino
acids, including the wild card {Sigma}, with a weight attached
to each of these sets indicating the number of times it may appear in
a model. In this way, we explore both the space of models and that of
alphabets. The words that are related to a model defined over such a
combinatorial cover, and thus considered to be similar, are then the
ones that either belong to the model or present at most a certain
number of errors with a nearest element of it. We use two algorithmic
ideas that allow us to deal with such double combinatorics, one
concerns a left-to-right minimality of the sets composing a model, the
other involves making a sketch of the solution space before exploring
it in detail.
key words: multiple comparison, weighted combinatorial cover,
wild card, model, degenerated feature,
left-to-right minimality of sets, sketch of
solution space, DNA, protein
Paper in postscript format
Back to the Publications page