A double combinatorial approach to discovering patterns in biological sequences


Marie-France Sagot and Alain Viari
in Combinatorial Pattern Matching'96, Laguna Beach, California, USA
Lecture Notes in Computer Science, vol. 1075, pages 186-208, Springer Verlag, 1996

We present in this paper an algorithm for finding degenerated common features by multiple comparison of a set of biological sequences (nucleic acids or proteins). The features that are of interest to us are words in the sequences. The algorithm uses the concept of a model we introduced earlier for locating these features. A model can be seen as a generalization of a consensus pattern as defined by Waterman (Waterman, 1984). It is an object against which the words in the sequences are compared and which serves as an identifier for the groups of similar ones. The algorithm given here innovates in relation to our previous work in that the models are defined over what we call a weighted combinatorial cover. This is a collection of sets among all possible subsets of the alphabet Sigma of nucleotides or amino acids, including the wild card {Sigma}, with a weight attached to each of these sets indicating the number of times it may appear in a model. In this way, we explore both the space of models and that of alphabets. The words that are related to a model defined over such a combinatorial cover, and thus considered to be similar, are then the ones that either belong to the model or present at most a certain number of errors with a nearest element of it. We use two algorithmic ideas that allow us to deal with such double combinatorics, one concerns a left-to-right minimality of the sets composing a model, the other involves making a sketch of the solution space before exploring it in detail.

key words: multiple comparison, weighted combinatorial cover, wild card, model, degenerated feature, left-to-right minimality of sets, sketch of solution space, DNA, protein

Paper in postscript format
Back to the Publications page