Markov chains and hidden Markov chains in genome analysis

Bernard Prum
Laboratoire de Statistique et Génome
CNRS - INRA
Université d'Evry
91000 Évry
FRANCE
E-Mail: prum@genopole.cnrs.fr

Biological sequences consist essentially of DNA chains (the chromosomes, which transmit information from one generation to the next) and protein chains, proteins being the essential component of all phenomena in living cells. The former are written in a 4-letter alphabet {a, c, g, t}, while the latter use a 20-letter alphabet, the amino acids.

Every day, more than 20 million newly deciphered letters arrive in the data banks, and a challenge for statisticians is to help biologists find the relevant information in this huge amount of data.

A first topic we are interested in is the search for words whose frequency is too high to be plausibly the result of pure randomness. For example, bacterial genomes contain a signal (called CHI) which takes part in their defenses and must therefore be sufficiently frequent to be efficient. Hence CHI's role is unrelated to the usual genetic code, but it has another importance for the organism.

To search for these exceptional words, we look for a model that is both satisfactory for the biologist and tractable for the mathematician. One has to take into account the frequencies of the letters, of the 2-letter words, 3-letter words, etc., and hence to work conditionally on the sufficient statistics of a Markov chain model. In these models, for each word W, using a conditional approach, we compute the expectation and the variance of the number of occurrences and give results about its (asymptotic) law.
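As an illustration of the conditional approach, here is a minimal sketch of a plug-in estimate of the expected count of a word under an order-1 Markov model, built only from the observed counts of letters and 2-letter words: N(w1w2) N(w2w3) ... N(w(h-1)wh) / (N(w2) ... N(w(h-1))). The function names `count_words` and `expected_count_m1` and the toy sequence are assumptions made for this example. The variance and the asymptotic law mentioned above are not covered here.

```python
from collections import Counter


def count_words(seq, k):
    """Count all (overlapping) occurrences of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count_m1(seq, word):
    """Plug-in estimate of the expected count of `word` under an order-1
    Markov model, using only the letter and 2-letter word counts of seq."""
    if len(word) < 2:
        raise ValueError("word must contain at least 2 letters")
    pairs = count_words(seq, 2)
    singles = count_words(seq, 1)
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= pairs[word[i:i + 2]]   # counts of successive 2-letter words
    denominator = 1.0
    for letter in word[1:-1]:
        denominator *= singles[letter]      # counts of the inner letters
    return numerator / denominator if denominator else 0.0


# Toy sequence (hypothetical data, for illustration only).
seq = "acgtacgtacgt"
observed = count_words(seq, 3)["acg"]
expected = expected_count_m1(seq, "acg")
```

A word would be flagged as exceptional when its observed count lies far from this expectation relative to the corresponding variance. For a 2-letter word the estimate equals the observed count, since the order-1 model reproduces the 2-letter word counts exactly.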

A very relevant criticism of this model is that it assumes the homogeneity of the sequence, a hypothesis that biologists accept less and less as they deal with larger and larger sequences. One way of answering this criticism is to allow the simultaneous existence of more than one Markovian model, and this led us to work with Hidden Markov Models (HMM). These models quickly turn out to be statistical tools permitting much more than the separate analysis of regions chosen to be homogeneous. The fact that, at the beginning of the algorithm, we need fix neither the Markovian transitions in each state nor the positions of the various states implies that fitting an HMM to a sequence produces its segmentation, by assigning a common characteristic to all the segments related to the same state.
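To make the segmentation idea concrete, here is a minimal sketch of decoding the most probable state path with the Viterbi algorithm. It rests on two simplifying assumptions that differ from the setting described above. First, the parameters are fixed by hand instead of being estimated from the sequence. Second, each state emits independent letters instead of following its own Markovian transitions. The state names `AT_rich` and `GC_rich`, all probabilities, and the toy sequence are hypothetical.

```python
import math


def viterbi(seq, states, log_start, log_trans, log_emit):
    """Return the most probable hidden state path for seq (log-space Viterbi)."""
    if not seq:
        return []
    # score[s]: best log-probability of any state path ending in state s
    score = {s: log_start[s] + log_emit[s][seq[0]] for s in states}
    back = []  # back[t][s]: best predecessor of state s at position t + 1
    for letter in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: score[r] + log_trans[r][s])
            pointers[s] = best_prev
            new_score[s] = (score[best_prev] + log_trans[best_prev][s]
                            + log_emit[s][letter])
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def logs(probs):
    """Convert a dict of probabilities into a dict of log-probabilities."""
    return {key: math.log(value) for key, value in probs.items()}


# Hypothetical two-state model with hand-picked parameters.
states = ["AT_rich", "GC_rich"]
log_start = logs({"AT_rich": 0.5, "GC_rich": 0.5})
log_trans = {"AT_rich": logs({"AT_rich": 0.9, "GC_rich": 0.1}),
             "GC_rich": logs({"AT_rich": 0.1, "GC_rich": 0.9})}
log_emit = {"AT_rich": logs({"a": 0.4, "t": 0.4, "c": 0.1, "g": 0.1}),
            "GC_rich": logs({"a": 0.1, "t": 0.1, "c": 0.4, "g": 0.4})}

path = viterbi("aattaattggccggcc", states, log_start, log_trans, log_emit)
```

The returned path assigns one state to each position, so consecutive positions sharing a state form the segments. In the unsupervised setting, the parameters would first be estimated from the sequence itself, for instance with an EM-type procedure, before decoding.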

An important drawback of the "classical" HMM approach is that it implies that the areas corresponding to the same state must have lengths distributed according to an exponential law (geometric, since positions are discrete), and this is not at all what is observed in real genomes. Semi-Markovian models solve this difficulty: they allow any law for the lengths of the various areas.
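The constraint on segment lengths can be written out in a short derivation. The notation is introduced here for illustration and is an assumption: p is the probability that the hidden chain stays in a given state s from one position to the next, L is the length of a segment in state s, and d_s is the length law of a semi-Markovian model.

```latex
% Classical HMM: a segment of length k requires k-1 consecutive "stay" steps
% followed by one "leave" step, so the length is geometric.
P(L = k) = p^{\,k-1}(1 - p), \qquad k = 1, 2, \dots
\qquad\Longrightarrow\qquad
E[L] = \frac{1}{1 - p}.

% Semi-Markovian model: the length law of state s is a free choice.
P(L = k) = d_s(k), \qquad \sum_{k \ge 1} d_s(k) = 1 .
```

In the classical case the most probable length is always k = 1 and the probabilities decrease monotonically, whatever the value of p. Any state whose segments have a typical, non-minimal length therefore cannot be described this way.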

Combined with the use of characteristics of the biological context, these methods should significantly improve the performance of predictions of homogeneous regions. We will present a few applications, such as the search for "horizontal transfers" and "annotation".

For some 10 years, it has been accepted that besides vertical transmission (from parents to offspring), a phenomenon of horizontal transmission of genetic information plays an important role in the evolution of life. For example, some viruses may copy a part of the genome of one individual, transport it, and incorporate it into the genome of another individual, possibly of another species. The potential benefit of this phenomenon is obvious: through such transposons, a new beneficial gene can spread to a great number of species. As it is well known that each species leads to a different fit of a Markov model (frequencies of words change from one species to another), modelling with HMMs is perfectly suited to searching for transposons.

The aim of "annotation" is to contribute to the automatic search, in DNA sequences, for coding parts and, within these, for exons and introns (in eukaryotes, essentially every species except bacteria, genes contain two kinds of regions: the exon message is ultimately translated into proteins, while introns disappear during the maturation process). HMMs are also a successful approach to this problem.
