Motif extraction in DNA and protein sequences
Introduction:
A sequence that has remained highly conserved during evolution probably has a function, for instance as a promoter. The converse does not hold: a sequence can be functional without being conserved. Conserved promoter or regulatory sequences may therefore be identified by comparing non-coding sequences within the same organism, or by comparing non-coding sequences of related genes in different organisms. In the same vein, protein binding sites may be found by comparing protein sequences from the same family, or from different families that share a given function (for instance, binding a specific DNA regulatory sequence).

We consider here motif extraction from non-coding DNA sequences belonging to the same organism, and from protein sequences belonging to the same or to different families. We use two types of statistically based algorithms: one that uses an Expectation-Maximization (EM) approach (MEME) and one that uses a Gibbs sampling approach (Motif Sampler).

EM is a two-step iterative procedure for seeking maximum-likelihood parameter estimates for a model of observed data. Here the model is a collection of sites of a given length whose positions are not known in a set of unaligned sequences, one site per sequence. The resulting collection corresponds to a log-likelihood matrix. The algorithm first calculates expected parameter values that describe the data (i.e. the matrix of putative sites), then maximizes the likelihood of obtaining these values (i.e. refines the matrix by maximizing its relative information content). The two steps are repeated until the parameter estimates converge, or until a fixed maximum number of iterations is reached. Two more recent algorithms (due to Cardon et al. and Frech et al.) can also handle variable spacers between parts of a motif.
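The two EM steps described above can be illustrated with a minimal sketch for the one-site-per-sequence case on DNA. This is not the MEME implementation: the function name `em_motif`, the seed word, the 0.7/0.1 seed probabilities, the pseudocount and the fixed background frequencies are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"

def em_motif(seqs, width, seed_word, n_iter=50, pseudo=0.5):
    """Toy EM for one site of known width per sequence (hypothetical helper).

    seqs must contain only A, C, G, T and be at least `width` long.
    Returns (pwm, z): pwm[j][b] is the probability of base b in motif
    column j; z[i][p] is the posterior probability that the site of
    sequence i starts at offset p.
    """
    # Background base frequencies, estimated once and kept fixed (assumption).
    counts = {b: pseudo for b in ALPHABET}
    for s in seqs:
        for c in s:
            counts[c] += 1
    total = sum(counts.values())
    bg = {b: counts[b] / total for b in ALPHABET}

    # Starting matrix derived from a seed word (assumption of this sketch).
    pwm = [{b: (0.7 if b == c else 0.1) for b in ALPHABET} for c in seed_word]

    z = []
    for _ in range(n_iter):
        # E-step: posterior probability of each start offset in each sequence.
        z = []
        for s in seqs:
            scores = [
                sum(math.log(pwm[j][s[p + j]] / bg[s[p + j]]) for j in range(width))
                for p in range(len(s) - width + 1)
            ]
            top = max(scores)
            weights = [math.exp(x - top) for x in scores]
            norm = sum(weights)
            z.append([w / norm for w in weights])

        # M-step: re-estimate the matrix from the expected base counts.
        new = [{b: pseudo for b in ALPHABET} for _ in range(width)]
        for s, zi in zip(seqs, z):
            for p, weight in enumerate(zi):
                for j in range(width):
                    new[j][s[p + j]] += weight
        pwm = [{b: col[b] / sum(col.values()) for b in ALPHABET} for col in new]
    return pwm, z

if __name__ == "__main__":
    demo = ["ACGTTATAATGC", "GGTATAATCCAA", "CATATAATGGTC"]
    pwm, z = em_motif(demo, 6, "TATAAT", n_iter=10)
    # Most probable start offset of the site in each sequence.
    print([max(range(len(zi)), key=zi.__getitem__) for zi in z])
```

Because the result depends on the starting matrix, a practical run would repeat the procedure from several seed words and keep the best-scoring matrix.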
Handling variable spacers between parts of a motif can be tedious, however, unless a priori information about the second element and/or the spacer is available. Poorly chosen initial data sets may cause EM to converge inappropriately: either not at all, or to a local maximum. This problem is partly solved by the Gibbs sampling approach (Lawrence et al.). The maximization step is replaced by one that increases the likelihood of observing the data only with a certain probability. The chance of reaching the maximum increases with the number of iterations. Lawrence et al. state that the algorithm provably converges, although not always to the global maximum.

These methods initially assumed, for computational as well as statistical reasons, that each sequence in the data set contained one, and only one, site whose length was known. The assumption of existence and uniqueness of the site in each sequence is removed in two more recent statistical methods, MEME (due to Bailey and Gribskov) and Motif Sampler (due to Thijs et al.). Although more frequently used for extracting conserved motifs from a set of protein sequences, MEME may also be applied to DNA sequences. The algorithm is basically an EM method with more parameters to estimate, among them the number and length of sites and the number of sequences in which each site is present. Thijs' Motif Sampler, on the other hand, is based on a Gibbs sampling approach and can for now be used only on DNA sequences.

The DNA data used as illustrations consist of two groups of sets of sequences. The first group is composed of two sets of well-established Escherichia coli and Bacillus subtilis sequences containing an experimentally determined transcription start or promoter. In both cases, the sequences are aligned on the start of transcription. The Escherichia coli sequences stop at that point, whereas the Bacillus subtilis sequences contain 20 more bases downstream of the transcription start.
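Returning briefly to the Gibbs sampling approach of Lawrence et al. described above, the replacement of the maximization step by a probabilistic one can be sketched as follows. This is a simplified site sampler for the one-site-per-sequence case, not the Motif Sampler implementation; the function name `gibbs_sites`, the pseudocount, the fixed background and the number of sweeps are assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"

def gibbs_sites(seqs, width, n_sweeps=200, pseudo=0.5, seed=0):
    """Toy Gibbs site sampler (hypothetical helper).

    Needs at least two sequences over A, C, G, T, each at least `width` long.
    Returns one sampled start offset per sequence.
    """
    rng = random.Random(seed)

    # Background base frequencies, kept fixed (assumption of this sketch).
    counts = {b: pseudo for b in ALPHABET}
    for s in seqs:
        for c in s:
            counts[c] += 1
    total = sum(counts.values())
    bg = {b: counts[b] / total for b in ALPHABET}

    # Start from random site positions.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]

    for _ in range(n_sweeps):
        for i, s in enumerate(seqs):
            # Build the motif matrix from the current sites of all OTHER sequences.
            col = [{b: pseudo for b in ALPHABET} for _ in range(width)]
            for k, t in enumerate(seqs):
                if k != i:
                    for j in range(width):
                        col[j][t[pos[k] + j]] += 1
            denom = (len(seqs) - 1) + pseudo * len(ALPHABET)

            # Weight every candidate offset by its motif/background likelihood ratio.
            weights = []
            for p in range(len(s) - width + 1):
                w = 1.0
                for j in range(width):
                    c = s[p + j]
                    w *= (col[j][c] / denom) / bg[c]
                weights.append(w)

            # Sample the new offset instead of taking the best one: this is
            # what lets the sampler leave a local maximum.
            pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

if __name__ == "__main__":
    demo = ["ACGTTATAATGC", "GGTATAATCCAA", "CATATAATGGTC", "TTGCTATAATAG"]
    print(gibbs_sites(demo, 6, n_sweeps=50, seed=1))
```

The output is a random draw, so different seeds may give different positions; in practice several runs are compared.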
The second group comprises two genomic sets of non-coding sequences located between two divergent genes and extracted from the whole genomes of Escherichia coli and Bacillus subtilis. Sequences shorter than 40 bases were eliminated, and only up to 330 nucleotides before the annotated start of translation were initially kept. The first and last 15 bases were then discarded, which eliminates the Shine-Dalgarno sequence as a potential motif.

All the sequences in the first group of sets are supposed to contain the promoter consensus TTGACAx(16-18)TATAAT, where x(16-18) denotes a spacer of 16 to 18 bases between the first and second parts of the promoter sequence. Only some of the sequences in the second group are believed to contain the same motif.

The protein data consist of two sets of sequences. The first contains 3 cytochrome P450 sequences; the second contains the sequences of 30 proteins that have a helix-turn-helix (HTH) motif, which is known to bind DNA. The second set is therefore supposed to contain just one motif common to all sequences, while the first set corresponds to proteins that belong to the same family and share many common structural motifs (at least 15 according to a study by Jean et al.; see the following table). The question in both cases is whether these motifs can be identified from the sequences alone.

First step: Create two directories, one called DNA and the other protein, and download the following files:
Second step: Start with the DNA datasets by applying the two algorithms, MEME and Motif Sampler, both of which can be used through the web.
Third step: Now work with the protein datasets by applying MEME to them.
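As a simple cross-check on the DNA datasets, the promoter consensus TTGACAx(16-18)TATAAT given in the introduction can be scanned for directly. The sketch below counts mismatches against the two hexamers for each allowed spacer length. The function name `scan_promoter` and the mismatch threshold are assumptions made for illustration; real promoters often deviate from the consensus, so a threshold of zero would miss most of them.

```python
MINUS35 = "TTGACA"  # first part of the consensus
MINUS10 = "TATAAT"  # second part of the consensus

def mismatches(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def scan_promoter(seq, max_mismatches=2, spacers=(16, 17, 18)):
    """Return (start, spacer, mismatches) for every near-consensus hit.

    `start` is the 0-based offset of the first hexamer; `max_mismatches`
    is the total tolerated over both hexamers (hypothetical threshold).
    """
    hits = []
    n = len(MINUS35)
    for start in range(len(seq)):
        for gap in spacers:
            second = start + n + gap
            if second + len(MINUS10) > len(seq):
                continue  # second hexamer would run past the end
            mm = (mismatches(seq[start:start + n], MINUS35)
                  + mismatches(seq[second:second + len(MINUS10)], MINUS10))
            if mm <= max_mismatches:
                hits.append((start, gap, mm))
    return hits

if __name__ == "__main__":
    demo = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
    print(scan_promoter(demo, max_mismatches=0))
```

Comparing the offsets found this way with the sites reported by MEME or Motif Sampler gives a rough idea of whether the extracted motif corresponds to the expected promoter.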