Motif extraction in DNA and protein sequences

Introduction:

A sequence that has remained highly conserved during evolution probably has a function, for instance as a promoter. The converse does not hold: a sequence can be functional without being conserved. Conserved promoter or regulatory sequences may therefore be identified by comparing non-coding sequences within the same organism, or by comparing the non-coding sequences of related genes in different organisms. In the same vein, protein binding sites may be found by comparing protein sequences from the same family, or from different families that share a given function (for instance, binding a specific DNA regulatory sequence). We shall consider here motif extraction from non-coding DNA sequences belonging to the same organism, and from protein sequences belonging to the same or to different families. For this we shall use two types of statistical algorithms: one based on an Expectation-Maximization (EM) approach (MEME) and the other on a Gibbs sampling approach (Motif Sampler).

EM is a two-step iterative procedure for obtaining maximum likelihood parameter estimates for a model of the observed data. Here the model is a collection of sites of a given length, one site per sequence, whose positions in a set of unaligned sequences are unknown. The resulting collection corresponds to a log-likelihood matrix. The algorithm first calculates the expected parameter values that describe the data (i.e. the matrix of putative sites), then maximizes the likelihood of obtaining these values (i.e. refines the matrix by maximizing its relative information content). The two steps are repeated until the parameters that best explain the data are obtained, or until a fixed maximum number of iterations is reached. Two more recent algorithms (due to Cardon et al. and Frech et al.) are further able to handle variable spacers between the parts of a motif. However, the process can be tedious unless a priori information about the second element and/or the spacer is available.
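The two steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code of MEME or of any published program: it assumes exactly one site per sequence, a fixed width w, a uniform prior on the site position, and a starting matrix seeded from a hypothetical word (seed_word). The function name, the pseudocount and the toy sequences are invented for the example.

```python
ALPHABET = "ACGT"

def em_one_site_per_sequence(seqs, w, seed_word, n_iter=30, pseudo=0.5):
    """Toy EM for a motif of width w, one site per sequence (assumed model)."""
    # Background letter frequencies, estimated once from all the sequences.
    total = sum(len(s) for s in seqs)
    bg = {a: (sum(s.count(a) for s in seqs) + pseudo) / (total + 4 * pseudo)
          for a in ALPHABET}
    # Starting motif matrix: the seed letter gets 0.7, the other three 0.1 each.
    theta = [{a: (0.7 if a == seed_word[j] else 0.1) for a in ALPHABET}
             for j in range(w)]
    z = []
    for _ in range(n_iter):
        # Expectation step: z[i][p] is the probability that the site of
        # sequence i starts at position p, given the current matrix.
        z = []
        for s in seqs:
            ratios = []
            for p in range(len(s) - w + 1):
                r = 1.0
                for j in range(w):
                    r *= theta[j][s[p + j]] / bg[s[p + j]]
                ratios.append(r)
            norm = sum(ratios)
            z.append([r / norm for r in ratios])
        # Maximization step: re-estimate the matrix from expected letter counts.
        counts = [{a: pseudo for a in ALPHABET} for _ in range(w)]
        for s, zs in zip(seqs, z):
            for p, zp in enumerate(zs):
                for j in range(w):
                    counts[j][s[p + j]] += zp
        theta = [{a: col[a] / sum(col.values()) for a in ALPHABET}
                 for col in counts]
    return theta, z

# Toy data (invented): each sequence carries one copy of TTGACA.
seqs = ["ACGTTGACAGGC", "TTGACACCGTAG", "GGCATTGACATC", "CATGCCTTGACA"]
theta, z = em_one_site_per_sequence(seqs, 6, "TTGACA")
best_positions = [max(range(len(zs)), key=zs.__getitem__) for zs in z]
consensus = "".join(max(col, key=col.get) for col in theta)
print(best_positions, consensus)
```

Seeding the matrix from a word already close to the motif makes this toy case easy. With a poor seed the same code may settle on a different, locally optimal matrix.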

A poorly chosen initial data set may prevent EM from converging properly: it may not converge at all, or it may converge to a local maximum. This problem is partly solved by a Gibbs sampling approach (Lawrence et al.). The maximization step is replaced by a stochastic one that increases the likelihood of observing the data only with a certain probability. The chance of reaching the maximum increases with the number of iterations. Lawrence et al. state that the algorithm provably converges, although not always to the global maximum.
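The replacement of the maximization step by a sampling step can be illustrated as follows. This is a hedged sketch of a basic site sampler in the spirit of the approach described above, assuming one site per sequence and a fixed width. It is not the implementation of Motif Sampler, and the function name, the pseudocount and the toy sequences are assumptions made for the example.

```python
import random

ALPHABET = "ACGT"

def gibbs_sites(seqs, w, n_sweeps=200, pseudo=0.5, seed=0):
    """Toy Gibbs site sampler: returns one start position per sequence."""
    rng = random.Random(seed)
    total = sum(len(s) for s in seqs)
    bg = {a: (sum(s.count(a) for s in seqs) + pseudo) / (total + 4 * pseudo)
          for a in ALPHABET}
    # Start from random site positions.
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_sweeps):
        for i, s in enumerate(seqs):
            # Build the matrix from the current sites of all OTHER sequences.
            counts = [{a: pseudo for a in ALPHABET} for _ in range(w)]
            for k, t in enumerate(seqs):
                if k != i:
                    for j in range(w):
                        counts[j][t[pos[k] + j]] += 1
            n = len(seqs) - 1 + 4 * pseudo
            # Weight every candidate position by its motif/background ratio.
            weights = []
            for p in range(len(s) - w + 1):
                r = 1.0
                for j in range(w):
                    r *= (counts[j][s[p + j]] / n) / bg[s[p + j]]
                weights.append(r)
            # Sampling step: draw the new position at random, in proportion
            # to the weights, instead of always taking the best one.
            pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

# Toy data (invented): each sequence carries one copy of TTGACA.
seqs = ["ACGTTGACAGGC", "TTGACACCGTAG", "GGCATTGACATC", "CATGCCTTGACA"]
positions = gibbs_sites(seqs, 6, seed=1)
print(positions)
```

Because the new position is drawn at random, a run can leave a poor solution that a deterministic update would stay trapped in. Different seeds may give different answers, so in practice several runs are compared.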

These methods initially assumed, for computational as well as statistical reasons, that each sequence in the data set contained one, and only one site, whose length was known.

The assumption that each sequence in the set contains exactly one site is removed in two more recent statistical methods, MEME (due to Bailey and Elkan) and Motif Sampler (due to Thijs et al.). Although more frequently used for extracting conserved motifs from a set of protein sequences, MEME may also be applied to DNA sequences. The algorithm is basically an EM method with more parameters to estimate, among them the number and length of the sites as well as the number of sequences in which each site is present. Thijs' Motif Sampler, on the other hand, is based on a Gibbs sampling approach and may for now be used only on DNA sequences.

The DNA data we use as illustrations consist of two groups of sets of sequences. The first group is composed of two sets of well-established Escherichia coli and Bacillus subtilis sequences containing an experimentally determined transcription start or promoter. In both cases, the sequences are aligned on the start of transcription. Sequences from Escherichia coli stop at that point, whereas sequences from Bacillus subtilis contain 20 more bases downstream from the transcription start. The second group comprises two genomic sets of non-coding sequences located between two divergent genes and extracted from the whole genomes of Escherichia coli and Bacillus subtilis. Sequences having fewer than 40 bases were eliminated, and only up to 330 nucleotides before the start of translation (as annotated) were initially kept. The first and last 15 bases were then discarded. This eliminates the Shine-Dalgarno sequence as a potential motif.
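The filtering applied to the genomic sets can be written down as a small function. This is one possible reading of the description above, not the script actually used to build the files: it assumes that each sequence is given 5' to 3' and ends just before the annotated start of translation, and the function and parameter names are hypothetical.

```python
def prepare_intergenic(seq, min_len=40, max_upstream=330, trim=15):
    """Apply the three filtering rules; return None if the sequence is dropped."""
    # Rule 1: eliminate sequences having fewer than min_len bases.
    if len(seq) < min_len:
        return None
    # Rule 2: keep at most max_upstream bases before the start of translation
    # (assumed to lie immediately after the last base of seq).
    kept = seq[-max_upstream:]
    # Rule 3: discard the first and last trim bases.
    return kept[trim:len(kept) - trim]

short = prepare_intergenic("A" * 39)
minimal = prepare_intergenic("A" * 40)
long_one = prepare_intergenic("ACGT" * 100)
print(short, len(minimal), len(long_one))
```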

All the sequences in the first group of sets are supposed to contain the promoter consensus TTGACAx(16-18)TATAAT, where x(16-18) means a spacer of 16 to 18 bases between the first and second parts of the promoter sequence. Only some of the sequences in the second group are believed to contain the same motif.
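The notation TTGACAx(16-18)TATAAT translates directly into a regular expression. The sketch below, with a hypothetical function name and invented test sequences, finds exact occurrences of the consensus only. Natural promoters usually deviate from the consensus at one or more positions, which is why the exercises rely on statistical motif models rather than on exact matching.

```python
import re

# TTGACA, then a spacer of 16 to 18 bases, then TATAAT.
# The lookahead lets overlapping occurrences be reported as well.
PROMOTER = re.compile(r"(?=(TTGACA[ACGT]{16,18}TATAAT))")

def find_exact_promoters(seq):
    """Return (start, matched text) for every exact consensus occurrence."""
    return [(m.start(), m.group(1)) for m in PROMOTER.finditer(seq.upper())]

# Invented examples: a 17-base spacer matches, a 15-base spacer does not.
with_site = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
without_site = "GG" + "TTGACA" + "C" * 15 + "TATAAT" + "GG"
hits = find_exact_promoters(with_site)
misses = find_exact_promoters(without_site)
print(hits, misses)
```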

The protein data consist of two sets of sequences. The first contains 3 cytochrome P450 sequences, the second the sequences of 30 proteins which contain a helix-turn-helix (HTH) motif known to bind DNA. The second set is therefore supposed to contain just one motif that is common to all sequences, while the first set corresponds to proteins which belong to the same family and share many common structural motifs (at least 15 according to a study by Jean et al.; see the following table). The question in both cases is whether these motifs can be identified by looking at the sequences only.

First step: Create two directories, one called DNA and the other protein, and download the following files:

into the DNA directory
- promo_coli.seq: a sequence library in FASTA format containing the DNA sequences from Escherichia coli where a promoter or, more generally, the transcriptional starting point has been experimentally determined.
- promo_subtilis.seq: a sequence library in FASTA format containing the DNA sequences from Bacillus subtilis where a promoter or, more generally, the transcriptional starting point has been experimentally determined.
- dnc_coli.seq: a sequence library in FASTA format containing a set of non-coding sequences located between divergent genes in Escherichia coli.
- dnc_subtilis.seq: a sequence library in FASTA format containing a set of non-coding sequences located between divergent genes in Bacillus subtilis.
into the protein directory
- P450_3.seq: a sequence library in FASTA format containing 3 cytochrome P450 sequences.
- hthLawrence.seq: a sequence library in FASTA format containing 30 sequences of proteins which have an HTH motif.
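Before submitting a file to a web server, it can be useful to check what it contains. The following is a minimal FASTA reader, given as a sketch: it assumes the usual layout (a header line starting with ">" followed by one or more sequence lines) and keeps only the first word of each header as the name. The function name is hypothetical.

```python
def read_fasta(lines):
    """Read FASTA-formatted lines into a dict {name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The first word of the header is used as the sequence name.
            fields = line[1:].split()
            name = fields[0] if fields else "seq%d" % (len(records) + 1)
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join the sequence lines and normalize to upper case.
    return {n: "".join(parts).upper() for n, parts in records.items()}

# Invented example with two records.
example = [">a first record", "acgt", "TT", "", ">b", "GG"]
parsed = read_fasta(example)
print(parsed)
```

With a downloaded file, the same function can be applied to an open file handle, for example to the handle returned by open("DNA/promo_coli.seq"), and the number and lengths of the sequences can then be inspected.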

Second step: Start working with the DNA datasets by applying to them the two algorithms, MEME and Motif Sampler, both of which may be used through the web.

Step 2a: Use MEME to extract conserved motifs in the sequences.
- compare the results obtained in the two sets containing an experimentally determined transcription start or promoter sequence: does the known consensus appear among the motifs extracted? Are the consensus sequences obtained the same for both bacteria? What do you think of the other motifs found (hint: look at their positions in the sequence and then examine the information provided by the algorithm for each motif)?
  (if you have any difficulties in obtaining the results, here are those for Escherichia coli (with the MAST file here), and here for Bacillus subtilis (with the MAST file here))
- compare the results obtained in the two genomic sets: do you obtain the same motifs in the two sets? Are these the same as in the experimental sets? In the second case, if the results are not the same, do you have an idea why? What experiment could you do to test your idea?
  (if you have any difficulties in obtaining the results, here are those for Escherichia coli (with the MAST file here))
- with either the experimental or the genomic datasets, try playing with the various parameters: does it change anything in the results obtained?
  (if you have any difficulties in obtaining the results, here is one for genomic Escherichia coli when palindromic motifs are required (with the MAST file here), here one for experimental Bacillus subtilis when the length of the motifs is fixed at 20 (with the MAST file here), and here one for experimental Bacillus subtilis when the length of the motifs is fixed at 6 (with the MAST file here))
Step 2b: Use Motif Sampler to extract conserved motifs in the sequences. Compare the results obtained with any of the sets to those given by MEME. Are there any differences?
(if you have any difficulties in obtaining the results, here are those for Escherichia coli, and here for Bacillus subtilis)
Step 2c (optional): Use the weight matrices derived with MEME in the experimental sets to search for potential promoter sequences, first in the genomic sets and then in the whole genomes of the two bacteria (supplementary exercise: recover them yourself from the web). What do you think of the results, in particular of the number of predicted sites?
(if you have any difficulties in obtaining the results, here are those for genomic Escherichia coli and for non-coding Escherichia coli, and here for genomic Bacillus subtilis and for non-coding Bacillus subtilis)
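To see what searching with a weight matrix involves in Step 2c, the following sketch builds a log-odds matrix from a few aligned sites and slides it along a sequence. It is an illustration only and does not reproduce how MAST scores or assesses its hits: the uniform background, the pseudocount, the forward-strand-only scan and the example sites are all assumptions made for the example.

```python
import math

ALPHABET = "ACGT"

def log_odds_matrix(sites, background=None, pseudo=0.5):
    """Build a log2-odds weight matrix from aligned sites of equal length."""
    background = background or {a: 0.25 for a in ALPHABET}
    matrix = []
    for j in range(len(sites[0])):
        column = {}
        for a in ALPHABET:
            count = sum(s[j] == a for s in sites)
            freq = (count + pseudo) / (len(sites) + 4 * pseudo)
            column[a] = math.log2(freq / background[a])
        matrix.append(column)
    return matrix

def scan(seq, matrix, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    w = len(matrix)
    hits = []
    for p in range(len(seq) - w + 1):
        score = sum(matrix[j][seq[p + j]] for j in range(w))
        if score >= threshold:
            hits.append((p, score))
    return hits

# Invented aligned sites resembling the second part of the promoter consensus.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = log_odds_matrix(sites)
target = "GGGTATAATCC"
strict_hits = scan(target, pwm, threshold=0.0)
all_windows = scan(target, pwm, threshold=-100.0)
best_position = max(strict_hits, key=lambda hit: hit[1])[0]
print(best_position, len(strict_hits), len(all_windows))
```

The number of reported sites depends directly on the threshold, and a short matrix will match many positions in a whole genome by chance alone, which is worth keeping in mind when judging the number of predicted sites.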

Third step: Work now with the protein datasets by applying MEME to them.

Step 3a: Dataset containing 3 cytochromes P450.
- compare the results obtained with those given by the table: how many of the structural motifs determined by Jean et al. are identified (appear statistically significant) at the sequence level?
  (if you have any difficulties in obtaining the results, here are those when 10 motifs are asked for, with the MAST file here)
- since the proteins belong to the same family and should align reasonably well at a global level, try performing a multiple alignment of their sequences using any of the algorithms seen for this. Compare the results with both the table and the results obtained with MEME.
  (if you have any difficulties in obtaining the results, here they are)
Step 3b: Use the weight matrices derived with MEME to search for proteins from the same family in Swiss-Prot. What do you think of the results you get?
(if you have any difficulties in obtaining the results, here they are)
Step 3c: Dataset containing 30 proteins having an HTH motif: how many motifs appear statistically significant?
(if you have any difficulties in obtaining the results, here they are, with matrices for MAST here)
Step 3d: Use the weight matrices derived with MEME to search for proteins from the same family in Swiss-Prot. What do you think of the results you get?
(if you have any difficulties in obtaining the results, here they are)