Inferring regulatory elements
from a whole genome. An analysis of Helicobacter pylori sigma(80)
family of promoter signals
Anne Vanet, Laurent Marsan, Agnès Labigne and Marie-France
Sagot
Journal of Molecular Biology, 297:335-353,
2000
Helicobacter pylori is adapted to life in a unique niche, the
gastric epithelium of primates. Its promoters may therefore be
different from those of other bacteria. Here, we determine motifs
possibly involved in the recognition of such promoter sequences by the
RNA polymerase using a new motif identification method. An important
feature of this method is that the motifs are sought with the fewest
possible assumptions about what they may look like. The method starts
by considering the whole genome of H. pylori and attempts to
infer directly from it a description for a family of promoters. Thus,
this approach differs from searching for such promoters with a
previously established description. The method comprises two algorithms,
both based on the idea of inferring motifs by flexibly comparing words in the
sequences with an external object, instead of between themselves. The
first algorithm infers single motifs, the second a combination of two
motifs separated from one another by strictly defined, sterically
constrained distances. Besides independently finding motifs known to
be present in other bacteria, such as the Shine-Dalgarno sequence and
the TATA-box, this approach suggests the existence in H. pylori
of a new, combined motif, TTAAGC, followed optimally 21 bp downstream
by TATAAT. Between these two motifs, there is in some cases another,
TTTTAA or, less frequently, a repetition of TTAAGC separated optimally
from the TATA-box by 12 bp. The combined motif TTAAGCx(21+/-2)TATAAT
is present with no errors immediately upstream from the only two
copies of the ribosomal 23 S-5 S RNA genes in H. pylori, and
with one error upstream from the only two copies of the ribosomal 16 S
RNA genes. The operons of both ribosomal RNA molecules are strongly
expressed, representing an encouraging sign of the pertinence of the
motifs found by the algorithms. In 25 cases out of a possible 30, the
combined motif is found with no more than three substitutions
immediately upstream from ribosomal protein genes, or operons containing a
ribosomal protein gene. This is roughly the same frequency of occurrence as
for TTGACAx(15-19)TATAAT (with the same maximum number of
substitutions allowed) described as being the sigma(70) promoter
sequence consensus in Bacillus subtilis and Escherichia
coli. The frequency of occurrence of the new motif obtained,
TTAAGCx(19-23)TATAAT, remains high when all protein genes in
H. pylori are considered, as is the case for the
TTGACAx(15-19)TATAAT motif in B. subtilis but not in
E. coli.
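
To make the notation TTAAGCx(19-23)TATAAT concrete, the following is a minimal sketch of how occurrences of such a combined motif could be located in a sequence once the consensus is known. It is not the inference method described in the paper, which derives the motifs from the genome without a prior description; it only scans for an already given pattern. The function names, the reading of the spacer as the number of bases between the two boxes, and the counting of substitutions as a total over both boxes are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def find_combined_motif(seq, box1="TTAAGC", box2="TATAAT",
                        min_gap=19, max_gap=23, max_subs=3):
    """Scan seq for box1, a spacer of min_gap..max_gap bases, then box2.

    Assumption: max_subs is the total number of substitutions allowed
    over both boxes together. Returns a list of tuples
    (start of box1, spacer length, total substitutions).
    """
    seq = seq.upper()
    n1, n2 = len(box1), len(box2)
    hits = []
    for i in range(len(seq) - n1 + 1):
        d1 = hamming(seq[i:i + n1], box1)
        if d1 > max_subs:
            continue  # first box already exceeds the error budget
        for gap in range(min_gap, max_gap + 1):
            j = i + n1 + gap  # start of the candidate second box
            if j + n2 > len(seq):
                break  # longer spacers would run past the sequence end
            d2 = hamming(seq[j:j + n2], box2)
            if d1 + d2 <= max_subs:
                hits.append((i, gap, d1 + d2))
    return hits


# Hypothetical upstream region with an exact copy and a 21-base spacer.
example = "GG" + "TTAAGC" + "A" * 21 + "TATAAT" + "CC"
exact_hits = find_combined_motif(example, max_subs=0)
```

With six-letter boxes, a budget of three substitutions is permissive, so a scan of this kind is only informative when its hit counts are compared against a background, as the abstract does when it sets the new motif against TTGACAx(15-19)TATAAT.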