CpGProD (CpG Island Promoter Detection)
Use of CpGProD
CpGProD is a program dedicated to the prediction of
promoters associated with CpG islands (CGIs) in mammalian genomic sequences.
CpGProD is available either via a web server,
suitable for small datasets, or as a standalone application for larger datasets
(see below). The only input required is a sequence (or file) in FASTA format that
has been masked with RepeatMasker.
Method of CpGProD
In vertebrate genomes, CpG dinucleotides are present at about 25% of their
expected frequency. This deficiency is due to the methylation of cytosine at
CpG dinucleotides and the very high mutation rate of methylated cytosines.
CGIs are stretches of DNA that escape methylation and exhibit a high G+C content
and CpG frequency relative to the bulk DNA (Bird, 1986). CGIs are several
hundred bases to several kilobases long and are dispersed throughout the genome.
Between 50% and 60% of human genes exhibit a CGI over their Transcription Start Site (TSS),
but not all CGIs are associated with a TSS.
Some studies (Ioshikhes and Zhang, 2000; Ponger et al., 2001) have
shown that the CGIs located over a TSS (start CGIs) have a
particular structure compared to other CGIs (no-start CGIs): start CGIs are
longer and display a greater CpGo/e ratio and G+C level than no-start CGIs.
From the length, the G+C content and the CpGo/e ratio of each CGI, CpGProD
computes a score corresponding to the probability that the CGI lies over a TSS
(start-p value, noted start_p below).
Moreover, two compositional biases between the plus and the minus strands
(Lobry, 1996) were observed in the start CGIs (Ponger, unpublished data).
Start CGIs located over the plus strand exhibit an excess of T compared to A and an
excess of G compared to C. Conversely, start CGIs located over the minus strand
exhibit a depletion of T compared to A and a depletion of G compared to C. These
biases are measured by two parameters, the AT-skew and the GC-skew. CpGProD
calculates these parameters to predict the strand of each potential promoter and the
probability that the promoter lies on this strand.
Download CpGProD
Binaries (for Solaris, Linux, SGI, Macintosh and Windows), sources (written in
C), example files (input and output) and data files (used to train and test CpGProD)
can be downloaded from our
FTP server.
You may also use the following table to directly access the files:
Algorithm of CpGProD
The algorithm can be
divided into two steps.
First step: CpGProD searches for all the
CGIs located along the query sequence.
Moving average values for G+C
frequency and CpGo/e ratio are calculated using a 500-nucleotide window
moving along the sequence in steps of 1 nucleotide. Overlapping windows with a
G+C frequency above 0.5 and a CpGo/e value greater than 0.6 are grouped together
to form the CGIs. CGIs are thus defined as DNA regions longer than 500 nucleotides,
with a moving average G+C frequency above 0.5 and a moving average CpGo/e ratio
greater than 0.6. CGIs separated by less than 200 nucleotides are merged.
CpGo/e ratio = CpG_observed / CpG_expected
with CpG_expected = number_of_C * number_of_G / number_of_A_C_G_T
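The first step can be sketched in a few lines of Python. This is a naive illustration of the description above, not the CpGProD source (which is written in C): the function names `composition` and `find_cgis` are hypothetical, and the handling of ambiguous bases, of values exactly at a threshold, and of region boundaries is an assumption, since the text does not specify it.

```python
def composition(seq):
    """Return (G+C frequency, CpGo/e ratio) of a sequence.

    Assumption: symbols other than A, C, G, T (e.g. masked N) are ignored.
    """
    seq = seq.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    n = a + c + g + t
    if n == 0:
        return 0.0, 0.0
    cpg_observed = seq.count("CG")
    cpg_expected = c * g / n
    gc_frequency = (c + g) / n
    cpg_oe = cpg_observed / cpg_expected if cpg_expected > 0 else 0.0
    return gc_frequency, cpg_oe


def find_cgis(seq, window=500, min_gc=0.5, min_oe=0.6, max_gap=200):
    """Return candidate CGIs as a list of (start, end) half-open intervals.

    Every window passing both thresholds is merged with the previous region
    when they overlap or are separated by less than max_gap nucleotides.
    """
    regions = []
    for start in range(len(seq) - window + 1):
        gc_frequency, cpg_oe = composition(seq[start:start + window])
        if gc_frequency > min_gc and cpg_oe > min_oe:
            end = start + window
            if regions and start - regions[-1][1] < max_gap:
                regions[-1][1] = end  # extend the current region
            else:
                regions.append([start, end])
    return [(s, e) for s, e in regions]
```

Recomputing the composition of every window from scratch keeps the sketch short, but costs time proportional to sequence length times window size; a real implementation would update the counts incrementally as the window slides.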
Second step: CpGProD identifies the potential promoters and their orientation.
- For each detected CGI, CpGProD computes the probability that it is a
start CGI (start_p). This probability is calculated from the length, the G+C
frequency and the CpGo/e ratio of the CGI, according to the following relations:
| Species | Relation |
| --- | --- |
| human | Z = -7.271471 + 0.0005927055 * length + 4.043293 * G+C_frequency + 4.83027 * CpGo/e_ratio |
| mouse | Z = -19.4423 - 0.00008749142 * length + 15.27366 * G+C_frequency + 16.38997 * CpGo/e_ratio |

start_p = exp(Z) / (1 + exp(Z))
These relations were determined by fitting two generalized linear models to
two datasets composed of known start and no-start CGIs from human and mouse (data_hum_start.txt
and data_mus_start.txt).
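As a hedged illustration, the start_p relations can be evaluated as follows. The coefficients are copied from the relations given above; the dictionary `START_COEFFS`, the function name `start_p` and its argument names are hypothetical and are not taken from the CpGProD source.

```python
import math

# (intercept, length, G+C frequency, CpGo/e ratio), copied from the relations above.
START_COEFFS = {
    "human": (-7.271471, 0.0005927055, 4.043293, 4.83027),
    "mouse": (-19.4423, -0.00008749142, 15.27366, 16.38997),
}


def start_p(length, gc_frequency, cpg_oe, species="human"):
    """Probability that a CGI is a start CGI, from the logistic relation."""
    b0, b_len, b_gc, b_oe = START_COEFFS[species]
    z = b0 + b_len * length + b_gc * gc_frequency + b_oe * cpg_oe
    return math.exp(z) / (1.0 + math.exp(z))
```

For example, `start_p(1000, 0.65, 0.8)` evaluates the human relation for a 1000-nucleotide CGI with a G+C frequency of 0.65 and a CpGo/e ratio of 0.8; the result can then be compared with a start-p threshold such as 0.3.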
- For each detected CGI, CpGProD identifies the orientation of the
potential promoter by calculating the probability that it lies on the plus strand
(plus_p). This probability is calculated from the AT skew and GC skew values
observed over the CGI, as described below:
AT_skew = (A_nb - T_nb) / (A_nb + T_nb)
GC_skew = (G_nb - C_nb) / (G_nb + C_nb)

| Species | Relation |
| --- | --- |
| human | Z = 0.02853 - 11.01590 * AT_skew + 13.44387 * GC_skew |
| mouse | Z = 0.2161 - 12.3270 * AT_skew + 8.8730 * GC_skew |

plus_p = exp(Z) / (1 + exp(Z))

If plus_p >= 0.5, then strand = plus and strand_p = plus_p; otherwise strand = minus and strand_p = 1 - plus_p.
These relations were determined
by fitting two generalized linear models to two datasets composed of
start CGIs with known orientation (data_hum_strand.txt
and data_mus_strand.txt).
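The strand prediction can be sketched in the same way. The coefficients come from the relations above; the names `STRAND_COEFFS`, `skews` and `predict_strand` are hypothetical, and returning a skew of 0 when a CGI contains no A/T (or no G/C) is an assumption made only to avoid a division by zero.

```python
import math

# (intercept, AT_skew, GC_skew), copied from the relations above.
STRAND_COEFFS = {
    "human": (0.02853, -11.01590, 13.44387),
    "mouse": (0.2161, -12.3270, 8.8730),
}


def skews(seq):
    """Return (AT_skew, GC_skew) computed over the whole sequence."""
    seq = seq.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    at_skew = (a - t) / (a + t) if a + t else 0.0
    gc_skew = (g - c) / (g + c) if g + c else 0.0
    return at_skew, gc_skew


def predict_strand(seq, species="human"):
    """Return (strand, strand_p) for the sequence of a detected CGI."""
    b0, b_at, b_gc = STRAND_COEFFS[species]
    at_skew, gc_skew = skews(seq)
    z = b0 + b_at * at_skew + b_gc * gc_skew
    plus_p = math.exp(z) / (1.0 + math.exp(z))
    if plus_p >= 0.5:
        return "plus", plus_p
    return "minus", 1.0 - plus_p
```

A sequence with more T than A and more G than C has a negative AT skew and a positive GC skew, so both terms raise Z and the prediction is the plus strand, in line with the biases described in the method section.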
Supplementary information: CpGProD results
Table 1:
Results of CpGProD, CpG_promoter and PromoterInspector
on different datasets. Threshold: start-p threshold used to identify
promoters in CpGProD. Sensitivity (all): sensitivity calculated from
datasets composed of sequences with and without a start CGI. Sensitivity (CGI):
sensitivity calculated from datasets composed only of sequences with a start CGI. Specificity:
specificity of the method. Strand prediction: frequency of correct strand prediction.
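| Method | Data | Threshold | Sensitivity (all) | Sensitivity (CGI) | Specificity | Strand prediction |
| --- | --- | --- | --- | --- | --- | --- |
| CpGProD (human option) | 755 genes (32.8Mbp) CGI dataset | 0.0 | 0.56 | 1.00 | 0.39 | 0.70 |
| CpGProD (human option) | 755 genes (32.8Mbp) CGI dataset | 0.3 | 0.27 | 0.48 | 0.51 | 0.73 |
| CpGProD (human option) | 755 genes (32.8Mbp) CGI dataset | 0.6 | 0.03 | 0.06 | 0.69 | 0.79 |
| CpGProD (mouse option) | 147 genes (2.4Mbp) CGI dataset | 0.0 | 0.52 | 1.00 | 0.48 | 0.73 |
| CpGProD (mouse option) | 147 genes (2.4Mbp) CGI dataset | 0.3 | 0.48 | 0.93 | 0.74 | 0.76 |
| CpGProD (mouse option) | 147 genes (2.4Mbp) CGI dataset | 0.6 | 0.35 | 0.67 | 0.82 | 0.76 |
| CpGProD (human option) | 19 genes (825kbp) Ioshikhes and Zhang (2000) | 0.0 | - | 0.84 | 0.43 | - |
| CpGProD (human option) | 19 genes (825kbp) Ioshikhes and Zhang (2000) | 0.3 | - | 0.74 | 0.87 | - |
| CpG_promoter | 19 genes (825kbp) Ioshikhes and Zhang (2000) | - | - | 0.62 | 0.62 | - |
| CpGProD (human/mouse option) | 35 genes (1.37Mbp) Sherf et al. (2000) | 0.0 | 0.80 | - | 0.53 | - |
| CpGProD (human/mouse option) | 35 genes (1.37Mbp) Sherf et al. (2000) | 0.3 | 0.74 | - | 0.60 | - |
| PromoterInspector | 35 genes (1.37Mbp) Sherf et al. (2000) | - | 0.43 | - | 0.43 | - |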
Table 2:
Results of CpGProD on the chromosome 22 dataset (Dunham et al., 1999)
and on the Human Genome Project data (Lander et al., 2000).
Nb: number of genes. Sens.: sensitivity observed for each class
of genes. Spec.: specificity of the methods.
Chromosome 22 (35Mbp), Dunham et al., 1999

| Gene class | Nb | CpGProD threshold 0.0 (791 start CGIs): sens. | CpGProD threshold 0.0: spec. | CpGProD threshold 0.3 (355 start CGIs): sens. | CpGProD threshold 0.3: spec. | PromoterInspector (465 promoters): sens. | PromoterInspector: spec. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Known genes | 247 | 0.64 | - | 0.40 | - | 0.45 | - |
| Predicted genes | 298 | 0.47 | - | 0.30 | - | 0.25 | - |
| Total | 545 | 0.51 | 0.40 | 0.38 | 0.62 | 0.33 | 0.40 |

Human Genome Project data (44 sequences, 3.4 Gbp) (release 12 dec 2000)

| Gene class | Nb | CpGProD threshold 0.0 (35050 start CGIs): sens. | CpGProD threshold 0.0: spec. | CpGProD threshold 0.3 (16023 start CGIs): sens. | CpGProD threshold 0.3: spec. |
| --- | --- | --- | --- | --- | --- |
| Known genes | 12292 | 0.53 | - | 0.41 | - |
| Predicted genes | 27245 | 0.31 | - | 0.23 | - |
| Total | 36192 * | 0.36 | 0.37 | 0.27 | 0.60 |
* The total number of genes does not correspond to the sum of known and predicted
genes because the redundancy between these two classes of genes was removed.
Reference
If you use CpGProD in a published work, please cite the following reference:
- Ponger, L. and Mouchiroud, D. (2001) CpGProD: identifying CpG islands associated
with transcription start sites in large genomic mammalian sequences.
Bioinformatics, 18, 631-633
If you have any problems or comments about CpGProD, please contact
Loic Ponger.