PRABI-Doua: CpGProD

CpGProD (CpG Island Promoter Detection)

Use of CpGProD

CpGProD is a program dedicated to the prediction of promoters associated with CpG islands (CGIs) in mammalian genomic sequences. It is available either as a web server, suited to small datasets, or as a standalone application for larger datasets (see below). The only input required is a sequence (or file) in FASTA format that has been masked with RepeatMasker.

Method of CpGProD

In vertebrate genomes, CpG dinucleotides occur at about 25% of their expected frequency. This deficiency results from the methylation of cytosine at CpG dinucleotides and the very high mutation rate of methylated cytosines.

CGIs are stretches of DNA that escape methylation and exhibit a high G+C content and CpG frequency relative to the bulk DNA (Bird, 1986). CGIs are several hundred bases to several kilobases long and are dispersed throughout the genome. 50%-60% of human genes exhibit a CGI over their Transcription Start Site (TSS), but not all CGIs are associated with a TSS.

Some studies (Ioshikhes and Zhang, 2000; Ponger et al., 2001) have shown that the CGIs located over a TSS (start CGIs) have a particular structure compared to other CGIs (no-start CGIs): start CGIs are longer and display a greater CpGo/e ratio and G+C level than no-start CGIs. For each CGI, CpGProD computes a score corresponding to the probability that the CGI lies over a TSS (start-p value) from its length, G+C content and CpGo/e ratio.

Moreover, two compositional biases between the plus and the minus strand (Lobry, 1996) were observed in the start CGIs (Ponger, unpublished data). Start CGIs located over the plus strand exhibit an excess of T compared to A and an excess of G compared to C. Conversely, start CGIs located over the minus strand exhibit a depletion of T compared to A and a depletion of G compared to C. These biases are measured with two parameters, the AT-skew and the GC-skew. CpGProD calculates these parameters to predict the strand of each potential promoter and the probability that the promoter lies on this strand.

Download CpGProD

Binaries (for Solaris, Linux, SGI, Macintosh and Windows), sources (written in C), example files (input and output) and data files (used to train and test CpGProD) can be downloaded from our FTP server. You may also use the following table to access the files directly:

Solaris	Linux	SGI	Macintosh
Windows	Sources	Example	Data

Algorithm of CpGProD

The algorithm can be divided into two steps.

First step: CpGProD searches for all the CGIs located along the query sequence.
Moving average values for the G+C frequency and the CpGo/e ratio are calculated with a 500-nucleotide window moving along the sequence in steps of 1 nucleotide. Overlapping windows with a G+C frequency above 0.5 and a CpGo/e value greater than 0.6 are grouped together to form the CGIs. CGIs are thus defined as DNA regions longer than 500 nucleotides with a moving average G+C frequency above 0.5 and a moving average CpGo/e ratio greater than 0.6. CGIs separated by less than 200 nucleotides are grouped together.

CpGo/e ratio = CpG_observed / CpG_expected
with CpG_expected = number_of_C * number_of_G / number_of_A_C_G_T
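
The first step can be illustrated with a short Python sketch. This is not the CpGProD source (which is written in C); the function names cpg_oe, gc_frequency and find_cgis are hypothetical, and several details are assumptions: only A, C, G and T are counted (so N-masked bases are ignored), the length filter is applied before neighbouring CGIs are merged, and each window is recomputed from scratch for readability rather than speed.

```python
def cpg_oe(seq):
    """CpGo/e = CpG_observed / (number_of_C * number_of_G / number_of_A_C_G_T)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    n = sum(seq.count(b) for b in "ACGT")  # assumption: N and other symbols are ignored
    if c == 0 or g == 0 or n == 0:
        return 0.0
    return seq.count("CG") / (c * g / n)


def gc_frequency(seq):
    """G+C frequency among the A, C, G, T bases of seq."""
    seq = seq.upper()
    n = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / n if n else 0.0


def find_cgis(seq, window=500, min_gc=0.5, min_oe=0.6, min_len=500, max_gap=200):
    """Return candidate CGIs as half-open (start, end) intervals (0-based)."""
    # Group overlapping windows that pass both thresholds.
    regions = []
    for i in range(len(seq) - window + 1):
        w = seq[i:i + window]
        if gc_frequency(w) > min_gc and cpg_oe(w) > min_oe:
            if regions and i < regions[-1][1]:
                regions[-1][1] = i + window
            else:
                regions.append([i, i + window])
    # Keep regions longer than min_len (assumed to come before merging).
    regions = [r for r in regions if r[1] - r[0] > min_len]
    # Merge CGIs separated by less than max_gap nucleotides.
    merged = []
    for start, end in regions:
        if merged and start - merged[-1][1] < max_gap:
            merged[-1][1] = end
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]
```

For example, find_cgis(sequence) on a masked sequence string returns a list of intervals whose length, G+C frequency and CpGo/e ratio can then be passed to the second step.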

Second step: CpGProD identifies the potential promoters and their orientation.
*** For each detected CGI, CpGProD computes the probability that it is a start CGI (start_p). This probability is calculated from the length, the G+C frequency and the CpGo/e ratio of the CGI, according to the following relations:

human	mouse
Z= -7.271471 + 0.0005927055 * length + 4.043293 * G+C_frequency + 4.83027 * CpGo/e_ratio	Z= -19.4423 + -0.00008749142 * length + 15.27366 * G+C_frequency + 16.38997 * CpGo/e_ratio
start_p = exp(Z) / (1 + exp(Z))

These relations were determined by using two generalized linear models trained on two datasets composed of known start and no-start CGIs of human and mouse (data_hum_start.txt and data_mus_start.txt).
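
As an illustration, the start_p relations above can be evaluated with the following Python sketch. The coefficients are copied from the table; the names START_COEFFS and start_p (as a function) and the species argument are hypothetical conveniences, not part of CpGProD itself. The logistic function is written as 1 / (1 + exp(-Z)), which is algebraically the same as exp(Z) / (1 + exp(Z)).

```python
import math

# (intercept, length, G+C_frequency, CpGo/e_ratio) coefficients from the table above
START_COEFFS = {
    "human": (-7.271471, 0.0005927055, 4.043293, 4.83027),
    "mouse": (-19.4423, -0.00008749142, 15.27366, 16.38997),
}


def start_p(length, gc_frequency, cpg_oe_ratio, species="human"):
    """Probability that a CGI is a start CGI, from its length and composition."""
    b0, b_len, b_gc, b_oe = START_COEFFS[species]
    z = b0 + b_len * length + b_gc * gc_frequency + b_oe * cpg_oe_ratio
    return 1.0 / (1.0 + math.exp(-z))  # same as exp(z) / (1 + exp(z))


# A 1000-nucleotide CGI with G+C frequency 0.65 and CpGo/e ratio 0.9
print(round(start_p(1000, 0.65, 0.9, "human"), 2))
```

With the human coefficients, Z for this example is about 0.30, so the CGI falls just above the 0.5 mark.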

*** For each detected CGI, CpGProD identifies the orientation of the potential promoter by calculating the probability that it lies on the plus strand (plus_p). This probability is calculated from the AT skew and the GC skew values observed over the CGI, as described below:

AT_skew = (A_nb - T_nb) / (A_nb + T_nb)
GC_skew = (G_nb - C_nb) / (G_nb + C_nb)

human	mouse
Z= 0.02853 + -11.01590 * AT_skew + 13.44387 * GC_skew	Z= 0.2161 + -12.3270 * AT_skew + 8.8730 * GC_skew
plus_p = exp(Z) / (1 + exp(Z))
If plus_p >= 0.5 then strand=plus and strand_p=plus_p else strand=minus and strand_p=1-plus_p

These relations were determined by using two generalized linear models trained on two datasets composed of start CGIs with known orientation (data_hum_strand.txt and data_mus_strand.txt).
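
The strand prediction can be sketched in Python in the same way. The coefficients are copied from the table above; the names STRAND_COEFFS, skews and predict_strand are hypothetical, and returning a skew of 0 when A+T or G+C is empty is an assumption made only to avoid a division by zero.

```python
import math

# (intercept, AT_skew, GC_skew) coefficients from the table above
STRAND_COEFFS = {
    "human": (0.02853, -11.01590, 13.44387),
    "mouse": (0.2161, -12.3270, 8.8730),
}


def skews(seq):
    """Return (AT_skew, GC_skew) computed over the CGI sequence."""
    seq = seq.upper()
    a, c, g, t = (seq.count(b) for b in "ACGT")
    at_skew = (a - t) / (a + t) if a + t else 0.0  # assumption: 0 if no A or T
    gc_skew = (g - c) / (g + c) if g + c else 0.0  # assumption: 0 if no G or C
    return at_skew, gc_skew


def predict_strand(seq, species="human"):
    """Return (strand, strand_p) following the plus_p rule above."""
    b0, b_at, b_gc = STRAND_COEFFS[species]
    at_skew, gc_skew = skews(seq)
    z = b0 + b_at * at_skew + b_gc * gc_skew
    plus_p = 1.0 / (1.0 + math.exp(-z))  # same as exp(z) / (1 + exp(z))
    if plus_p >= 0.5:
        return "plus", plus_p
    return "minus", 1.0 - plus_p
```

A sequence with more T than A and more G than C yields a negative AT skew and a positive GC skew, so both terms push Z upward and the plus strand is predicted; its reverse complement gives the opposite call.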

Supplementary information: CpGProD results

Table 1:

Results of CpGProD, CpG_promoter and PromoterInspector on different datasets. Threshold: start-p threshold used to identify promoters in CpGProD. Sensitivity (all): sensitivity calculated from datasets composed of sequences with and without a start CGI. Sensitivity (CGI): sensitivity calculated from datasets composed only of sequences with a start CGI. Specificity: specificity of the method. Strand prediction: frequency of correct strand prediction.

Method	Data	Threshold	Sensitivity (all)	Sensitivity (CGI)	Specificity	Strand prediction
CpGProD (human option)	755 genes (32.8Mbp) CGI dataset	0.0	0.56	1.00	0.39	0.70
		0.3	0.27	0.48	0.51	0.73
		0.6	0.03	0.06	0.69	0.79
CpGProD (mouse option)	147 genes (2.4Mbp) CGI dataset	0.0	0.52	1.00	0.48	0.73
		0.3	0.48	0.93	0.74	0.76
		0.6	0.35	0.67	0.82	0.76
CpGProD (human option)	19 genes (825kbp) Ioshikhes and Zhang (2000)	0.0	-	0.84	0.43	-
CpGProD (human option)		0.3	-	0.74	0.87	-
CpG_promoter		-	-	0.62	0.62	-
CpGProD (human/mouse option)	35 genes (1.37Mbp) Sherf et al. (2000)	0.0	0.80	-	0.53	-
CpGProD (human/mouse option)		0.3	0.74	-	0.60	-
PromoterInspector		-	0.43	-	0.43	-

Table 2:

Results of CpGProD on the chromosome 22 dataset (Dunham et al., 1999) and on the Human Genome Project data (Lander et al., 2000). Nb: number of genes. Sens.: sensitivity observed for each class of genes. Spec.: specificity of the methods.

	Chromosome 22 (35Mbp) Dunham et al., 1999							Human Genome Project data (44 sequences, 3.4 Gbp) (release 12 dec 2000)
		CpGProD threshold 0.0 791 start CGIs		CpGProD threshold 0.3 355 start CGIs		PromoterInspector 465 promoters			CpGProD threshold 0.0 35050 start CGIs		CpGProD threshold 0.3 16023 start CGIs
	Nb	sens.	spec.	sens.	spec.	sens.	spec.	Nb	sens.	spec.	sens.	spec.
Known genes	247	0.64	-	0.40	-	0.45	-	12292	0.53	-	0.41	-
Predicted genes	298	0.47	-	0.30	-	0.25	-	27245	0.31	-	0.23	-
Total	545	0.51	0.40	0.38	0.62	0.33	0.40	36192 *	0.36	0.37	0.27	0.60

* The total number of genes does not correspond to the sum of known and predicted genes, since the redundancy between these two classes of genes was eliminated.

Reference

If you use CpGProD in a published work, please cite the following reference:

Ponger, L. and Mouchiroud, D. (2001) CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. Bioinformatics, 18, 631-633.

If you have any problems or comments about CpGProD, please contact Loic Ponger.