In mammalian genomes, methylation of cytosine in CG dinucleotides
(called CpG) entails a low frequency of such dinucleotides, except in
small segments called CpG islands (see [PDM01]). A usual way to
detect CpG islands is to compute the ratio of the number of observed
CpG to the number of expected CpG on sliding windows.
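As a standalone illustration of this sliding-window criterion (independent of the library used below; the window and step sizes are arbitrary choices for the sketch):

```python
def cpg_obs_exp(seq):
    # observed/expected CpG ratio of a sequence:
    # (#CG * length) / (#C * #G); 0 when C or G is absent
    n_c, n_g, n_cg = seq.count('C'), seq.count('G'), seq.count('CG')
    if n_c == 0 or n_g == 0:
        return 0.0
    return float(n_cg * len(seq)) / (n_c * n_g)

def sliding_ratios(seq, window=200, step=50):
    # ratio on each sliding window; CpG islands are commonly
    # reported where this ratio exceeds about 0.6
    return [cpg_obs_exp(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]
```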
The goal of this part is to segment a mouse sequence to reveal CpG
islands. The modelling used has two states: a model of CpG islands vs
a model of methylated sequence. Both models are learned from the
proportions of words in sets of sequences, respectively in the files
mus_cpg.fa and
mus_tem.fa.
First, we use the MPP algorithm and compute the segmentation
probability to build a maximal predictive partition.
Then we perform an HMM analysis. For this, we introduce the
information that in the mouse genome, CpG islands are on average
1,000 bases long, and are spaced on average 125,000 bases apart.
Finally, we draw the partitions computed by these methods alongside
some computed by CpG-specialised methods.
- We count the (1|1)-proportions (the proportions of each letter
conditional on the previous one) on the Lsequence built from each of
the two sets of sequences.
import lsequence
ls_cpg=lsequence.Lsequence(fic="mus_cpg.fa")
ls_tem=lsequence.Lsequence(fic="mus_tem.fa")
import compte
c_cpg=compte.Compte()
for s in ls_cpg:
    c_cpg.add_seq(s,2)
c_tem=compte.Compte()
for s in ls_tem:
    c_tem.add_seq(s,2)
p_cpg=c_cpg.strip().prop(lprior=1,lpost=1)
print p_cpg
p_tem=c_tem.strip().prop(lprior=1,lpost=1)
print p_tem
- We have to remove the N letters:
p_cpg=c_cpg.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print p_cpg
p_tem=c_tem.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print p_tem
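The prop(lprior=1,lpost=1) calls above estimate, for each letter, its proportion conditional on the previous letter. A plain-Python sketch of the same quantity (the function name is ours, not the library's):

```python
def conditional_props(seqs, alphabet="ACGT"):
    # P(b | a) for each dinucleotide ab, estimated from counts;
    # letters outside the alphabet (e.g. N) are simply skipped
    counts = {a: {b: 0 for b in alphabet} for a in alphabet}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            if a in counts and b in counts[a]:
                counts[a][b] += 1
    props = {}
    for a in alphabet:
        tot = sum(counts[a].values())
        props[a] = {b: counts[a][b] / float(tot) if tot else 0.0
                    for b in alphabet}
    return props
```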
- Then we build a Lproportion
for our modelling. We introduce the inter-state probabilities for
the HMM:
import lcompte
lp=lcompte.Lproportion()
lp[0]=p_cpg # state 0 for cpg islands
lp[1]=p_tem # state 1 for sequences between CpG islands
lp.g_inter(0,0,1-1/1000.0)
lp.g_inter(0,1,1/1000.0)
lp.g_inter(1,0,1/125000.0)
lp.g_inter(1,1,1-1/125000.0)
print lp
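These inter-state probabilities encode the mean segment lengths: with a self-transition probability 1 - p, the time spent in a state is geometric with mean 1/p, hence an exit probability of 1/1000 for islands and 1/125000 for the sequence between them. A small sanity-check sketch (the helper names are ours):

```python
def transition_probs(mean_len):
    # exit and self-transition probabilities of a state whose
    # segments are mean_len bases long on average (geometric dwell)
    p_exit = 1.0 / mean_len
    return p_exit, 1.0 - p_exit

def mean_dwell(p_exit):
    # mean of a geometric distribution with parameter p_exit
    return 1.0 / p_exit
```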
- To build partitions of the
studied sequence, we have to use a
Lexique.
import sequence
s_mus=sequence.Sequence(fic="mus2.fa")
import lexique
lx_hmm=lexique.Lexique()
lx_hmm.read_Lprop(lp)
- We compute the segmentation log-likelihood of the sequence under
the models, for up to 100 classes.
lprob=lx_hmm.log_likelihood(s_mus,100)
- We want to compute the variance of the distribution of the
segmentation probabilities of the Sequence.
- We build a Matrice corresponding to the predictions of the Lexique
lx_hmm on the Sequence.
import matrice
m=matrice.Matrice()
m.prediction(s_mus,lx_hmm)
- We build the Lexique for the 2nd moment, and compute it. Then we
compute the variance of the probabilities, as the difference between
the second moment and the squared first moment, in log space.
import math
lx2=lexique.Lexique(str="0:#0(2) 1:#1(2)")
lprob2=lx2.log_likelihood(m,100)
lvar=[]
for i in range(len(lprob)):
    lvar.append(lprob2[i]+math.log(1-math.exp(2*lprob[i]-lprob2[i]))) # log(M2-M1^2)
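The moments are combined in log space because the segmentation probabilities themselves are far too small to exponentiate directly. A standalone sketch of this trick (independent of the library objects above):

```python
import math

def log_variance(log_m1, log_m2):
    # log(Var) = log(M2 - M1^2), given log_m1 = log E[X] and
    # log_m2 = log E[X^2]; the larger term log_m2 is factored out
    return log_m2 + math.log(1.0 - math.exp(2.0 * log_m1 - log_m2))
```

For instance, a variable with E[X] = 2 and E[X^2] = 5 has variance 1, so its log-variance is 0.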
- We compute the MPP corresponding to the Lexique, for up to 50
classes; we will then draw it with the height of each arc
proportional to the C+G density.
import parti_simp
ps=parti_simp.Parti_simp()
ps.mpp(s_mus, lx_hmm, 50)
- We want to draw the partitioning such that the height of each
arc is proportional to the density in C+G. So, we use a Lexique with
the descriptor C+G.
lx_cg=lexique.Lexique(str="+(CG)")
ps.draw_nf("mus_ps_cg.ps",func=lambda s: \
lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
- We compute the a posteriori log-probabilities, with an
a priori probability P(ℙ_k) ∝ 0.546^k on the number of classes k,
and select the partition with the maximal estimator:
lap=[]
for i in range(len(lprob)):
    lap.append(lprob[i]+(i+1)*math.log(0.546))
nbc=lap.index(max(lap))+1
p_nbc=ps[nbc-1]
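The same selection rule (log-likelihood penalised by a geometric prior on the number of classes) can be written as a small generic helper, shown here as a sketch with our own function name:

```python
import math

def best_class_count(log_likelihoods, prior=0.546):
    # returns the k (1-based) maximising log P(D | k) + k*log(prior),
    # i.e. a MAP choice under a geometric prior on k classes
    scores = [lp + (k + 1) * math.log(prior)
              for k, lp in enumerate(log_likelihoods)]
    return scores.index(max(scores)) + 1
```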
- The next partition is computed via the Viterbi algorithm.
import partition
p_vit=partition.Partition()
p_vit.viterbi(s_mus,lx_hmm)
print p_vit
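For readers curious about what this call does internally, the Viterbi recursion for a two-state HMM can be sketched in log space; the toy parameters used to exercise it would be illustrative, not the ones learned from the mouse data:

```python
def viterbi(obs, states, log_start, log_trans, log_emit):
    # most probable state path, all parameters as log-probabilities
    v = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, v, ptr = v, {}, {}
        for s in states:
            # best predecessor of state s at this position
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            ptr[s] = best
            v[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
        back.append(ptr)
    # backtrack from the best final state
    path = [max(states, key=lambda s: v[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```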
- The following one is built via the forward-backward algorithm, by
taking the most probable state at each position. Beforehand, the
Matrice of the log-probabilities of
the states must be computed.
import matrice
m_fb=matrice.Matrice()
m_fb.fb(s_mus,lx_hmm)
p_fb=partition.Partition()
p_fb.read_Matrice(m_fb)
print p_fb
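Similarly, posterior decoding by forward-backward can be sketched in plain Python (unscaled probabilities, so only suitable for short toy sequences; any parameters fed to it here would again be illustrative):

```python
def posterior_states(obs, states, start, trans, emit):
    # most probable state at each position, via forward-backward
    n = len(obs)
    # forward: f[i][s] = P(obs[:i+1], state at i is s)
    f = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        f.append({s: emit[s][o] * sum(f[-1][r] * trans[r][s]
                                      for r in states) for s in states})
    # backward: b[i][s] = P(obs[i+1:] | state at i is s)
    b = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        b.append({s: sum(trans[s][r] * emit[r][o] * b[-1][r]
                         for r in states) for s in states})
    b.reverse()
    # per-position posterior argmax
    return [max(states, key=lambda s: f[i][s] * b[i][s])
            for i in range(n)]
```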
- And we draw the partitions:
p_nbc.draw_nf("mus_nbc_cg.ps",func=lambda s: \
lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
p_vit.draw_nf("mus_vit_cg.ps",func=lambda s: \
lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
p_fb.draw_nf("mus_fb_cg.ps",func=lambda s: \
lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
- We want to draw these partitions such that the height of each
arc is proportional to the ratio:
number of observed CpG / number of expected CpG
 = (number of CpG × length of the segment) / (number of C × number of G)
We use a Lexique with 3 descriptors: C, G and the word CG, and define a
function on a lexique, a sequence, and a segment, that returns the ratio.
lx_oe=lexique.Lexique(str="0:C 1:G 2:`CG'")
def height(l,sq,sg):
    dl=l.ls_evalue(sq,deb=sg.deb(),fin=sg.fin())
    if dl[1]*dl[0]==0:
        return 0
    else:
        return float(dl[2]*len(sg))/(dl[1]*dl[0])
And we draw the partitions:
p_nbc.draw_nf("mus_nbc_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
p_vit.draw_nf("mus_vit_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
p_fb.draw_nf("mus_fb_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
- We want to compare these methods with some specifically designed
for CpG island detection: CpGproD [PM01],
CpGplot [LGLP92] and CpG Island
Searcher [TJ02]. As the outputs of these programs have
different syntaxes, they are converted into separate files. Those files
are, respectively, cpgprod.part,
cpgplot.part, and
cpgis.part.
They are all put in a Parti_simp, together with the ones computed
with the HMM algorithms:
pscmp=parti_simp.Parti_simp()
pscmp.append(partition.Partition(fic="cpgprod.part"))
pscmp[-1].s_name("cpgprod")
pscmp.append(partition.Partition(fic="cpgplot.part"))
pscmp[-1].s_name("cpgplot")
pscmp.append(partition.Partition(fic="cpgis.part"))
pscmp[-1].s_name("cpgis")
pscmp.append(p_fb)
pscmp[-1].s_name("fb")
pscmp.append(p_vit)
pscmp[-1].s_name("vit")
pscmp.append(p_nbc)
pscmp[-1].s_name("mpp")
pscmp.draw_nf("tous_cpg.ps",seg=[0])