Looking for CpG islands

5 Looking for CpG islands

In mammal genomes, methylation of cytosine in dinucleotides CG (called CpG) entails low frequency of such dinucleotides, excepted in small segments, called CpG islands (see [PDM01]). A usual way to detect CpG islands is to compute the ratio number of observed CpG/number of expected CpG on sliding windows.

To goal of this part is to segment a mouse sequence to reveal CpG islands. The model used is 1-length HMM with two states: model of CpG island vs model of methylated sequence. Both models are learned from the proportions of words in sets of sequences, respectively in files mus_cpg.fa and mus_tem.fa.

Moreover, we use the information that in mouse genome, CpG islands are in average 1000 bases long, and are spaced by in average 125,000 bases.

We count the 1|1-proportions on the Lsequence built from both sets of sequences.

import lsequence
ls_cpg=lsequence.Lsequence(fic="mus_cpg.fa")
ls_tem=lsequence.Lsequence(fic="mus_tem.fa")
import compte
c_cpg=compte.Compte()
for s in ls_cpg:
    c_cpg.add_seq(s,2)

c_tem=compte.Compte()
for s in ls_tem:
    c_tem.add_seq(s,2)

p_cpg=c_cpg.strip().prop(lprior=1,lpost=1)
print p_cpg
p_tem=c_tem.strip().prop(lprior=1,lpost=1)
print p_tem

We have to remove the N letters:

p_cpg=c_cpg.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print p_cpg
p_tem=c_tem.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print p_tem

Then we build a Lproportion for our HMM:

import lcompte
lp=lcompte.Lproportion()
lp[0]=p_cpg       # state 0 for cpg islands
lp[1]=p_tem       # state 1 for sequences between of cpg islands
lp.g_inter(0,0,1-1/1000.0)
lp.g_inter(0,1,1/1000.0)
lp.g_inter(1,0,1/125000.0)
lp.g_inter(1,1,1-1/125000.0)
print lp

To build partitions of the studied sequence, we have use a Lexique.

import sequence
s_mus=sequence.Sequence(fic="mus2.fa")
import lexique
lx_hmm=lexique.Lexique()
lx_hmm.read_Lprop(lp)

The first partition is computed via Viterbi algorithm.

import partition
p_vit=partition.Partition()
p_vit.viterbi(s_mus,lx_hmm)
print p_vit

The second partition is built via forward-backward algorithm, by the most probable state on each position. Then, before, the Matrice of the log-probabilities of the states must be computed.
```
import matrice
m_fb=matrice.Matrice()
m_fb.fb(s_mus,lx_hmm)
p_fb=partition.Partition()
p_fb.read_Matrice(m_fb)
print p_fb
```

We want to draw these partitions such that the height of each arc is proportional to the density in C+G.

We use a Lexique with the descriptor C+G.

lx_cg=lexique.Lexique(str="+(CG)")

And we draw the partitions:

p_vit.draw_nf("mus_vit_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
p_fb.draw_nf("mus_fb_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))

We want to draw these partitions such that the height of each arc is proportional to the value:

number of observed CpG

number of expected CpG
=

number of CpG.length of the segment

number of C.number of G

We use a Lexique with 3 descriptors: C G and word CG, and define the function on a lexique, a sequence, and a segment, that returns the ratio.
```
lx_oe=lexique.Lexique(str="0:C 1:G 2:`CG'")
def height(l,sq, sg):
    dl=l.ls_evalue(sq,deb=sg.deb(),fin=sg.fin())
    return float(dl[2]*len(sg))/(dl[1]*dl[0])
```
And we draw the partitions:
```
p_vit.draw_nf("mus_vit_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
p_fb.draw_nf("mus_fb_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
```

Now we compute the MPP corresponding to the Lexique; and we draw it, with the height of each arc proportional with the G+C density.

import parti_simp
ps=parti_simp.Parti_simp()
ps.mpp(s_mus, lx_hmm, 50)
ps.draw_nf("mus_ps_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))

We want to compare those methods with some specifically made for CpG islands detection: CpGproD [PM01], CpGplot [LGLP92] and CpG Island Searcher [TJ02]. As the outputs of these programs have different syntaxes, they are converted in several files. Those files are, respectively, cpgprod.part, cpgplot.part, and cpgis.part.

They are all put un a Parti_simp, with the ones computed with HMM algorithms:
```
pscmp=parti_simp.Parti_simp()
pscmp.append(partition.Partition(fic="cpgprod.part"))
pscmp[-1].s_name("cpgprod")
pscmp.append(partition.Partition(fic="cpgplot.part"))
pscmp[-1].s_name("cpgplot")
pscmp.append(partition.Partition(fic="cpgis.part"))
pscmp[-1].s_name("cpgis")
pscmp.append(p_fb)
pscmp[-1].s_name("fb")
pscmp.append(p_vit)
pscmp[-1].s_name("vit")
```

We compare these partitions with the ones computed with MPP, and put the closest ones together, with the criterium of the number of same predictions.

lmax=[]
for p in pscmp:
    max=0
    imax=0
    for i in range(len(ps)):
        v=ps[i].pts_comm(p)
        if v>max:
              imax=i
              max=v
    lmax.append(imax)

j=1
for i in lmax:
    pscmp.insert(j,ps[i])
    j+=2

pscmp.draw_nf("tous_cpg.ps",seg=[0])