In mammalian genomes, methylation of cytosine in CG dinucleotides
(called CpG) entails a low frequency of such dinucleotides, except in
small segments called CpG islands (see [PDM01]). A usual way to
detect CpG islands is to compute the ratio of the number of observed
CpG to the number of expected CpG on sliding windows.
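As a standalone illustration of this sliding-window criterion (independent of the library used below; the window and step sizes are arbitrary choices for the sketch):

```python
def cpg_obs_exp(seq):
    # observed/expected CpG ratio of a sequence:
    # (#CG * length) / (#C * #G); 0 when C or G is absent
    n_c, n_g, n_cg = seq.count('C'), seq.count('G'), seq.count('CG')
    if n_c == 0 or n_g == 0:
        return 0.0
    return float(n_cg * len(seq)) / (n_c * n_g)

def sliding_ratios(seq, window=200, step=50):
    # ratio on each sliding window; CpG islands are commonly
    # reported where this ratio exceeds about 0.6
    return [cpg_obs_exp(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]
```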
The goal of this part is to segment a mouse sequence to reveal CpG
islands. The modelling used has two states: a model of CpG islands vs
a model of methylated sequence. Both models are learned from the
proportions of words in sets of sequences, respectively in the files
mus_cpg.fa and
mus_tem.fa.
First, we use the MPP algorithm and compute the segmentation
probability to build a maximal predictive partition.
Then we perform an HMM analysis. For this, we introduce the
information that in the mouse genome, CpG islands are on average
1,000 bases long, and are spaced on average 125,000 bases apart.
Finally, we draw the partitions computed by these methods alongside
some computed by CpG-specialised methods.
- We count the (1|1)-proportions (the proportions of each letter
conditional on the previous one) on the Lsequence built from each of
the two sets of sequences.
import lsequence
ls_cpg=lsequence.Lsequence(fic="mus_cpg.fa")
ls_tem=lsequence.Lsequence(fic="mus_tem.fa")
import compte
c_cpg=compte.Compte()
for s in ls_cpg:
    c_cpg.add_seq(s,2)
c_tem=compte.Compte()
for s in ls_tem:
    c_tem.add_seq(s,2)
p_cpg=c_cpg.strip().prop(lprior=1,lpost=1)
print p_cpg
p_tem=c_tem.strip().prop(lprior=1,lpost=1)
print p_tem
- We have to remove the N letters:
p_cpg=c_cpg.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print p_cpg
p_tem=c_tem.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print p_tem
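The prop(lprior=1,lpost=1) calls above estimate, for each letter, its proportion conditional on the previous letter. A plain-Python sketch of the same quantity (the function name is ours, not the library's):

```python
def conditional_props(seqs, alphabet="ACGT"):
    # P(b | a) for each dinucleotide ab, estimated from counts;
    # letters outside the alphabet (e.g. N) are simply skipped
    counts = {a: {b: 0 for b in alphabet} for a in alphabet}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            if a in counts and b in counts[a]:
                counts[a][b] += 1
    props = {}
    for a in alphabet:
        tot = sum(counts[a].values())
        props[a] = {b: counts[a][b] / float(tot) if tot else 0.0
                    for b in alphabet}
    return props
```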
- Then we build a Lproportion
for our modelling. We introduce the inter-state probabilities for
the HMM:
import lcompte
lp=lcompte.Lproportion()
lp[0]=p_cpg # state 0 for cpg islands
lp[1]=p_tem # state 1 for sequences between CpG islands
lp.g_inter(0,0,1-1/1000.0)
lp.g_inter(0,1,1/1000.0)
lp.g_inter(1,0,1/125000.0)
lp.g_inter(1,1,1-1/125000.0)
print lp
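These inter-state probabilities encode the mean segment lengths: with a self-transition probability 1 - p, the time spent in a state is geometric with mean 1/p, hence an exit probability of 1/1000 for islands and 1/125000 for the sequence between them. A small sanity-check sketch (the helper names are ours):

```python
def transition_probs(mean_len):
    # exit and self-transition probabilities of a state whose
    # segments are mean_len bases long on average (geometric dwell)
    p_exit = 1.0 / mean_len
    return p_exit, 1.0 - p_exit

def mean_dwell(p_exit):
    # mean of a geometric distribution with parameter p_exit
    return 1.0 / p_exit
```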
- To build partitions of the
studied sequence, we have to use a
Lexique.
import sequence
s_mus=sequence.Sequence(fic="mus2.fa")
import lexique
lx_hmm=lexique.Lexique()
lx_hmm.read_Lprop(lp)
- We compute the segmentation log-likelihood of the sequence under
the models, for up to 100 classes.
lprob=lx_hmm.log_likelihood(s_mus,100)
- We want to compute the variance of the distribution of the
segmentation probabilities of the Sequence.
- We build a Matrice corresponding to the predictions of the Lexique
lx_hmm on the Sequence.
import matrice
m=matrice.Matrice()
m.prediction(s_mus,lx_hmm)
- We build the Lexique for the 2nd moment, and compute it. Then we
compute the variance of the probabilities, as the difference between
the second moment and the squared first moment, in log space.
import math
lx2=lexique.Lexique(str="0:#0(2) 1:#1(2)")
lprob2=lx2.log_likelihood(m,100)
lvar=[]
for i in range(len(lprob)):
    lvar.append(lprob2[i]+math.log(1-math.exp(2*lprob[i]-lprob2[i]))) # log(M2-M1^2)
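The moments are combined in log space because the segmentation probabilities themselves are far too small to exponentiate directly. A standalone sketch of this trick (independent of the library objects above):

```python
import math

def log_variance(log_m1, log_m2):
    # log(Var) = log(M2 - M1^2), given log_m1 = log E[X] and
    # log_m2 = log E[X^2]; the larger term log_m2 is factored out
    return log_m2 + math.log(1.0 - math.exp(2.0 * log_m1 - log_m2))
```

For instance, a variable with E[X] = 2 and E[X^2] = 5 has variance 1, so its log-variance is 0.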
- We compute the MPP corresponding to the Lexique, for up to 50
classes; we will then draw it with the height of each arc
proportional to the C+G density.
import parti_simp
ps=parti_simp.Parti_simp()
ps.mpp(s_mus, lx_hmm, 50)
- We want to draw the partitioning such that the height of each
arc is proportional to the density in C+G. So, we use a Lexique with
the descriptor C+G.
lx_cg=lexique.Lexique(str="+(CG)")
ps.draw_nf("mus_ps_cg.ps",func=lambda s: \
lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
- We compute the a posteriori log-probabilities, with an
a priori probability P(ℙ_k) ∝ 0.546^k on the number of classes k,
and select the partition with the maximal estimator:
lap=[]
for i in range(len(lprob)):
    lap.append(lprob[i]+(i+1)*math.log(0.546))
nbc=lap.index(max(lap))+1
p_nbc=ps[nbc-1]
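The same selection rule (log-likelihood penalised by a geometric prior on the number of classes) can be written as a small generic helper, shown here as a sketch with our own function name:

```python
import math

def best_class_count(log_likelihoods, prior=0.546):
    # returns the k (1-based) maximising log P(D | k) + k*log(prior),
    # i.e. a MAP choice under a geometric prior on k classes
    scores = [lp + (k + 1) * math.log(prior)
              for k, lp in enumerate(log_likelihoods)]
    return scores.index(max(scores)) + 1
```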
- The next partition is computed via the Viterbi algorithm.
import partition
p_vit=partition.Partition()
p_vit.viterbi(s_mus,lx_hmm)
print p_vit
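For readers curious about what this call does internally, the Viterbi recursion for a two-state HMM can be sketched in log space; the toy parameters used to exercise it would be illustrative, not the ones learned from the mouse data:

```python
def viterbi(obs, states, log_start, log_trans, log_emit):
    # most probable state path, all parameters as log-probabilities
    v = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, v, ptr = v, {}, {}
        for s in states:
            # best predecessor of state s at this position
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            ptr[s] = best
            v[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
        back.append(ptr)
    # backtrack from the best final state
    path = [max(states, key=lambda s: v[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```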
- The following one is built via the forward-backward algorithm, by
taking the most probable state at each position. Beforehand, the
Matrice of the log-probabilities of
the states must be computed.
import matrice
m_fb=matrice.Matrice()
m_fb.fb(s_mus,lx_hmm)
p_fb=partition.Partition()
p_fb.read_Matrice(m_fb)
print p_fb
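Similarly, posterior decoding by forward-backward can be sketched in plain Python (unscaled probabilities, so only suitable for short toy sequences; any parameters fed to it here would again be illustrative):

```python
def posterior_states(obs, states, start, trans, emit):
    # most probable state at each position, via forward-backward
    n = len(obs)
    # forward: f[i][s] = P(obs[:i+1], state at i is s)
    f = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        f.append({s: emit[s][o] * sum(f[-1][r] * trans[r][s]
                                      for r in states) for s in states})
    # backward: b[i][s] = P(obs[i+1:] | state at i is s)
    b = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        b.append({s: sum(trans[s][r] * emit[r][o] * b[-1][r]
                         for r in states) for s in states})
    b.reverse()
    # per-position posterior argmax
    return [max(states, key=lambda s: f[i][s] * b[i][s])
            for i in range(n)]
```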
- And we draw the partitions:
p_nbc.draw_nf("mus_nbc_cg.ps",func=lambda s: \
lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
p_vit.draw_nf("mus_vit_cg.ps",func=lambda s: \
lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
p_fb.draw_nf("mus_fb_cg.ps",func=lambda s: \
lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
- We want to draw these partitions such that the height of each
arc is proportional to the ratio:
number of observed CpG / number of expected CpG
 = (number of CpG × length of the segment) / (number of C × number of G)
We use a Lexique with 3 descriptors: C, G and the word CG, and define a
function on a lexique, a sequence, and a segment, that returns the ratio.
lx_oe=lexique.Lexique(str="0:C 1:G 2:`CG'")
def height(l,sq,sg):
    dl=l.ls_evalue(sq,deb=sg.deb(),fin=sg.fin())
    if dl[1]*dl[0]==0:
        return 0
    else:
        return float(dl[2]*len(sg))/(dl[1]*dl[0])
And we draw the partitions:
p_nbc.draw_nf("mus_nbc_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
p_vit.draw_nf("mus_vit_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
p_fb.draw_nf("mus_fb_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
- We want to compare these methods with some specifically designed
for CpG island detection: CpGproD [PM01],
CpGplot [LGLP92] and CpG Island
Searcher [TJ02]. As the outputs of these programs have
different syntaxes, they are converted into separate files. Those files
are, respectively, cpgprod.part,
cpgplot.part, and
cpgis.part.
They are all put in a Parti_simp, together with the ones computed
with the HMM algorithms:
pscmp=parti_simp.Parti_simp()
pscmp.append(partition.Partition(fic="cpgprod.part"))
pscmp[-1].s_name("cpgprod")
pscmp.append(partition.Partition(fic="cpgplot.part"))
pscmp[-1].s_name("cpgplot")
pscmp.append(partition.Partition(fic="cpgis.part"))
pscmp[-1].s_name("cpgis")
pscmp.append(p_fb)
pscmp[-1].s_name("fb")
pscmp.append(p_vit)
pscmp[-1].s_name("vit")
pscmp.append(p_nbc)
pscmp[-1].s_name("mpp")
pscmp.draw_nf("tous_cpg.ps",seg=[0])