- We count the 1|1-proportions on the Lsequences built from both sets of sequences.
import lsequence
ls_cpg=lsequence.Lsequence(fic="mus_cpg.fa")
ls_tem=lsequence.Lsequence(fic="mus_tem.fa")
import compte
c_cpg=compte.Compte()
for s in ls_cpg:
    c_cpg.add_seq(s,2)
c_tem=compte.Compte()
for s in ls_tem:
    c_tem.add_seq(s,2)
p_cpg=c_cpg.strip().prop(lprior=1,lpost=1)
print(p_cpg)
p_tem=c_tem.strip().prop(lprior=1,lpost=1)
print(p_tem)
- We have to remove the N letters (unknown nucleotides):
p_cpg=c_cpg.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print(p_cpg)
p_tem=c_tem.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print(p_tem)
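- Aside: the 1|1-proportions are the conditional frequencies of each letter given the one preceding it. What `prop(lprior=1,lpost=1)` computes from order-2 counts can be sketched on a plain string as follows (a toy illustration, not SARMENT's code):

```python
from collections import Counter, defaultdict

def cond_props(seq, alphabet="ACGT"):
    """Conditional proportions P(next letter | previous letter),
    estimated from the overlapping 2-letter words of seq."""
    # count 2-letter words, skipping any pair containing a letter
    # outside the alphabet (this drops the N's)
    pairs = Counter(seq[i:i+2] for i in range(len(seq) - 1)
                    if seq[i] in alphabet and seq[i+1] in alphabet)
    props = defaultdict(dict)
    for prev in alphabet:
        total = float(sum(pairs[prev + nxt] for nxt in alphabet))
        if total:
            for nxt in alphabet:
                props[prev][nxt] = pairs[prev + nxt] / total
    return props

p = cond_props("ACGTNCGCG")
# after dropping the pairs containing N, every C is followed by G,
# so p["C"]["G"] is 1.0, and G is followed half the time by C, half by T
```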
- Then we build an Lproportion for our modelling, and introduce the
inter-state transition probabilities of the HMM:
import lcompte
lp=lcompte.Lproportion()
lp[0]=p_cpg # state 0 for CpG islands
lp[1]=p_tem # state 1 for sequences between CpG islands
lp.g_inter(0,0,1-1/1000.0)
lp.g_inter(0,1,1/1000.0)
lp.g_inter(1,0,1/125000.0)
lp.g_inter(1,1,1-1/125000.0)
print(lp)
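- The self-transition probabilities above fix the expected segment lengths: a state that is left with probability 1/L at each position has a geometrically distributed sojourn time with mean L positions. A quick sanity check (plain Python, illustrative only):

```python
def mean_sojourn(p_stay):
    """Mean sojourn time in a Markov state with self-transition
    probability p_stay: the sojourn is geometric, mean 1/(1-p_stay)."""
    return 1.0 / (1.0 - p_stay)

mean_sojourn(1 - 1/1000.0)    # ~1000 positions inside CpG islands
mean_sojourn(1 - 1/125000.0)  # ~125000 positions between islands
```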
- To build partitions of the studied sequence, we use a Lexique.
import sequence
s_mus=sequence.Sequence(fic="mus2.fa")
import lexique
lx_hmm=lexique.Lexique()
lx_hmm.read_Lprop(lp)
- We compute the segmentation probabilities of the sequence under the
models used to build it, for up to 200 classes.
lprob=lx_hmm.probability(s_mus,200)
- We compute the logarithm of the Bayes factor: the probability of the
sequence under the segmentations into k segments for all k>=n, divided
by its probability under the segmentations into k segments for all
k<n. We then select the largest number of segments for which this
value exceeds 3.2.
import math
def expmean(l): # computes the logarithm of the mean of exponentials
    m=max(l)
    x=0
    for i in l:
        x+=math.exp(i-m)
    return math.log(x/len(l))+m
lbay=[]
for i in range(1,len(lprob)):
    lbay.append(expmean(lprob[i:])-expmean(lprob[:i]))
nbc=lbay.index([x for x in lbay if x>=3.2][-1])+2
print(nbc)
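- Note that expmean is a log-mean-exp: subtracting the maximum before exponentiating keeps the computation stable even for the very negative log-probabilities that lprob contains. A small check of the definition (standalone copy of the function):

```python
import math

def expmean(l):  # logarithm of the mean of exponentials, as above
    m = max(l)
    return math.log(sum(math.exp(i - m) for i in l) / len(l)) + m

# the mean of exp(log 1) and exp(log 3) is 2, so the result is ~log 2
expmean([math.log(1), math.log(3)])

# a naive exp() would underflow here; the max-shift keeps it exact
expmean([-2000.0, -2000.0])  # == -2000.0
```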
- We compute the MPP corresponding to the Lexique:
import parti_simp
ps=parti_simp.Parti_simp()
ps.mpp(s_mus, lx_hmm, nbc+30)
- We want to draw the partitioning such that the height of each
arc is proportional to the density in C+G. So, we use a Lexique with
the descriptor C+G.
lx_cg=lexique.Lexique(str="+(CG)")
ps.draw_nf("mus_ps_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
- And we keep the most relevant partition:
p_nbc=ps[nbc-1]
- The next partition is computed via the Viterbi algorithm.
import partition
p_vit=partition.Partition()
p_vit.viterbi(s_mus,lx_hmm)
print(p_vit)
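- Aside: p_vit.viterbi computes the single most probable state path. For reference, a minimal log-space Viterbi for a two-state HMM over plain dicts can be sketched as follows (a toy version with made-up numbers, not SARMENT's implementation):

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Most probable state path of a discrete HMM, in log space.
    log_init[s], log_trans[s][t] and log_emit[s][symbol] are dicts."""
    v = [{s: log_init[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for x in obs[1:]:
        prev, col, ptr = v[-1], {}, {}
        for t in states:
            # best predecessor of state t at this position
            best = max(states, key=lambda s: prev[s] + log_trans[s][t])
            col[t] = prev[best] + log_trans[best][t] + log_emit[t][x]
            ptr[t] = best
        v.append(col)
        back.append(ptr)
    # backtrack from the best final state
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# toy two-state model: "cpg" prefers C/G, "tem" prefers A/T
states = ("tem", "cpg")
li = {s: math.log(0.5) for s in states}
lt = {s: {t: math.log(0.9 if s == t else 0.1) for t in states}
      for s in states}
le = {"cpg": {x: math.log(0.4 if x in "CG" else 0.1) for x in "ACGT"},
      "tem": {x: math.log(0.35 if x in "AT" else 0.15) for x in "ACGT"}}
path = viterbi("AATTCGCG", states, li, lt, le)  # 4 "tem" then 4 "cpg"
```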
- The following one is built via the forward-backward algorithm, by
taking the most probable state at each position. Beforehand, the
Matrice of the log-probabilities of the states must be computed.
import matrice
m_fb=matrice.Matrice()
m_fb.fb(s_mus,lx_hmm)
p_fb=partition.Partition()
p_fb.read_Matrice(m_fb)
print(p_fb)
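- Aside: the forward-backward recursion behind m_fb.fb can be sketched in probability space for a toy model (no rescaling, so only suitable for short sequences; made-up numbers, not SARMENT's code):

```python
def posterior_states(obs, states, init, trans, emit):
    """Posterior decoding: most probable state at EACH position,
    via the forward-backward algorithm. Real implementations rescale
    or work in log space to avoid underflow."""
    n = len(obs)
    # forward pass
    fwd = [{s: init[s] * emit[s][obs[0]] for s in states}]
    for x in obs[1:]:
        prev = fwd[-1]
        fwd.append({t: sum(prev[s] * trans[s][t] for s in states) * emit[t][x]
                    for t in states})
    # backward pass
    bwd = [{s: 1.0 for s in states}]
    for x in reversed(obs[1:]):
        nxt = bwd[0]
        bwd.insert(0, {s: sum(trans[s][t] * emit[t][x] * nxt[t] for t in states)
                       for s in states})
    # most probable state at each position
    return [max(states, key=lambda s: fwd[i][s] * bwd[i][s]) for i in range(n)]

# same toy two-state model as for the Viterbi sketch
states = ("tem", "cpg")
init = {"tem": 0.5, "cpg": 0.5}
trans = {s: {t: (0.9 if s == t else 0.1) for t in states} for s in states}
emit = {"cpg": {x: (0.4 if x in "CG" else 0.1) for x in "ACGT"},
        "tem": {x: (0.35 if x in "AT" else 0.15) for x in "ACGT"}}
post = posterior_states("AATTCGCG", states, init, trans, emit)
```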
- And we draw the partitions:
p_nbc.draw_nf("mus_nbc_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
p_vit.draw_nf("mus_vit_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
p_fb.draw_nf("mus_fb_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
- We want to draw these partitions such that the height of each arc is
proportional to the observed/expected CpG ratio:

    number of observed CpG     number of CpG * length of the segment
    ----------------------  =  -------------------------------------
    number of expected CpG         number of C * number of G

We use a Lexique with 3 descriptors: C, G, and the word CG, and define
a function, taking a lexique, a sequence, and a segment, that returns
this ratio.
lx_oe=lexique.Lexique(str="0:C 1:G 2:'CG'")
def height(l,sq,sg):
    dl=l.ls_evalue(sq,deb=sg.deb(),fin=sg.fin())
    return float(dl[2]*len(sg))/(dl[1]*dl[0])
And we draw the partitions:
p_nbc.draw_nf("mus_nbc_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
p_vit.draw_nf("mus_vit_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
p_fb.draw_nf("mus_fb_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
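- The same observed/expected ratio can be checked on a raw string with plain Python, independently of the Lexique machinery:

```python
def cpg_oe_ratio(seq):
    """Observed/expected CpG ratio of a sequence:
    (number of CpG * length) / (number of C * number of G)."""
    n_c, n_g = seq.count("C"), seq.count("G")
    n_cg = sum(1 for i in range(len(seq) - 1) if seq[i:i+2] == "CG")
    return float(n_cg * len(seq)) / (n_c * n_g)

cpg_oe_ratio("CGCGAT")  # 2 CpG, 2 C, 2 G, length 6 -> 3.0
```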
- We want to compare these methods with some made specifically for
CpG island detection: CpGproD [PM01],
CpGplot [LGLP92] and CpG Island
Searcher [TJ02]. As the outputs of these programs have
different syntaxes, they are converted into the files
cpgprod.part,
cpgplot.part, and
cpgis.part, respectively.
They are all put in a Parti_simp, together with the partitions
computed with the HMM algorithms:
pscmp=parti_simp.Parti_simp()
pscmp.append(partition.Partition(fic="cpgprod.part"))
pscmp[-1].s_name("cpgprod")
pscmp.append(partition.Partition(fic="cpgplot.part"))
pscmp[-1].s_name("cpgplot")
pscmp.append(partition.Partition(fic="cpgis.part"))
pscmp[-1].s_name("cpgis")
pscmp.append(p_fb)
pscmp[-1].s_name("fb")
pscmp.append(p_vit)
pscmp[-1].s_name("vit")
pscmp.append(p_nbc)
pscmp[-1].s_name("mpp")
pscmp.draw_nf("tous_cpg.ps",seg=[0])