- We count the 1|1-proportions on the Lsequence built from both sets of sequences.
import lsequence
ls_cpg=lsequence.Lsequence(fic="mus_cpg.fa")
ls_tem=lsequence.Lsequence(fic="mus_tem.fa")
import compte
c_cpg=compte.Compte()
for s in ls_cpg:
    c_cpg.add_seq(s,2)
c_tem=compte.Compte()
for s in ls_tem:
    c_tem.add_seq(s,2)
p_cpg=c_cpg.strip().prop(lprior=1,lpost=1)
print p_cpg
p_tem=c_tem.strip().prop(lprior=1,lpost=1)
print p_tem
- We have to remove the N letters:
p_cpg=c_cpg.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print p_cpg
p_tem=c_tem.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print p_tem
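In plain Python (a sketch, not the compte/lcompte modules; the function name and the nested-dict representation are illustrative), the 1|1 conditional proportions restricted to ACGT can be computed like this:

```python
from collections import defaultdict

def dinucleotide_proportions(seqs):
    """Conditional proportions P(next | previous) over dinucleotides,
    keeping only pairs where both letters are in ACGT (drops the Ns)."""
    counts = defaultdict(int)
    for s in seqs:
        s = s.upper()
        for a, b in zip(s, s[1:]):
            if a in "ACGT" and b in "ACGT":
                counts[(a, b)] += 1
    props = {}
    for prev in "ACGT":
        total = sum(counts[(prev, nxt)] for nxt in "ACGT")
        if total:  # only keep letters actually seen as a pair prefix
            props[prev] = {nxt: counts[(prev, nxt)] / total for nxt in "ACGT"}
    return props
```

For example, `dinucleotide_proportions(["AAACN"])` keeps the pairs AA, AA, AC and drops CN.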
- Then we build a Lproportion
for our HMM:
import lcompte
lp=lcompte.Lproportion()
lp[0]=p_cpg # state 0 for cpg islands
lp[1]=p_tem # state 1 for sequences between cpg islands
lp.g_inter(0,0,1-1/1000.0)
lp.g_inter(0,1,1/1000.0)
lp.g_inter(1,0,1/125000.0)
lp.g_inter(1,1,1-1/125000.0)
print lp
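The transition values above encode expected segment lengths: with staying probability 1 - 1/L, the sojourn time in a state is geometric with mean L (1000 positions for islands, 125000 for the background). A minimal sketch, with an assumed helper name:

```python
def transition_matrix(mean_lengths):
    """Row-stochastic transition matrix where state i has staying
    probability 1 - 1/L_i, i.e. a mean sojourn of L_i positions;
    the leaving mass 1/L_i is split evenly over the other states."""
    n = len(mean_lengths)
    T = [[0.0] * n for _ in range(n)]
    for i, L in enumerate(mean_lengths):
        T[i][i] = 1.0 - 1.0 / L
        for j in range(n):
            if j != i:
                T[i][j] = (1.0 / L) / (n - 1)
    return T

T = transition_matrix([1000.0, 125000.0])
```

With two states this reproduces exactly the four g_inter values set above.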
- To build partitions of the studied sequence, we use a Lexique.
import sequence
s_mus=sequence.Sequence(fic="mus2.fa")
import lexique
lx_hmm=lexique.Lexique()
lx_hmm.read_Lprop(lp)
- The first partition is computed via the Viterbi algorithm.
import partition
p_vit=partition.Partition()
p_vit.viterbi(s_mus,lx_hmm)
print p_vit
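The Viterbi computation itself can be sketched in log space on a toy two-state HMM (per-symbol emissions for brevity; the lexique above actually uses the 1|1 proportions):

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most probable state path, dynamic programming in log space."""
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        prev = V[-1]
        cur, ptr = {}, {}
        for s in states:
            # best predecessor of s for this position
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            ptr[s] = best
            cur[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][o])
        V.append(cur)
        back.append(ptr)
    # backtrack from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

On a G-rich then A-rich toy sequence the decoded path switches state at the composition boundary.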
- The second partition is built via the forward-backward algorithm, by
selecting the most probable state at each position. First, the
Matrice of the log-probabilities of
the states must be computed.
import matrice
m_fb=matrice.Matrice()
m_fb.fb(s_mus,lx_hmm)
p_fb=partition.Partition()
p_fb.read_Matrice(m_fb)
print p_fb
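Posterior decoding (picking the most probable state at each position from the forward-backward log-probabilities) can be sketched on the same kind of toy HMM:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def posterior_decode(obs, states, start, trans, emit):
    """Most probable state at each position (forward-backward, log space)."""
    # forward pass
    f = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    for o in obs[1:]:
        prev = f[-1]
        f.append({s: logsumexp([prev[r] + math.log(trans[r][s]) for r in states])
                     + math.log(emit[s][o]) for s in states})
    # backward pass, built back-to-front then reversed into position order
    b = [{s: 0.0 for s in states}]
    for o in reversed(obs[1:]):
        nxt = b[-1]
        b.append({s: logsumexp([math.log(trans[s][r]) + math.log(emit[r][o]) + nxt[r]
                                for r in states]) for s in states})
    b.reverse()
    # posterior for position i is proportional to exp(f[i][s] + b[i][s])
    return [max(states, key=lambda s: f[i][s] + b[i][s]) for i in range(len(obs))]
```

Unlike Viterbi, each position is labelled independently with its marginally most probable state.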
- We want to draw these partitions such that the height of each
arc is proportional to the C+G density.
We use a Lexique with the descriptor C+G.
lx_cg=lexique.Lexique(str="+(CG)")
And we draw the partitions:
p_vit.draw_nf("mus_vit_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
p_fb.draw_nf("mus_fb_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
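The value computed by the lambdas above (a C+G density over a segment) can be sketched in plain Python; the 0-based half-open indexing here is an assumption, the real deb()/fin() conventions may differ:

```python
def gc_density(seq, deb, fin):
    """Fraction of C or G letters on seq[deb:fin]."""
    window = seq[deb:fin].upper()
    return sum(1 for c in window if c in "CG") / len(window)
```

For example, `gc_density("AACCGGTT", 2, 6)` looks only at the CCGG window.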
- We want to draw these partitions such that the height of each
arc is proportional to the value:
  number of observed CpG / number of expected CpG
      = (number of CpG × length of the segment) / (number of C × number of G)
We use a Lexique with 3 descriptors: C, G, and the word CG, and define a
function of a lexique, a sequence, and a segment that returns this ratio.
lx_oe=lexique.Lexique(str="0:C 1:G 2:`CG'")
def height(l, sq, sg):
    dl=l.ls_evalue(sq,deb=sg.deb(),fin=sg.fin())
    return float(dl[2]*len(sg))/(dl[1]*dl[0])
And we draw the partitions:
p_vit.draw_nf("mus_vit_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
p_fb.draw_nf("mus_fb_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
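The same ratio can also be computed directly on a string, which may help check the arc heights (a naive sketch; the indexing conventions are again assumed):

```python
def cpg_obs_exp(seq, deb, fin):
    """Observed/expected CpG ratio on seq[deb:fin]:
    (#CG * segment length) / (#C * #G)."""
    w = seq[deb:fin].upper()
    n_c, n_g, n_cg = w.count("C"), w.count("G"), w.count("CG")
    return n_cg * len(w) / (n_c * n_g)
```

On alternating CG repeats the ratio is 2, above the usual CpG-island threshold of 0.6.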
- Now we compute the MPP corresponding to the Lexique, and we draw
it with the height of each arc proportional to the C+G density.
import parti_simp
ps=parti_simp.Parti_simp()
ps.mpp(s_mus, lx_hmm, 50)
ps.draw_nf("mus_ps_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
- We want to compare those methods with some specifically made for
CpG island detection: CpGproD [PM01],
CpGplot [LGLP92] and CpG Island
Searcher [TJ02]. As the outputs of these programs have
different syntaxes, they are converted into separate files. Those files
are, respectively, cpgprod.part,
cpgplot.part, and
cpgis.part.
They are all put in a Parti_simp, together with the ones computed
with the HMM algorithms:
pscmp=parti_simp.Parti_simp()
pscmp.append(partition.Partition(fic="cpgprod.part"))
pscmp[-1].s_name("cpgprod")
pscmp.append(partition.Partition(fic="cpgplot.part"))
pscmp[-1].s_name("cpgplot")
pscmp.append(partition.Partition(fic="cpgis.part"))
pscmp[-1].s_name("cpgis")
pscmp.append(p_fb)
pscmp[-1].s_name("fb")
pscmp.append(p_vit)
pscmp[-1].s_name("vit")
- We compare these partitions with the ones computed with MPP, and
put the closest ones together, using the number of common predictions
as the criterion.
lmax=[]
for p in pscmp:
    vmax=0
    imax=0
    for i in range(len(ps)):
        v=ps[i].pts_comm(p)
        if v>vmax:
            imax=i
            vmax=v
    lmax.append(imax)
j=1
for i in lmax:
    pscmp.insert(j,ps[i])
    j+=2
pscmp.draw_nf("tous_cpg.ps",seg=[0])
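The pts_comm criterion used above (the number of positions on which two partitions make the same prediction) can be sketched, assuming each partition is flattened into a per-position list of state labels:

```python
def common_points(labels1, labels2):
    """Number of positions carrying the same state label in both
    per-position label lists (a flat stand-in for Partition objects)."""
    return sum(1 for a, b in zip(labels1, labels2) if a == b)
```

For example, `common_points([0, 0, 1, 1], [0, 1, 1, 1])` counts three agreeing positions.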