- From a sequence, we count the words of length 3.
import sequence
s=sequence.Sequence(fic="lambda.seq")
import compte
c=compte.Compte()
c.add_seq(s,3)
print c
print c.pref(2)
float(c['ag'])/c['a']
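For intuition, here is the same kind of count and ratio in plain Python;
the toy string seq is a stand-in for the lambda sequence, and none of
this uses the library.
# Illustration only, not the library's implementation.
seq = "aggctagagg"                      # toy stand-in for the lambda sequence
count = {}
for l in (1, 2):                        # count the words of length 1 and 2
    for i in range(len(seq) - l + 1):
        w = seq[i:i+l]
        count[w] = count.get(w, 0) + 1
print float(count.get("ag", 0)) / count["a"]   # proportion of a's followed by g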
- Afterwards, we compute the proportions of letters, given the
letter before. We call them 1|1-proportions.
p=compte.Proportion()
p.read_Compte(c,lprior=1,lpost=1)
print p
Compare the value for the "word" A|G with the ratio computed before.
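Conceptually, a 1|1-proportion is the frequency of each letter conditioned
on the single letter before it. A minimal plain-Python sketch of that
computation (illustration only, not the library's implementation; the toy
string and the printed layout are assumptions):
# 1|1-proportions computed from bigram counts.
seq = "aggctagagg"
big = {}                                 # big[prior][post] = count of the bigram
for i in range(len(seq) - 1):
    prior, post = seq[i], seq[i+1]
    big.setdefault(prior, {})
    big[prior][post] = big[prior].get(post, 0) + 1
for prior in big:
    tot = float(sum(big[prior].values()))
    for post in big[prior]:
        print "%s|%s %.3f" % (post, prior, big[prior][post] / tot)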
- The end symbol
^
is not relevant for sequence generation,
so we remove it.
p.read_Compte(c.rstrip(),lprior=1,lpost=1)
print p
- We generate a sequence of 10000 letters, using these
1|1-proportions.
s2=sequence.Sequence()
s2.read_prop(p,long=10000)
print s2[:10]
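Conceptually, the generation draws each letter from the proportions
conditioned on the previous letter. A stand-alone sketch of this sampling,
where the uniform prop table and the arbitrary first letter are assumptions
made for the illustration:
# Illustrative sampler: draw each letter given the previous one.
import random
prop = {'a': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},   # toy 1|1-proportions
        'c': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
        'g': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
        't': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}}
out = ['a']                                                   # arbitrary first letter
for i in range(9999):
    r = random.random()
    cum = 0.0
    for letter, pr in prop[out[-1]].items():
        cum += pr
        if r < cum:
            out.append(letter)
            break
    else:
        out.append(letter)                                    # guard against rounding
print "".join(out[:10])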
- Let's check the 1|1-proportions in s2
c2=compte.Compte()
c2.add_seq(s2,2)
p2=compte.Proportion()
p2.read_Compte(c2.rstrip(),lprior=1,lpost=1)
print p2
- We translate proportion p into a Lexique.
import descripteur
d=descripteur.Descripteur(1,prop=p)
import lexique
lx=lexique.Lexique()
lx[1]=d
print lx
The rather intricate syntax of Lexique is explained in the
descriptors part, and its use is detailed in this part of
the tutorial.
- We compute the mean log-likelihood of the Markov model
defined by p, on both sequences.
print lx.prediction(s)/len(s)
print lx.prediction(s2)/len(s2)
Both are very close.
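Conceptually (leaving aside the first letter and the base of the logarithm,
which are not specified here), this mean log-likelihood is the average of
log P(letter | previous letter) along the sequence. A minimal sketch,
assuming prop[prior][post] holds the 1|1-proportions:
# Illustration only: mean log-likelihood under a first-order Markov model.
import math
def mean_loglik(seq, prop):
    tot = 0.0
    for i in range(1, len(seq)):
        tot += math.log(prop[seq[i-1]][seq[i]])
    return tot / (len(seq) - 1)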
- Be careful with the beginning of the sequence. In the next
example, a -100000 penalty comes from the missing proportion for the
beginning of the sequence; the last two commands show how to avoid
this problem.
q=compte.Proportion()
q.read_Compte(c.strip(),lprior=1,lpost=1)
print q # character ^ has been removed
d.read_prop(q)
lx[1]=d
print lx.prediction(s)
print lx.prediction(s,deb=1)
print lx.prediction(s,deb=1)/(len(s)-1)
- And with longer priors:
p3=compte.Proportion()
p3.read_Compte(c.rstrip(),lprior=2,lpost=1)
print p3
d.read_prop(p3)
lx2=lexique.Lexique()
lx2[1]=d
print lx2.prediction(s)/len(s)
print lx2.prediction(s2)/len(s2)
print lx2.prediction(s,deb=2)/(len(s)-2)
print lx2.prediction(s2,deb=2)/(len(s2)-2)
The two mean log-likelihoods are further apart than before.
- We load the sequence and the Lexique corresponding to the HMM.
import sequence
s=sequence.Sequence(fic="lambda.seq")
len(s)
import lexique
lx=lexique.Lexique(fprop="lprop1")
print lx
- We build and draw the partition given by the states of the
most probable path, computed with the Viterbi algorithm.
import partition
p=partition.Partition()
p.viterbi(s,lx)
len(p)
print p
p.draw_nf("lambda1.ps",num=1)
- We compute the Matrice of the log-probabilities of the
descriptors, given the sequence and the model, by the
forward-backward algorithm.
import matrice
m=matrice.Matrice()
m.fb(s,lx)
print m[:10]
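Likewise, a generic forward-backward sketch (not the library's code) that
returns, for each position, the posterior probability of each state given
the whole sequence; the Matrice above stores the logarithms of such
quantities for the descriptors. Here init, trans and emit hold plain
probabilities, again as assumptions of the sketch:
# Generic forward-backward with per-position scaling.
def forward_backward(obs, states, init, trans, emit):
    n = len(obs)
    fwd = [{} for i in range(n)]
    scale = [0.0] * n
    for k in states:                                  # initialisation
        fwd[0][k] = init[k] * emit[k][obs[0]]
    scale[0] = sum(fwd[0].values())
    for k in states:
        fwd[0][k] /= scale[0]
    for i in range(1, n):                             # forward pass
        for k in states:
            fwd[i][k] = emit[k][obs[i]] * sum(fwd[i-1][j] * trans[j][k] for j in states)
        scale[i] = sum(fwd[i].values())
        for k in states:
            fwd[i][k] /= scale[i]
    bwd = [{} for i in range(n)]                      # backward pass
    for k in states:
        bwd[n-1][k] = 1.0
    for i in range(n-2, -1, -1):
        for k in states:
            bwd[i][k] = sum(trans[k][j] * emit[j][obs[i+1]] * bwd[i+1][j] for j in states)
            bwd[i][k] /= scale[i+1]
    post = []                                         # posterior per position
    for i in range(n):
        tot = sum(fwd[i][k] * bwd[i][k] for k in states)
        post.append(dict((k, fwd[i][k] * bwd[i][k] / tot) for k in states))
    return post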
- We build a new partition, in which each segment is a maximal run
of consecutive positions sharing the same most probable descriptor
in the former Matrice.
p2=partition.Partition()
p2.read_Matrice(m)
len(p2)
p2.draw_nf("lambda2.ps",num=1)
- We now draw the partition such that the height of each arc is
proportional to minus the density of log-probability of the model on
the segment, i.e. the log-probability divided by the segment length.
p2.draw_nf("lambda_val.ps",num=1,func=lambda x: -x.val()/len(x))
- And we want to draw only the segments described by descriptor 3.
p2.draw_nf("lambda_val3.ps",seg=[3],func=lambda x: -x.val()/len(x))
- We compute the proportion of common descriptions between both
partitions.
float(p2.pts_comm(p))/len(s)
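pts_comm counts the positions where both partitions carry the same
description, so the fraction above is that count divided by the sequence
length. A tiny stand-alone equivalent on hypothetical per-position label
lists (not library objects):
# Agreement between two per-position labellings.
labels_a = [0, 0, 1, 1, 1, 0]
labels_b = [0, 1, 1, 1, 0, 0]
common = sum(1 for x, y in zip(labels_a, labels_b) if x == y)
print float(common) / len(labels_a)      # 4 common positions out of 6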
- We build the 50-partitioning from the predictions of the
lexique, and draw it.
import parti_simp
ps=parti_simp.Parti_simp()
ps.mpp(s,lx,50)
ps.draw_nf("lambda_ps.ps",num=1)
- We compute the list of the similarities between the partitions
of ps and both partitions p and p2.
l=[]
for i in range(len(ps)):
    l.append([i+1,float(ps[i].pts_comm(p))/len(s),float(ps[i].pts_comm(p2))/len(s)])
for i in l:
    print i
Compare the best scores, and the corresponding numbers of segments,
with the actual numbers of segments of p and p2.
- We count the 1|1-proportions on the Lsequences built from both
sets of sequences.
import lsequence
ls_cpg=lsequence.Lsequence(fic="mus_cpg.fa")
ls_tem=lsequence.Lsequence(fic="mus_tem.fa")
import compte
c_cpg=compte.Compte()
for s in ls_cpg:
    c_cpg.add_seq(s,2)
c_tem=compte.Compte()
for s in ls_tem:
    c_tem.add_seq(s,2)
p_cpg=c_cpg.strip().prop(lprior=1,lpost=1)
print p_cpg
p_tem=c_tem.strip().prop(lprior=1,lpost=1)
print p_tem
- We have to remove the N letters:
p_cpg=c_cpg.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print p_cpg
p_tem=c_tem.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
print p_tem
- Then we build an Lproportion for our HMM:
import lcompte
lp=lcompte.Lproportion()
lp[0]=p_cpg # state 0 for CpG islands
lp[1]=p_tem # state 1 for sequences between CpG islands
lp.g_inter(0,0,1-1/1000.0)
lp.g_inter(0,1,1/1000.0)
lp.g_inter(1,0,1/125000.0)
lp.g_inter(1,1,1-1/125000.0)
print lp
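As a side note on these transition probabilities: with such an HMM the
segment lengths are geometrically distributed, so a self-transition
probability p gives a mean segment length of 1/(1-p). A quick check
(illustration only):
# Mean segment length implied by each self-transition probability.
p_stay_cpg = 1 - 1/1000.0      # state 0: CpG islands
p_stay_tem = 1 - 1/125000.0    # state 1: between islands
print 1/(1 - p_stay_cpg)       # about 1000
print 1/(1 - p_stay_tem)       # about 125000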
- To build partitions of the studied sequence, we have to use a
Lexique.
import sequence
s_mus=sequence.Sequence(fic="mus2.fa")
import lexique
lx_hmm=lexique.Lexique()
lx_hmm.read_Lprop(lp)
- The first partition is computed via Viterbi algorithm.
import partition
p_vit=partition.Partition()
p_vit.viterbi(s_mus,lx_hmm)
print p_vit
- The second partition is built via the forward-backward algorithm,
taking the most probable state at each position. First, the Matrice
of the log-probabilities of the states must be computed.
import matrice
m_fb=matrice.Matrice()
m_fb.fb(s_mus,lx_hmm)
p_fb=partition.Partition()
p_fb.read_Matrice(m_fb)
print p_fb
- We want to draw these partitions such that the height of each
arc is proportional to the density in C+G.
We use a Lexique with the descriptor C+G.
lx_cg=lexique.Lexique(str="+(CG)")
And we draw the partitions:
p_vit.draw_nf("mus_vit_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
p_fb.draw_nf("mus_fb_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
- We want to draw these partitions such that the height of each
arc is proportional to the value:
  number of observed CpG / number of expected CpG
    = (number of CpG × length of the segment) / (number of C × number of G)
We use a Lexique with 3 descriptors, C, G and the word CG, and define
a function of a lexique, a sequence and a segment that returns this ratio.
lx_oe=lexique.Lexique(str="0:C 1:G 2:`CG'")
def height(l,sq, sg):
    dl=l.ls_evalue(sq,deb=sg.deb(),fin=sg.fin())   # values of descriptors 0 (C), 1 (G), 2 (CG) on the segment
    return float(dl[2]*len(sg))/(dl[1]*dl[0])      # observed/expected CpG ratio
And we draw the partitions:
p_vit.draw_nf("mus_vit_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
p_fb.draw_nf("mus_fb_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
- Now we compute the MPP corresponding to the Lexique, and we draw
it with the height of each arc proportional to the C+G density.
import parti_simp
ps=parti_simp.Parti_simp()
ps.mpp(s_mus, lx_hmm, 50)
ps.draw_nf("mus_ps_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
- We want to compare these methods with some designed specifically for
CpG island detection: CpGproD [PM01],
CpGplot [LGLP92] and CpG Island
Searcher [TJ02]. As the outputs of these programs have
different syntaxes, they are converted into separate files,
respectively cpgprod.part,
cpgplot.part, and
cpgis.part.
They are all put in a Parti_simp, together with the partitions computed
with the HMM algorithms:
pscmp=parti_simp.Parti_simp()
pscmp.append(partition.Partition(fic="cpgprod.part"))
pscmp[-1].s_name("cpgprod")
pscmp.append(partition.Partition(fic="cpgplot.part"))
pscmp[-1].s_name("cpgplot")
pscmp.append(partition.Partition(fic="cpgis.part"))
pscmp[-1].s_name("cpgis")
pscmp.append(p_fb)
pscmp[-1].s_name("fb")
pscmp.append(p_vit)
pscmp[-1].s_name("vit")
- We compare these partitions with the ones computed with MPP, and
put the closest ones together, the criterion being the number of
identical predictions.
lmax=[]
for p in pscmp:
    # index of the ps partition sharing the most points with p
    max=0
    imax=0
    for i in range(len(ps)):
        v=ps[i].pts_comm(p)
        if v>max:
            imax=i
            max=v
    lmax.append(imax)
# insert each closest MPP partition right after the corresponding partition
j=1
for i in lmax:
    pscmp.insert(j,ps[i])
    j+=2
pscmp.draw_nf("tous_cpg.ps",seg=[0])