- From a sequence, we count the words of length 3.
import sequence
s=sequence.Sequence(fic="lambda.seq")
import compte
c=compte.Compte()
c.add_seq(s,3)
print c
print c.pref(2)
float(c['ag'])/c['a']
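As a rough illustration of what this counting does (a plain-Python sketch only, not compte's implementation; the toy string below is made up):
def count_words(seq, k):
    # count every overlapping word of length k in a plain string
    counts = {}
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        counts[w] = counts.get(w, 0) + 1
    return counts

print(count_words("acgtacgaacg", 3))   # 'acg' occurs 3 times in this toy string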
- Afterwards, we compute the proportion of each letter given the
letter before; we call these 1|1-proportions.
p=compte.Proportion()
p.read_Compte(c,lprior=1,lpost=1)
print p
Compare the value for the "word" A|G
with the ratio computed
earlier (float(c['ag'])/c['a']).
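For intuition, here is a plain-Python sketch of the normalisation behind 1|1-proportions, using an ordinary dict of 2-letter counts rather than the library's Compte and Proportion objects (so the names below are illustrative, not the library's API):
def proportions_1_1(counts2):
    # counts2 maps 2-letter words to counts; returns props[prior][post]
    props = {}
    for word, n in counts2.items():
        prior, post = word[0], word[1]
        props.setdefault(prior, {})[post] = n
    for row in props.values():
        total = float(sum(row.values()))
        for post in row:
            row[post] /= total   # proportion of post among words starting with prior
    return props

print(proportions_1_1({"aa": 2, "ag": 6, "ga": 4, "gg": 4}))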
- The end symbol ^ is not relevant for sequence generation, so we remove it.
p.read_Compte(c.rstrip(),lprior=1,lpost=1)
print p
- We generate a 10000-letter sequence from these 1|1-proportions.
s2=sequence.Sequence()
s2.read_prop(p,long=10000)
print s2[:10]
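The generation step itself is a first-order Markov walk. A minimal stand-in in plain Python could look like the following (read_prop is the library's way of doing this; the function below is only an illustration and assumes every letter occurs as a prior):
import random

def generate(props, length):
    # props[prior][post] as in the sketch above; start from a random letter
    current = random.choice(list(props.keys()))
    out = [current]
    while len(out) < length:
        r, acc = random.random(), 0.0
        for post, pr in props[current].items():
            acc += pr
            current = post
            if r <= acc:
                break
        out.append(current)
    return "".join(out)

print(generate({"a": {"a": 0.25, "g": 0.75}, "g": {"a": 0.5, "g": 0.5}}, 10))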
- Let's check the 1|1-proportions in s2
c2=compte.Compte()
c2.add_seq(s2,2)
p2=compte.Proportion()
p2.read_Compte(c2.rstrip(),lprior=1,lpost=1)
print p2
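To turn this check into a number, one could compare two proportion tables entry by entry; the sketch below works on plain nested dicts like those above, not on Proportion objects:
def max_diff(p_a, p_b):
    # largest absolute difference between two nested proportion dicts
    worst = 0.0
    for prior in p_a:
        for post in p_a[prior]:
            d = abs(p_a[prior][post] - p_b.get(prior, {}).get(post, 0.0))
            if d > worst:
                worst = d
    return worst

print(max_diff({"a": {"g": 0.75}}, {"a": {"g": 0.72}}))   # about 0.03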
- We translate proportion p into a Lexique.
import lexique
lx=lexique.Lexique()
lx.read_prop(p)
print lx
The rather intricate syntax of
Lexique is explained in the
descriptors part, and its use is
detailed in this part of the tutorial.
- We compute the mean log-likelihood of the Markov model
built from p, on both sequences.
lx.prediction(s)/len(s)
lx.prediction(s2)/len(s2)
Both are very close.
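Behind prediction, the quantity is the sum of the log-probabilities of each letter given the previous one. A plain-Python sketch (ignoring, for the moment, the very first letter, which the next point deals with):
import math

def loglik(props, seq):
    # sum of log p(letter | previous letter), starting at the second letter
    total = 0.0
    for i in range(1, len(seq)):
        total += math.log(props[seq[i - 1]][seq[i]])
    return total

props = {"a": {"a": 0.25, "g": 0.75}, "g": {"a": 0.5, "g": 0.5}}
print(loglik(props, "aggag") / (len("aggag") - 1))   # mean log-likelihood per letter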
- Be careful about the beginning of the sequence. In the next
example, a -100000 penalty is due to the missing proportion for the
beginning of the sequence; the last commands show the right way to
avoid this problem.
q=compte.Proportion()
q.read_Compte(c.strip(),lprior=1,lpost=1)
print q # character ^ has been removed
lx.read_prop(q)
lx.prediction(s)
lx.prediction(s,deb=1)
lx.prediction(s,deb=1)/(len(s)-1)
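One way to picture what happens: with no proportion available for the first position, its probability is effectively zero, and a large negative constant takes the place of its log (the output shows -100000); deb=1 simply starts the evaluation at the second letter. A plain-Python sketch, where the begin marker character and the penalty value are illustrative only:
import math

def loglik_deb(props, seq, deb=0, begin="^", penalty=-100000.0):
    total = 0.0
    for i in range(deb, len(seq)):
        prior = seq[i - 1] if i > 0 else begin   # first letter conditions on a begin marker
        p = props.get(prior, {}).get(seq[i], 0.0)
        total += math.log(p) if p > 0 else penalty
    return total

props = {"a": {"a": 0.25, "g": 0.75}, "g": {"a": 0.5, "g": 0.5}}   # no entry for the begin marker
print(loglik_deb(props, "aggag"))          # dominated by the -100000 penalty
print(loglik_deb(props, "aggag", deb=1))   # penalty avoided by skipping the first letter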
- And with longer priors:
p3=compte.Proportion()
p3.read_Compte(c.rstrip(),lprior=2,lpost=1)
print p3
lx2=lexique.Lexique()
lx2.read_prop(p3)
lx2.prediction(s)/len(s)
lx2.prediction(s2)/len(s2)
lx2.prediction(s,deb=2)/(len(s)-2)
lx2.prediction(s2,deb=2)/(len(s2)-2)
Both log-likelihoods are further apart than before, which suggests
that s has dependencies beyond order 1 that the generated sequence
s2 does not reproduce.
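The normalisation sketch given earlier generalises to a longer prior; here is an illustrative plain-dict version with a 2-letter prior (again not the library's code):
def proportions_2_1(counts3):
    # counts3 maps 3-letter words to counts; returns props[prior (2 letters)][post]
    props = {}
    for word, n in counts3.items():
        prior, post = word[:2], word[2]
        props.setdefault(prior, {})[post] = n
    for row in props.values():
        total = float(sum(row.values()))
        for post in row:
            row[post] /= total
    return props

print(proportions_2_1({"aga": 1, "agg": 3, "gaa": 2, "gag": 2}))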