Counting on a sequence

In this section, we compute the proportions of words in a sequence read from the file lambda.seq (which uses a specific format), randomly generate a new sequence from these proportions, and compute the likelihood of models on both sequences.

First, we count the words of length 3 in the sequence.
import sequence
s = sequence.Sequence(fic="lambda.seq")
import compte
c = compte.Compte()
c.add_seq(s, 3)
print c
print c.pref(2)
float(c['ag'])/c['a']
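To make the counting step concrete, here is a minimal sketch in plain Python that does not use the library. The helper `count_words` and the toy sequence are hypothetical, and counting every word of length 1 up to the maximum is an assumption, suggested only by the fact that the example above looks up both `c['a']` and `c['ag']` after `add_seq(s,3)`.

```python
from collections import Counter


def count_words(seq, max_len):
    """Count every word of length 1..max_len in seq (hypothetical helper)."""
    counts = Counter()
    for k in range(1, max_len + 1):
        # Slide a window of width k over the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


seq = "agagcta"  # toy sequence, not taken from lambda.seq
counts = count_words(seq, 3)

# Same kind of ratio as float(c['ag'])/c['a'] in the tutorial.
ratio = counts["ag"] / counts["a"]
print(counts["a"], counts["ag"], ratio)
```

The ratio is the share of occurrences of `a` that are followed by `g`, computed directly from raw counts.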
Next, we compute the proportion of each letter given the letter before it. We call these 1|1-proportions.
p = compte.Proportion()
p.read_Compte(c, lprior=1, lpost=1)
print p
Compare the value for the "word" A|G with the ratio computed above.
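As an illustration of what a 1|1-proportion is, the following plain-Python sketch estimates the proportion of each letter given the previous one. The function `conditional_proportions` is a hypothetical helper, not part of the library, and it ignores the `^` symbol the library uses at sequence boundaries.

```python
from collections import Counter, defaultdict


def conditional_proportions(seq):
    """Return {previous_letter: {next_letter: proportion}} estimated from seq."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(seq, seq[1:]):
        pair_counts[prev][nxt] += 1
    props = {}
    for prev, followers in pair_counts.items():
        total = sum(followers.values())
        # Normalise so that the proportions after each letter sum to 1.
        props[prev] = {nxt: n / total for nxt, n in followers.items()}
    return props


props = conditional_proportions("agagcta")  # toy sequence
print(props["a"])
print(props["g"])
```

In this toy sequence the last `a` has no successor, so the proportion of `g` after `a` is 1.0, while the raw ratio count("ag")/count("a") is 2/3. Boundary effects of this kind are one reason the two values may differ slightly.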
The end symbol ^ is not relevant for sequence generation, so we remove it.
p.read_Compte(c.rstrip(), lprior=1, lpost=1)
print p
We generate a sequence of 10000 letters from these 1|1-proportions.
s2 = sequence.Sequence()
s2.read_prop(p, long=10000)
print s2[:10]
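Generating a sequence from 1|1-proportions amounts to simulating a first-order Markov chain. The sketch below shows the idea in plain Python. The function `generate`, the starting letter, the seed and the proportion values are all illustrative assumptions, and this is not a description of how `read_prop` is implemented.

```python
import random


def generate(props, length, start, seed=None):
    """Draw `length` letters, each one given the previous letter (hypothetical helper)."""
    rng = random.Random(seed)
    letters = [start]
    while len(letters) < length:
        followers = props[letters[-1]]
        # Pick the next letter with probability equal to its proportion.
        nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        letters.append(nxt)
    return "".join(letters)


# Illustrative proportions only; they are not estimated from lambda.seq.
props = {
    "a": {"a": 0.1, "c": 0.2, "g": 0.6, "t": 0.1},
    "c": {"a": 0.3, "c": 0.3, "g": 0.2, "t": 0.2},
    "g": {"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4},
    "t": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
}

s2 = generate(props, 10000, start="a", seed=0)
print(s2[:10])
```

Fixing the seed makes the generated sequence reproducible, which is convenient when comparing proportions between runs.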
Let’s check the 1|1-proportions in s2.
c2 = compte.Compte()
c2.add_seq(s2, 2)
p2 = compte.Proportion()
p2.read_Compte(c2.rstrip(), lprior=1, lpost=1)
print p2
We translate the proportion p into a Lexique.
import descripteur
d = descripteur.Descripteur(1, prop=p)
import lexique
lx = lexique.Lexique()
lx[1] = d
print lx
The rather intricate syntax of Lexique is explained in the descriptors part, and its use is detailed in a separate tutorial part.
We compute the mean log-likelihood of the Markov model built from p on both sequences.
print lx.prediction(s)/len(s)
print lx.prediction(s2)/len(s2)
The two values are very close.
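The mean log-likelihood of a first-order Markov model can be sketched as the average of the log-proportions of each letter given the one before it. The plain-Python version below uses hypothetical names and toy proportions. The natural logarithm and the division by the number of transitions are assumptions, and the library may use a different base or normalisation.

```python
import math


def mean_log_likelihood(seq, props, start=1):
    """Average log-proportion of each letter given its predecessor (hypothetical helper)."""
    total = 0.0
    for i in range(start, len(seq)):
        total += math.log(props[seq[i - 1]][seq[i]])
    # Divide by the number of letters actually scored.
    return total / (len(seq) - start)


# Toy two-letter model, for illustration only.
props = {
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.25, "b": 0.75},
}

value = mean_log_likelihood("aab", props)
print(value)
```

For the sequence "aab" both transitions have proportion 0.5, so the result equals log(0.5).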
Be careful about the beginning of the sequence. In the next example, the first prediction incurs a -100000 penalty because no proportion is defined for the beginning of the sequence. The following predictions avoid this problem by starting at position 1 (deb=1).
q = compte.Proportion()
q.read_Compte(c.strip(), lprior=1, lpost=1)
print q  # character ^ has been removed
d.read_prop(q)
lx[1] = d
print lx.prediction(s)
print lx.prediction(s, deb=1)
print lx.prediction(s, deb=1)/(len(s)-1)
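The effect of the starting position can be illustrated with a small plain-Python sketch. It assumes that a letter whose context has no proportion contributes a fixed penalty of -100000 (the value quoted above) instead of a log-proportion. This is only a guess at the mechanism, used to show why skipping the first position removes the penalty. The function `log_likelihood`, its `deb` argument and the `begin` symbol are hypothetical stand-ins.

```python
import math

PENALTY = -100000.0  # value quoted in the text; how the library applies it is assumed


def log_likelihood(seq, props, deb=0, begin="^"):
    """Sum log-proportions from position deb; missing proportions cost PENALTY."""
    total = 0.0
    for i in range(deb, len(seq)):
        # The first letter has no predecessor, so its context is the begin symbol.
        prev = seq[i - 1] if i > 0 else begin
        p = props.get(prev, {}).get(seq[i], 0.0)
        total += math.log(p) if p > 0 else PENALTY
    return total


# Toy proportions with no entry for the begin symbol "^".
props = {
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.5, "b": 0.5},
}

seq = "abab"
from_start = log_likelihood(seq, props)        # includes the penalty
from_one = log_likelihood(seq, props, deb=1)   # first letter skipped
print(from_start, from_one)
```

Starting at position 1 means every scored letter has a known predecessor, so only genuine log-proportions are summed.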
And with longer priors:
p3 = compte.Proportion()
p3.read_Compte(c.rstrip(), lprior=2, lpost=1)
print p3
d.read_prop(p3)
lx2 = lexique.Lexique()
lx2[1] = d
print lx2.prediction(s)/len(s)
print lx2.prediction(s2)/len(s2)
print lx2.prediction(s, deb=2)/(len(s)-2)
print lx2.prediction(s2, deb=2)/(len(s2)-2)
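The two log-likelihoods are further apart than before. This is plausible, since s2 was generated from 1|1-proportions only and therefore carries none of the longer-range dependencies that a prior of length 2 can capture in s.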
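To show what a longer prior changes, the sketch below estimates the proportion of each letter given the `lprior` preceding letters, in plain Python. The function `proportions` is a hypothetical helper that merely borrows the name of the `lprior` parameter, and the sequence is a toy example.

```python
from collections import Counter, defaultdict


def proportions(seq, lprior):
    """Return {context: {next_letter: proportion}} for contexts of length lprior."""
    counts = defaultdict(Counter)
    for i in range(lprior, len(seq)):
        counts[seq[i - lprior:i]][seq[i]] += 1
    result = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        result[context] = {x: n / total for x, n in followers.items()}
    return result


seq = "abcabd"  # toy sequence
order1 = proportions(seq, 1)
order2 = proportions(seq, 2)
print(order1["b"])
print(order2["ab"])
```

With a prior of length 2 the number of contexts grows, so each proportion is estimated from fewer occurrences and the model fits the training sequence more specifically.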