1  Counting on a sequence



In this section, we compute the proportions of words in a sequence read from the file lambda.seq (a file in a specific sequence format), randomly generate a new sequence from these proportions, and compute the likelihood of the corresponding models on both sequences.
  1. From a sequence, we count the words of length 3.
    import sequence
    s=sequence.Sequence(fic="lambda.seq")
    import compte
    c=compte.Compte()
    c.add_seq(s,3)
    print c
    print c.pref(2)
    float(c['ag'])/c['a']    # rate of letter g after letter a
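
    A minimal sketch of such word counting in plain Python, independent of the compte module (count_words is a hypothetical helper; unlike Compte, it counts words of a single length only):

    def count_words(seq, k):
        # slide a window of length k over the sequence and
        # count each word in a dictionary
        counts = {}
        for i in range(len(seq) - k + 1):
            word = seq[i:i+k]
            counts[word] = counts.get(word, 0) + 1
        return counts

    print(count_words("agctagcag", 3))
    # {'agc': 2, 'gct': 1, 'cta': 1, 'tag': 1, 'gca': 1, 'cag': 1}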
    


  2. Next, we compute the proportion of each letter given the preceding letter. We call these 1|1-proportions.
    p=compte.Proportion()
    p.read_Compte(c,lprior=1,lpost=1)
    print p
    
    Compare the value for "word" A|G with the rate computed before.
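
    Conceptually, these 1|1-proportions are just conditional frequencies built from the 2-letter and 1-letter counts. A minimal sketch in plain Python, assuming counts dictionaries like those from the sketch above (cond_props is a hypothetical helper):

    def cond_props(counts2, counts1):
        # proportion of letter y after letter x:
        # count('xy') / count('x')
        props = {}
        for word, n in counts2.items():
            x, y = word[0], word[1]
            props.setdefault(x, {})[y] = float(n) / counts1[x]
        return props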

  3. The end symbol ^ is irrelevant for sequence generation, so we remove it from the counts.
    p.read_Compte(c.rstrip(),lprior=1,lpost=1)
    print p
    


  4. We generate a 10000-letter sequence from these 1|1-proportions.
    s2=sequence.Sequence()
    s2.read_prop(p,long=10000)
    print s2[:10]
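
    Generation from 1|1-proportions amounts to a random walk in a first-order Markov chain. A minimal sketch in plain Python, assuming a props dict of dicts as built in the sketch above (all names hypothetical):

    import random

    def generate(props, length, start):
        # props[x][y] is the proportion of y after x
        letters, cur = [start], start
        for _ in range(length - 1):
            r, acc = random.random(), 0.0
            for y, w in props[cur].items():
                acc += w
                if r < acc:
                    cur = y
                    break
            else:
                cur = y   # guard against rounding of the cumulated weights
            letters.append(cur)
        return ''.join(letters)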
    


  5. Let's check the 1|1-proportions in the generated sequence s2.
    c2=compte.Compte()
    c2.add_seq(s2,2)
    p2=compte.Proportion()
    p2.read_Compte(c2.rstrip(),lprior=1,lpost=1)
    print p2
    


  6. We translate the proportion p into a Lexique.
    import lexique
    lx=lexique.Lexique()
    lx.read_prop(p)
    print lx
    
    The rather intricate syntax of Lexique is explained in the descriptors part, and its use is detailed in the corresponding tutorial part.

  7. We compute the mean log-likelihood of the Markov model built from p on both sequences.
    lx.prediction(s)/len(s)
    lx.prediction(s2)/len(s2)
    
    Both values are very close, as expected since s2 was generated from p.
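
    Conceptually, this mean log-likelihood is the average of log P(x[i] | x[i-1]) over the letters of the sequence. A minimal sketch, assuming the props dict of dicts from above (mean_loglik is a hypothetical helper):

    import math

    def mean_loglik(seq, props):
        # sum of log P(x[i] | x[i-1]), averaged over the scored letters
        ll = 0.0
        for i in range(1, len(seq)):
            ll += math.log(props[seq[i-1]][seq[i]])
        return ll / (len(seq) - 1)

    Note that this sketch simply skips the first letter; the handling of the first letter is exactly the issue addressed in the next step.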

  8. Be careful with the beginning of the sequence. In the next example, a -100000 penalty comes from the missing proportion for the first letter of the sequence; the last commands show the right way to avoid this problem.
    q=compte.Proportion()
    q.read_Compte(c.strip(),lprior=1,lpost=1)
    print q        # character ^ has been removed
    lx.read_prop(q)
    lx.prediction(s)
    lx.prediction(s,deb=1)   
    lx.prediction(s,deb=1)/(len(s)-1)
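
    For clarity, the log-likelihood of a first-order Markov model on a sequence x1..xn decomposes as

        log L = log P(x1) + sum over i=2..n of log P(xi | xi-1)

    When the proportion for the first letter is missing, the first term cannot be evaluated and prediction falls back on the -100000 penalty; deb=1 starts the sum at i=2 and drops that term, which is why the mean is then taken over len(s)-1 letters.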
    


  9. And with longer priors:
    p3=compte.Proportion()
    p3.read_Compte(c.rstrip(),lprior=2,lpost=1)
    print p3
    lx2=lexique.Lexique()
    lx2.read_prop(p3)
    lx2.prediction(s)/len(s)
    lx2.prediction(s2)/len(s2)
    lx2.prediction(s,deb=2)/(len(s)-2)
    lx2.prediction(s2,deb=2)/(len(s2)-2)
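
    With lprior=2 the model conditions each letter on the two preceding ones, and the mean log-likelihood sketch above adapts directly (props2 and mean_loglik2 are hypothetical names):

    import math

    def mean_loglik2(seq, props2):
        # props2[xy][z] is the proportion of z after the 2-letter word xy
        ll = 0.0
        for i in range(2, len(seq)):
            ll += math.log(props2[seq[i-2:i]][seq[i]])
        return ll / (len(seq) - 2)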
    
    The two log-likelihoods are further apart than before, since s2 was generated from the 1|1-proportions only and does not reproduce the 2-letter dependencies of s.
