
1  Counting on a sequence



In this section, we compute the proportions of words in a sequence read from the file lambda.seq (which is in a specific format), randomly generate a new sequence from these proportions, and compute the likelihood of models on both sequences.
  1. First, we count the words of length 3 in the sequence.
    import sequence
    s=sequence.Sequence(fic="lambda.seq")
    import compte
    c=compte.Compte()
    c.add_seq(s,3)
    print c
    print c.pref(2)
    print float(c['ag'])/c['a']   # rate of g after a
    


  2. Next, we compute the proportion of each letter given the preceding letter. We call these 1|1-proportions.
    p=compte.Proportion()
    p.read_Compte(c,lprior=1,lpost=1)
    print p
    
    Compare the value for the "word" A|G with the rate computed in step 1.

  3. The end symbol ^ is not relevant for sequence generation, so we remove it.
    p.read_Compte(c.rstrip(),lprior=1,lpost=1)
    print p
    


  4. We generate a sequence of 10000 letters from these 1|1-proportions.
    s2=sequence.Sequence()
    s2.read_prop(p,long=10000)
    print s2[:10]
    


  5. Let's check the 1|1-proportions in s2.
    c2=compte.Compte()
    c2.add_seq(s2,2)
    p2=compte.Proportion()
    p2.read_Compte(c2.rstrip(),lprior=1,lpost=1)
    print p2
    


  6. We translate the proportion p into a Lexique.
    import descripteur
    d=descripteur.Descripteur(1,prop=p)
    import lexique
    lx=lexique.Lexique()
    lx[1]=d
    print lx
    
    The rather intricate syntax of Lexique is explained in the part on descriptors, and its use is detailed in a dedicated part of the tutorial.

  7. We compute the mean log-likelihood of the Markov model defined by p on both sequences.
    print lx.prediction(s)/len(s)
    print lx.prediction(s2)/len(s2)
    
    The two values are very close.

  8. Be careful about the beginning of the sequence. In the next example, the first prediction incurs a -100000 penalty because the proportion for the beginning of the sequence is missing; the two commands after it show the right way to avoid this problem.
    q=compte.Proportion()
    q.read_Compte(c.strip(),lprior=1,lpost=1)
    print q        # character ^ has been removed
    d.read_prop(q)
    lx[1]=d
    print lx.prediction(s)
    print lx.prediction(s,deb=1)   
    print lx.prediction(s,deb=1)/(len(s)-1)
    


  9. The same computation, with longer priors:
    p3=compte.Proportion()
    p3.read_Compte(c.rstrip(),lprior=2,lpost=1)
    print p3
    d.read_prop(p3)
    lx2=lexique.Lexique()
    lx2[1]=d
    print lx2.prediction(s)/len(s)
    print lx2.prediction(s2)/len(s2)
    print lx2.prediction(s,deb=2)/(len(s)-2)
    print lx2.prediction(s2,deb=2)/(len(s2)-2)
    
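    The two log-likelihoods are not as close as before. This is expected: s2 was generated from the 1|1-proportions only, so it does not reproduce the dependencies on two preceding letters that p3 captures in s.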
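
To make the quantities in the steps above concrete, here is a minimal sketch in plain Python 3 that does not use the sequence, compte or lexique modules. All names in it (count_words, conditional_proportions, generate, mean_log_likelihood, toy) are hypothetical and chosen for illustration only, and the short toy string merely stands in for lambda.seq. Unlike the tutorial's c['ag']/c['a'], the proportions here are normalised over the letters that actually follow, which sidesteps the end symbol ^.

```python
import math
import random
from collections import Counter


def count_words(seq, k):
    """Count every overlapping word of length 1..k in seq."""
    counts = Counter()
    for n in range(1, k + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def conditional_proportions(counts, alphabet):
    """Return props[a][b]: proportion of b among the letters that follow a."""
    props = {}
    for a in alphabet:
        total = sum(counts[a + b] for b in alphabet)
        if total:
            # Keep only the transitions that were actually observed.
            props[a] = {b: counts[a + b] / total
                        for b in alphabet if counts[a + b]}
    return props


def generate(props, length, rng):
    """Draw a sequence letter by letter from the 1|1-proportions.

    Assumes every letter that can be drawn has its own row in props.
    """
    seq = [rng.choice(sorted(props))]  # arbitrary first letter
    while len(seq) < length:
        row = props[seq[-1]]
        letters = sorted(row)
        weights = [row[b] for b in letters]
        seq.append(rng.choices(letters, weights=weights)[0])
    return "".join(seq)


def mean_log_likelihood(props, seq):
    """Mean log-probability of each letter given the previous one.

    The first letter has no predecessor and is skipped.
    """
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        total += math.log(props[prev][cur])
    return total / (len(seq) - 1)


# Hypothetical toy data standing in for lambda.seq.
toy = "acgtagctagcatcgatcggatccgatgactgaacgtt"
counts = count_words(toy, 2)
props = conditional_proportions(counts, "acgt")

rng = random.Random(0)  # fixed seed so the run is reproducible
generated = generate(props, 10000, rng)
props2 = conditional_proportions(count_words(generated, 2), "acgt")

print(props)
print(props2)
print(mean_log_likelihood(props, toy))
print(mean_log_likelihood(props, generated))
```

Skipping the first letter in mean_log_likelihood is this sketch's way of handling the missing proportion at the beginning of a sequence; it is only an analogy with the deb argument used above, not a description of how Lexique.prediction works internally.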
