class Proportion

module compte

The documentation is here.

This class stores the proportions of words that follow given words. For example, it can record that the letters following the word AC occur in these proportions:

`A`	`0.34`
`C`	`0.15`
`G`	`0.23`
`T`	`0.28`

A distinction is therefore made between prior words (such as AC here) and posterior words (such as A, C, G and T here).

The special character ^ is used to represent beginnings and ends of sequences.
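As an illustration only, the prior/posterior relation can be pictured as a nested mapping. The sketch below is not the class's internal storage, which this page does not describe; the names `proportions` and `posteriors_of` are hypothetical.

```python
# Hypothetical layout (an assumption): prior word -> {posterior word: proportion}.
proportions = {
    "AC": {"A": 0.34, "C": 0.15, "G": 0.23, "T": 0.28},
}

def posteriors_of(table, prior):
    """Return [posterior, proportion] pairs for a prior, or [] if it is unknown."""
    return [[post, p] for post, p in table.get(prior, {}).items()]

# The proportions following AC sum to 1 (up to floating-point rounding).
total = sum(p for _, p in posteriors_of(proportions, "AC"))
```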

Construction

__init__

The optional keyword fic builds the object by reading from a file, given by name, in the specific format (see Input-Output).

read_nf

Builds from a file, given by name, in the specific format.

read_Compte

Builds from a Compte.

Optional keywords:

lpost=l: specifies the length of the posterior words, i.e. the words whose frequencies are computed. Default: the maximum length of the words;
lprior=l: specifies the length of the prior words, i.e. the words on which the computed words depend, in a Markovian context. Default: 0.

Since the special character ^ stands for the limits of the sequence, words ending with this symbol are counted as words of the given length or longer.

Handling

__getitem__: returns the string, in the format of Compte, of the posteriors corresponding to the given prior;
__iadd__: adds to self the proportions of another Proportion;
KL_MC: computes the KL-distance to another Proportion by Monte Carlo simulation over several (default: 100) Sequences of a given length (default: 1000), generated by the method read_prop of Sequence. See Sequence generation;
lg_max: returns the length of the longest word, prior or posterior;
alph: returns the list of the letters used in the counts;
next: returns a list of [posterior,proportion] for the specified prior;
has_prefix: returns True if the specified word is a valid prior. Remember that the character ^ stands for the beginning or end of a sequence.
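The idea behind a Monte Carlo KL estimate can be sketched as follows. This is a simplified, standalone illustration and not the code of KL_MC: it assumes order-1 models stored as `{prior: {letter: probability}}` with ^ as the start context, and it averages the log-likelihood difference per sequence. How KL_MC scores or normalizes sequences is not stated here; all function names are hypothetical.

```python
import math
import random

def sample(model, length, rng):
    """Draw a sequence of `length` letters from an order-1 model, starting at ^."""
    prev, out = "^", []
    for _ in range(length):
        letters, weights = zip(*model[prev].items())
        prev = rng.choices(letters, weights=weights, k=1)[0]
        out.append(prev)
    return "".join(out)

def log_likelihood(model, seq):
    """Sum of log-probabilities of each letter given the previous one."""
    prev, total = "^", 0.0
    for letter in seq:
        total += math.log(model[prev][letter])
        prev = letter
    return total

def kl_monte_carlo(p, q, n_seq=100, length=1000, seed=0):
    """Estimate KL(p || q) as the mean of log p(s) - log q(s) over s drawn from p."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_seq):
        seq = sample(p, length, rng)
        diffs.append(log_likelihood(p, seq) - log_likelihood(q, seq))
    return sum(diffs) / n_seq

# Two toy models over the alphabet {A, B} (made-up values for illustration).
p = {"^": {"A": 0.5, "B": 0.5}, "A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.5, "B": 0.5}}
q = {"^": {"A": 0.5, "B": 0.5}, "A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
estimate = kl_monte_carlo(p, q, n_seq=20, length=200)
```

The estimate is zero when both models are identical and grows with the sequence length, since the log-likelihood differences are summed over all positions.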

Input-Output

Specific format is:

Description: lines of prior|posterior and count, separated by a whitespace.

Example:

A|B 0.3
A|A 0.7
B|B 0.5
B|A 0.5
|A 0.1
|B 0.9
^|A 0.5
^|B 0.5

In this example, following an A, the proportion of B is 0.3 and the proportion of A is 0.7. The overall proportion of A is 0.1, and that of B is 0.9. The proportion of sequences beginning with A is 0.5, as is the proportion of sequences beginning with B.
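To make the format concrete, here is a minimal standalone parser sketch. It is not the library's read_nf; the function name `parse_proportions` and the nested-dictionary result are assumptions chosen for illustration. Lines that do not look like `prior|posterior value` are simply skipped.

```python
def parse_proportions(text):
    """Parse 'prior|posterior value' lines into {prior: {posterior: value}}."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only lines made of a 'prior|posterior' word and one value.
        if len(parts) != 2 or "|" not in parts[0]:
            continue
        prior, posterior = parts[0].split("|", 1)
        table.setdefault(prior, {})[posterior] = float(parts[1])
    return table

example = """\
A|B 0.3
A|A 0.7
B|B 0.5
B|A 0.5
|A 0.1
|B 0.9
^|A 0.5
^|B 0.5
"""

table = parse_proportions(example)
```

In the resulting mapping, the empty prior `""` holds the overall proportions and the prior `"^"` holds the proportions at the beginning of a sequence.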

__str__: outputs in specific format;
loglex: returns the corresponding Lexique. See read_prop in that class.

Sequence generation

From a Proportion, a (part of a) Sequence can be generated randomly, by the method read_prop.

The process is:

for all increasing positions i:
- get the longest word w ending in i-1 that is a valid prior (using method has_prefix);
- if there is a posterior corresponding to w, let lp be the list of corresponding pairs [posterior,proportion] (using method next); otherwise, lp is the uniform distribution over all letters of the Proportion;
- randomly choose a posterior p according to the factors in list lp (see below);
- if the first letter of p is a terminating character (^), put character null ('\0') at that position, and exit;
  otherwise put that letter at position i on the Sequence.

The random choice of the posterior is made with probabilities proportional to the respective proportions, even if these do not sum to 1. Sequence generation is therefore possible even with non-normalized proportions.
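The process above can be sketched in a few lines of standalone Python. This is an illustration under stated assumptions, not the code of read_prop: the table is a plain `{prior: {posterior: proportion}}` dictionary, the start of the sequence is represented by a leading ^ in the context, and generation simply stops on a terminating posterior instead of writing a null character. The names `longest_valid_prior` and `generate` are hypothetical.

```python
import random

def longest_valid_prior(context, table, max_len):
    """Return the longest suffix of context that is a key of table, or None."""
    for k in range(min(max_len, len(context)), -1, -1):
        suffix = context[len(context) - k:]
        if suffix in table:
            return suffix
    return None

def generate(table, length, rng=random):
    """Generate up to `length` letters from a {prior: {posterior: weight}} table."""
    alphabet = sorted({post[0] for posts in table.values()
                       for post in posts if post and post[0] != "^"})
    max_len = max(len(prior) for prior in table)
    context = "^"  # assumption: the sequence start is represented by ^
    out = []
    for _ in range(length):
        prior = longest_valid_prior(context, table, max_len)
        if prior is not None and table[prior]:
            pairs = list(table[prior].items())
        else:
            # No usable prior: fall back to a uniform choice over all letters.
            pairs = [(letter, 1.0) for letter in alphabet]
        posteriors, weights = zip(*pairs)
        # Weights need not sum to 1; choices() draws proportionally to them.
        chosen = rng.choices(posteriors, weights=weights, k=1)[0]
        if chosen[0] == "^":
            break  # terminating character: stop generating
        out.append(chosen[0])
        context += chosen[0]
    return "".join(out)

table = {
    "A": {"B": 0.3, "A": 0.7},
    "B": {"B": 0.5, "A": 0.5},
    "": {"A": 0.1, "B": 0.9},
    "^": {"A": 0.5, "B": 0.5},
}
random.seed(0)
sequence = generate(table, 20)
```

Because `random.choices` accepts arbitrary non-negative weights, a table such as `{"": {"A": 2.0, "B": 0.0}}` is also usable and always yields A.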