descriptors and predictions

Simple descriptors

A simple descriptor can be seen as a function applied to a position in the data and its vicinity, returning a floating-point value.

At a given position in the data, each letter has a value:

in a Sequence, the value of the existing letter is 1, the value of other letters is 0;
in a Matrice, the value of an existing letter is the value in the data, the value of the other letters is 0.

A floating-point value written between parentheses after a descriptor multiplies the prediction of this descriptor by that value. For example, on a Sequence, descriptor A returns 1 on A and 0 elsewhere, whereas descriptor A(0.7) returns 0.7 on A and 0 elsewhere.

Operators use prefix notation.
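As a rough illustration of the two rules above, the following Python sketch computes letter values and weighted predictions. The names `letter_value` and `weighted`, the 0-based index `i`, the list-of-dicts stand-in for a Matrice, and the choice to return 0 outside the data are all assumptions made for this example; none of them comes from the library itself.

```python
def letter_value(data, i, ch):
    """Value of letter `ch` at 0-based index `i` of `data`.

    `data` is a str (standing in for a Sequence) or a list of dicts mapping
    letters to floats (an assumed stand-in for a Matrice).
    """
    if not 0 <= i < len(data):
        return 0.0  # assumption: a letter is worth 0 outside the data
    site = data[i]
    if isinstance(site, str):
        return 1.0 if site == ch else 0.0
    return float(site.get(ch, 0.0))


def weighted(ch, weight=1.0):
    """Descriptor `ch(weight)`: the letter value multiplied by `weight`."""
    return lambda data, i: weight * letter_value(data, i, ch)


sequence = "ACB"
matrice = [{"A": 0.2, "C": 0.8}, {"B": 1.0}]

print(weighted("A")(sequence, 0))       # 1.0
print(weighted("A", 0.7)(sequence, 0))  # 0.7
print(weighted("A", 0.7)(sequence, 1))  # 0.0
print(weighted("C", 0.5)(matrice, 0))   # 0.4
```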

The accepted descriptors are:

letters

for letters between a and z and between A and Z, returns the value of the corresponding letter in the data.

letter ::= "a"..."z"|"A"..."Z"

special characters

!	returns 1 in any position (even if out of bounds);
^	returns 1 if the position is out of bounds, 0 otherwise.

special ::= "^" | "!"
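The two special descriptors can be pictured as the Python functions below. This is only a sketch: the function names and the 0-based index `i` are assumptions of the example.

```python
def always(data, i):
    """Descriptor `!`: worth 1 at every index, inside the data or not."""
    return 1.0


def out_of_bounds(data, i):
    """Descriptor `^`: worth 1 only when index `i` lies outside the data."""
    return 0.0 if 0 <= i < len(data) else 1.0


data = "ACB"
print([always(data, i) for i in (-1, 0, 3)])         # [1.0, 1.0, 1.0]
print([out_of_bounds(data, i) for i in (-1, 0, 3)])  # [1.0, 0.0, 1.0]
```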

character codes

for numbers between 0 and 255 inclusive, designates the character with that code. Character codes of letters are output as letters.

Beware: as the codes of the special characters ! and ^ are 33 and 94, these two codes must be used with great caution.

character ::= #0..255
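The clash mentioned above is easy to check with Python's built-in `chr`, whose code points coincide with the usual character codes in the range 0 to 255. The helper name `symbol_for_code` is hypothetical and only serves this illustration.

```python
def symbol_for_code(code):
    """Return the character designated by a code in the range 0..255."""
    if not 0 <= code <= 255:
        raise ValueError("character codes run from 0 to 255")
    return chr(code)


print(symbol_for_code(65))  # A
print(symbol_for_code(33))  # !  (same symbol as the special descriptor)
print(symbol_for_code(94))  # ^  (same symbol as the special descriptor)
```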

here-plus

returns the sum of the predictions of the descriptors between the parentheses, at this position; for example +(ABC).

here-plus ::= +(descriptors)

here-mult

returns the product of the predictions of the descriptors between the parentheses, at this position; for example *(ABC).

here-mult ::= *(descriptors)
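A minimal Python sketch of here-plus and here-mult follows. Descriptors are modelled as functions of the data and a 0-based index; `letter`, `here_plus` and `here_mult` are names invented for the example, and letters are assumed to be worth 0 outside the data.

```python
from math import prod


def letter(ch, weight=1.0):
    """Letter descriptor `ch(weight)` on a str standing in for a Sequence."""
    def d(seq, i):
        return weight if 0 <= i < len(seq) and seq[i] == ch else 0.0
    return d


def here_plus(*descriptors):
    """+(...): sum of the predictions of all descriptors at the same index."""
    return lambda seq, i: sum(d(seq, i) for d in descriptors)


def here_mult(*descriptors):
    """*(...): product of the predictions of all descriptors at the same index."""
    return lambda seq, i: prod(d(seq, i) for d in descriptors)


abc = [letter("A"), letter("B"), letter("C")]
print(here_plus(*abc)("ACB", 0))  # 1.0
print(here_mult(*abc)("ACB", 0))  # 0.0
print(here_mult(letter("A", 0.5), letter("A", 4.0))("ACB", 0))  # 2.0
```

On a Sequence, *(ABC) is always 0, since at most one of the three letters is present at a given position; the product is more useful with weighted descriptors or on a Matrice.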

here-or

forward

returns the prediction at the current position of the first descriptor between the quotes if the predictions of the next descriptors on the following positions are all positive.
For example, on position 0 of Sequence ACBS, prediction of
`A(0.5)CB(0.3)' returns 0.5.

forward ::= `descriptors'
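The forward operator can be sketched in Python as below. Position 0 of the example is taken to be the first letter, as with Python's 0-based indexing; returning 0 when one of the tests fails is an assumption, since the text only specifies the successful case. The names `letter` and `forward` are invented for the illustration.

```python
def letter(ch, weight=1.0):
    """Letter descriptor `ch(weight)`; worth 0 outside the data (assumed)."""
    def d(seq, i):
        return weight if 0 <= i < len(seq) and seq[i] == ch else 0.0
    return d


def forward(*descriptors):
    """First descriptor at index i, provided each later one is positive
    on its own index (i + 1, i + 2, ...)."""
    first, rest = descriptors[0], descriptors[1:]

    def d(seq, i):
        if all(nxt(seq, i + k) > 0 for k, nxt in enumerate(rest, start=1)):
            return first(seq, i)
        return 0.0  # assumption: no prediction when a test fails
    return d


pattern = forward(letter("A", 0.5), letter("C"), letter("B", 0.3))
print(pattern("ACBS", 0))  # 0.5
print(pattern("ACSB", 0))  # 0.0
```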

forward-or

returns the prediction on the current position of a descriptor (the computing descriptor) selected by whether the prediction of another descriptor (the testing descriptor) on the next position is positive. Each pair is written in the order testing descriptor, computing descriptor. Between the quotes, the tests are made from left to right on the odd-numbered descriptors, and stop at the first positive test.
For example, on position 2 of Sequence ACB, prediction of
|`BC(0.1)CC(0.2)AC(0.3)' returns 0.1.
For example, on position 2 of Sequence ACB, prediction of
|`BC(-0.1)BC(0.2)' returns -0.1.

When the computing descriptor is itself an "or"-operator (here-or, backward-or, or forward-or), the current position for the tests inside this computing descriptor is the following position. Its paired computing descriptors are nevertheless applied at the actual current position.
For example, on position 1 of Sequence CAB, prediction of
|`B|`BC(0.1)CC(0.2)AC(0.3)'C|`BC(0.4)CC(0.5)AC(0.6)'A|`BC(0.7)CC(0.8)AC(0.9)'' returns 0.7.

forward-or ::= |`descriptorsdescriptors'
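The three forward-or examples above can be reproduced with the Python sketch below. It rests on several assumptions: positions in the examples are read as 1-based (so position 2 becomes index 1), a nested "or"-operator runs its tests one position further forward while still computing at the original index, and 0 is returned when no test is positive. The `anchor` argument and the function names are devices of this sketch only.

```python
def letter(ch, weight=1.0):
    """Letter descriptor `ch(weight)`; `anchor` is accepted and ignored."""
    def d(seq, i, anchor=None):
        return weight if 0 <= i < len(seq) and seq[i] == ch else 0.0
    return d


def forward_or(*pairs):
    """pairs = (testing descriptor, computing descriptor), ...

    Tests look at the index after `anchor`; the chosen computing
    descriptor is evaluated at the original index `i`.
    """
    def d(seq, i, anchor=None):
        anchor = i if anchor is None else anchor
        for test, compute in pairs:
            if test(seq, anchor + 1) > 0:
                # a nested or-operator tests from the shifted anchor
                return compute(seq, i, anchor + 1)
        return 0.0  # assumption: no positive test gives 0
    return d


A, B, C = letter("A"), letter("B"), letter("C")

d1 = forward_or((B, letter("C", 0.1)), (C, letter("C", 0.2)), (A, letter("C", 0.3)))
print(d1("ACB", 1))  # 0.1

d2 = forward_or((B, letter("C", -0.1)), (B, letter("C", 0.2)))
print(d2("ACB", 1))  # -0.1


def inner(x, y, z):
    return forward_or((B, letter("C", x)), (C, letter("C", y)), (A, letter("C", z)))


nested = forward_or((B, inner(0.1, 0.2, 0.3)),
                    (C, inner(0.4, 0.5, 0.6)),
                    (A, inner(0.7, 0.8, 0.9)))
print(nested("CAB", 0))  # 0.7
```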

backward

returns the prediction at the current position of the last descriptor between the brackets if the predictions of the previous descriptors on the preceding positions are all positive.
For example, on position 3 of Sequence ACBS, prediction of
{A(0.5)CB(0.3)} returns 0.3.

backward ::= {descriptors}
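A possible Python reading of the backward operator is sketched below. The position in the example is read as 1-based (position 3 becomes index 2), and 0 is returned when one of the tests fails; both points are assumptions, as are the names `letter` and `backward`.

```python
def letter(ch, weight=1.0):
    """Letter descriptor `ch(weight)`; worth 0 outside the data (assumed)."""
    def d(seq, i):
        return weight if 0 <= i < len(seq) and seq[i] == ch else 0.0
    return d


def backward(*descriptors):
    """Last descriptor at index i, provided each earlier one is positive
    on its own index (..., i - 2, i - 1)."""
    n = len(descriptors)

    def d(seq, i):
        # descriptors[k] sits (n - 1 - k) positions before index i
        if all(descriptors[k](seq, i - (n - 1 - k)) > 0 for k in range(n - 1)):
            return descriptors[-1](seq, i)
        return 0.0  # assumption: no prediction when a test fails
    return d


pattern = backward(letter("A", 0.5), letter("C"), letter("B", 0.3))
print(pattern("ACBS", 2))  # 0.3
print(pattern("SCBS", 2))  # 0.0
```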

backward-plus

returns the sum of the predictions of the descriptors between the brackets, the last descriptor being applied on the current position, the preceding one on the position before, and so on.
For example, on position 4 of Sequence DABC, prediction of
+{A(0.5)B(-0.2)C(1.8)} returns 2.1.

backward-plus ::= +{descriptors}
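The arithmetic of the backward-plus example can be checked with the following Python sketch, which reads position 4 as 1-based (index 3) and uses invented names `letter` and `backward_plus`. The result is rounded before printing because sums of binary floating-point numbers are not always exact.

```python
def letter(ch, weight=1.0):
    """Letter descriptor `ch(weight)`; worth 0 outside the data (assumed)."""
    def d(seq, i):
        return weight if 0 <= i < len(seq) and seq[i] == ch else 0.0
    return d


def backward_plus(*descriptors):
    """Sum of the predictions, the last descriptor sitting on index i,
    the one before it on i - 1, and so on."""
    n = len(descriptors)
    return lambda seq, i: sum(
        descriptors[k](seq, i - (n - 1 - k)) for k in range(n)
    )


total = backward_plus(letter("A", 0.5), letter("B", -0.2), letter("C", 1.8))("DABC", 3)
print(round(total, 10))  # 2.1
```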

backward-or

returns the prediction on the current position of a descriptor (the computing descriptor) selected by whether the prediction of another descriptor (the testing descriptor) on the preceding position is positive. Each pair is written in the order testing descriptor, computing descriptor. Between the brackets, the tests are made from left to right on the odd-numbered descriptors, and stop at the first positive test.
For example, on position 3 of Sequence ABC, prediction of
|{BC(0.1)CC(0.2)AC(0.3)} returns 0.1.
For example, on position 3 of Sequence ABC, prediction of
|{BC(-0.1)BC(0.2)} returns -0.1.

When the computing descriptor is itself an "or"-operator (here-or, backward-or, or forward-or), the current position for the tests inside this computing descriptor is the preceding position. Its paired computing descriptors are nevertheless applied at the actual current position.
For example, on position 3 of Sequence ABC, prediction of
|{B|{BC(0.1)CC(0.2)AC(0.3)}C|{BC(0.4)CC(0.5)AC(0.6)}A|{BC(0.7)CC(0.8)AC(0.9)}} returns 0.3.

backward-or ::= |{descriptorsdescriptors}
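The backward-or examples above, including the nested one, can be reproduced with the Python sketch below. As assumptions of the sketch, positions are read as 1-based (position 3 becomes index 2), a nested "or"-operator runs its tests one position further back while still computing at the original index, and 0 is returned when no test is positive. The `anchor` argument and all function names are hypothetical.

```python
def letter(ch, weight=1.0):
    """Letter descriptor `ch(weight)`; `anchor` is accepted and ignored."""
    def d(seq, i, anchor=None):
        return weight if 0 <= i < len(seq) and seq[i] == ch else 0.0
    return d


def backward_or(*pairs):
    """pairs = (testing descriptor, computing descriptor), ...

    Tests look at the index before `anchor`; the chosen computing
    descriptor is evaluated at the original index `i`.
    """
    def d(seq, i, anchor=None):
        anchor = i if anchor is None else anchor
        for test, compute in pairs:
            if test(seq, anchor - 1) > 0:
                # a nested or-operator tests from the shifted anchor
                return compute(seq, i, anchor - 1)
        return 0.0  # assumption: no positive test gives 0
    return d


A, B, C = letter("A"), letter("B"), letter("C")

d1 = backward_or((B, letter("C", 0.1)), (C, letter("C", 0.2)), (A, letter("C", 0.3)))
print(d1("ABC", 2))  # 0.1

d2 = backward_or((B, letter("C", -0.1)), (B, letter("C", 0.2)))
print(d2("ABC", 2))  # -0.1


def inner(x, y, z):
    return backward_or((B, letter("C", x)), (C, letter("C", y)), (A, letter("C", z)))


nested = backward_or((B, inner(0.1, 0.2, 0.3)),
                     (C, inner(0.4, 0.5, 0.6)),
                     (A, inner(0.7, 0.8, 0.9)))
print(nested("ABC", 2))  # 0.3
```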

NB: these descriptors were built for specific needs (such as the translation of Markovian transition probabilities) but, owing to the C++ implementation, it is very easy to design new ones if necessary.

Descriptor patterns

A pattern of descriptors is used in the context of maximum predictive partitioning. It is a word made of successive simple descriptors, applied periodically to compute predictions on the data. The period starts with the first descriptor on the first position.

For example, since the prediction on the data is the sum of the predictions on all of its positions, the prediction on sequence ACBCAB of descriptor pattern
AC is 4,
and prediction of descriptor pattern
CA is 0.

Prediction

At a position, the prediction is the value returned by the descriptor in use.

On the data as a whole, the prediction of a simple descriptor is the sum of its predictions on all of the positions of the data.

In the case of a descriptor pattern, the descriptors are used periodically, starting with the first descriptor of the pattern at the first position.
For example, on sequence ACBCAB the prediction of descriptor pattern
AC is 4,
and prediction of descriptor pattern
CA is 0.
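The two totals above can be recomputed with a short Python sketch. The names `letter` and `pattern_prediction` are invented for the example; a simple descriptor is treated as a pattern of length one.

```python
def letter(ch, weight=1.0):
    """Letter descriptor `ch(weight)` on a str standing in for a Sequence."""
    def d(seq, i):
        return weight if 0 <= i < len(seq) and seq[i] == ch else 0.0
    return d


def pattern_prediction(seq, pattern):
    """Sum over all positions, cycling through `pattern` from the first
    descriptor at the first position."""
    return sum(pattern[i % len(pattern)](seq, i) for i in range(len(seq)))


A, C = letter("A"), letter("C")
print(pattern_prediction("ACBCAB", [A, C]))  # 4.0
print(pattern_prediction("ACBCAB", [C, A]))  # 0.0
print(pattern_prediction("ACBCAB", [C]))     # 2.0
```

With pattern AC, descriptor A meets the letters A, B, A at the odd positions (two hits) and descriptor C meets C, C, B at the even positions (two hits), which gives 4. With pattern CA, no descriptor ever meets its own letter.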

Inside a Lexique, when there are transition costs between descriptors, these costs are used in an HMM context, i.e. in the methods fb, backward, forward, and viterbi. In that case, the costs are added to the prediction at each transition between descriptors.
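To make the role of transition costs concrete, the Python sketch below scores one fixed path through the data. It is not the library's fb or viterbi code: the function `path_score`, the dictionary layout for descriptors and costs, and the convention that a missing pair (including staying on the same descriptor) costs 0 are all assumptions made for the illustration.

```python
def letter(ch, weight=1.0):
    """Letter descriptor `ch(weight)` on a str standing in for a Sequence."""
    def d(seq, i):
        return weight if 0 <= i < len(seq) and seq[i] == ch else 0.0
    return d


def path_score(seq, descriptors, path, costs):
    """Prediction along `path`, where path[i] names the descriptor used at
    index i, plus the cost of every transition between consecutive names."""
    score = sum(descriptors[name](seq, i) for i, name in enumerate(path))
    # assumption: pairs absent from `costs` add nothing
    score += sum(costs.get((a, b), 0.0) for a, b in zip(path, path[1:]))
    return score


descriptors = {"a": letter("A"), "c": letter("C")}
costs = {("a", "c"): -0.5}

print(path_score("AAC", descriptors, ["a", "a", "c"], costs))  # 2.5
print(path_score("AAC", descriptors, ["a", "a", "a"], costs))  # 2.0
```

A Viterbi-style method would search for the path that maximises this quantity instead of scoring a path given in advance.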