
class Sequence

module sequence

The documentation is here.

This class stores sequence data, such as genomic or protein sequences. A Sequence is a succession of letters, such as

acagcaggcatagacaggatacagatttta.

Positions in a Sequence are numbered from 0 to length-1.

A Sequence can have a name, which is written on the first line in fasta format, after the ">".

Construction

__init__
Optional keyword fic allows construction by reading a file.

Sequence is implemented as an array in C++. On construction, the memory for a Sequence must be allocated by one of the following methods:

generate
generates an empty Sequence with a given length;
read_nf
reads a file in specific or FASTA format; the format is recognized from the filename: a file in specific format must end with .seq, and a file in FASTA format must end with .fa or .fst;
read_prop
builds or randomly changes the Sequence from a Proportion; see Sequence generation.

Optional keywords:

deb=d
changes only the positions from d (>= 0) onward, d included;
fin=f
changes only the positions up to f (< len()), f included;
long=lg
creates a new Sequence of length lg. In that case, deb and fin are ignored;
read_Lprop
builds randomly from a Lproportion and returns the resulting Partition; see Sequence generation.

Optional keywords:

deb=d
changes only the positions from d (>= 0) onward, d included;
fin=f
changes only the positions up to f (< len()), f included;
long=lg
creates a new Sequence of length lg. In that case, deb and fin are ignored;
etat_init=e
starts the generation with descriptor number e if it is valid; otherwise, starts with a random descriptor of the Lproportion;
read_Part
builds randomly from a Partition and a Lproportion; each Segment of the Partition must have descriptor numbers, and each number must be the number of a Proportion of the Lproportion. At each position, the Proportion corresponding to the number is used to randomly generate a letter, as for a Sequence generation;
copy
makes a deep copy of another Sequence.
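
The extension rule given for read_nf above can be sketched in plain Python. This is only an illustration of the rule as stated, not the library's code: the function name guess_format and the returned labels are hypothetical, and the match is assumed to be case-sensitive.

```python
import os


def guess_format(filename):
    """Return a format label from the filename extension (hypothetical helper)."""
    ext = os.path.splitext(filename)[1]
    if ext == ".seq":
        # Specific format files must end with .seq
        return "specific"
    if ext in (".fa", ".fst"):
        # FASTA files must end with .fa or .fst
        return "fasta"
    raise ValueError("unrecognized extension: %r" % ext)
```

With this sketch, guess_format("toto.fa") gives "fasta", while a name such as "toto.txt" is rejected.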

Handling

__len__
returns the length of the Sequence;
__getitem__ and __getslice__
are implemented, to get characters and sub-sequences respectively.
Beware: __getslice__ does NOT copy the letters. The returned sub-sequence is only a shallow copy that shares its data with the original Sequence, hence it must be used with care;
__setitem__
changes a letter in the Sequence;
__setslice__
replaces a segment of the Sequence with the letters of a string or of another Sequence. BEWARE: if the inserted part has the same length as the replaced segment, the Sequence is modified in place; otherwise a new Sequence is built. Hence the behaviour can differ when the replacement is made inside a sub-sequence. See the example below.

For example:

> import sequence
> s=sequence.Sequence(fic="toto.fa")
> len(s)
10
> b=s[3:5]
> print b
3 ACG
> b[2]='A'
> print s[3:5]
3 ACA
> b[:2]="TT"
> print s[3:5] # b and s are still linked
3 TTA
> b[:2]="ACG"
> print b
4 ACGA
> print s[3:5] # b and s are no longer linked
3 TTA
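
The linking behaviour shown in this session can be modelled with a small toy class in pure Python. This is a sketch under stated assumptions, not the library's C++ implementation: the class SharedSeq and its methods sub and replace are hypothetical names, and the index ranges here are half-open, as in ordinary Python slices.

```python
class SharedSeq:
    """Toy model of a window onto a shared buffer of letters (hypothetical)."""

    def __init__(self, letters, start=0, stop=None):
        # A string gets its own private buffer; a list is shared as-is.
        self._buf = list(letters) if isinstance(letters, str) else letters
        self._start = start
        self._stop = len(self._buf) if stop is None else stop

    def __len__(self):
        return self._stop - self._start

    def __str__(self):
        return "".join(self._buf[self._start:self._stop])

    def sub(self, i, j):
        # Shallow sub-sequence: same buffer, narrower window.
        return SharedSeq(self._buf, self._start + i, self._start + j)

    def replace(self, i, j, new):
        new = list(new)
        if len(new) == j - i:
            # Same length: write through the shared buffer, in place.
            self._buf[self._start + i:self._start + j] = new
        else:
            # Different length: rebuild a private buffer, which breaks the link.
            own = self._buf[self._start:self._stop]
            own[i:j] = new
            self._buf, self._start, self._stop = own, 0, len(own)


s = SharedSeq("AAAACGAAAA")
b = s.sub(3, 6)          # "ACG", sharing letters with s
b.replace(2, 3, "A")     # same length: s is modified too
b.replace(0, 2, "TT")    # same length: b and s are still linked
b.replace(0, 2, "ACG")   # longer: b gets its own buffer
```

After the last call, str(b) is "ACGA" while str(s.sub(3, 6)) is still "TTA", which mirrors the end of the session above.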
alpha
returns the list of different letters in the Sequence;
shuffle
randomly shuffles the Sequence by (len*(log(len)+1)/2) random transpositions;
g_name
sets the name;
name
returns the name;
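
The shuffling scheme described for shuffle can be sketched as follows. This is not the library's code: the function name is hypothetical, and the sketch assumes a natural logarithm, a count truncated to an integer, and two positions drawn uniformly and independently for each transposition.

```python
import math
import random


def shuffle_by_transpositions(letters, rng=random):
    """Shuffle a string with len*(log(len)+1)/2 random swaps (hypothetical helper)."""
    seq = list(letters)
    n = len(seq)
    if n < 2:
        # Nothing to swap in an empty or one-letter sequence.
        return "".join(seq)
    # Assumption: natural log, truncated to an integer number of swaps.
    count = int(n * (math.log(n) + 1) / 2)
    for _ in range(count):
        i = rng.randrange(n)
        j = rng.randrange(n)
        seq[i], seq[j] = seq[j], seq[i]
    return "".join(seq)
```

Transpositions only move letters around, so the result always has the same length and the same letter counts as the input.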

Input-Output

Specific format is:

description
length of the sequence
sequence, with any spaces and
           line breaks as wanted
example
20
ACGGGAAGCTAA
AGCTGCG T
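
A reader for this layout could look like the sketch below. It is an illustration only, not the library's parser: the function name parse_specific is hypothetical, the first line is assumed to be the description, and checking the declared length against the letters actually read is an added assumption.

```python
def parse_specific(text):
    """Parse the specific format: description, length, then letters (hypothetical helper)."""
    lines = text.splitlines()
    description = lines[0].strip()
    length = int(lines[1].strip())
    # Spaces and line breaks inside the sequence are not significant: drop them.
    letters = "".join("".join(lines[2:]).split())
    if len(letters) != length:
        raise ValueError("declared length %d, found %d letters" % (length, len(letters)))
    return description, letters


desc, letters = parse_specific("example\n20\nACGGGAAGCTAA\nAGCTGCG T\n")
```

On the sample above, desc is "example" and letters is the 20-letter string "ACGGGAAGCTAAAGCTGCGT".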


__str__
outputs in specific format;
fasta
outputs in fasta format. The name of the Sequence is written on the first line, after the ">";

Optional keyword:

lg=d
sets the length of the lines (default: 80). If 0, returns the sequence on a single line;
seq
outputs only the letters of the sequence, as a string.
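
The effect of the lg keyword on fasta output can be sketched in pure Python. This is a minimal illustration, not the library's method: the function to_fasta is a hypothetical name, and the trailing newline is an assumption.

```python
def to_fasta(name, letters, lg=80):
    """Format a name and its letters as FASTA text (hypothetical helper)."""
    header = ">" + name
    if lg == 0:
        # lg equal to 0: the whole sequence goes on a single line.
        return header + "\n" + letters + "\n"
    # Otherwise cut the letters into lines of at most lg characters.
    chunks = [letters[i:i + lg] for i in range(0, len(letters), lg)]
    return header + "\n" + "\n".join(chunks) + "\n"
```

For instance, to_fasta("toto", "ACGTACGTAC", lg=4) yields the header line ">toto" followed by the lines ACGT, ACGT and AC.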
