The sequences in EMGLib are taken from the genome division of GenBank, except the one from B. subtilis, which is taken from the NRSub database. We make many corrections and additions to the original GenBank genome entries; these modifications are summarized here.
LOCUS field).
The new names follow the format xxxxxCG, where
xxxxx stands for an abbreviation of the systematic name of
the organism, based on the reference established in
SWISS-PROT
(e.g., BACSUCG
is the name of the B. subtilis genome entry). When an organism has more than one
chromosome, we replace CG with Cn,
where n is the number of the chromosome. We also replace the GenBank
accession numbers (ACCESSION field) with our own, which follow
the format CGXXXX
(e.g., CG0001).
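The naming rules above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the zero-padding of accession numbers to four digits is an assumption inferred from the CG0001 example.

```python
def entry_name(organism_code, chromosome=None):
    """Build an entry name of the form xxxxxCG or xxxxxCn.

    organism_code: five-character organism abbreviation (e.g. "BACSU").
    chromosome: None for a single-chromosome organism, else the chromosome number.
    """
    if len(organism_code) != 5:
        raise ValueError("organism code must have exactly five characters")
    suffix = "CG" if chromosome is None else f"C{chromosome}"
    return organism_code.upper() + suffix


def accession_number(serial):
    """Build an accession number of the form CGXXXX (assumed: four digits, zero-padded)."""
    if not 0 <= serial <= 9999:
        raise ValueError("serial must fit in four digits")
    return f"CG{serial:04d}"


print(entry_name("BACSU"))       # name of the B. subtilis genome entry
print(entry_name("ABCDE", 2))    # hypothetical organism, chromosome 2
print(accession_number(1))
```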
/strand qualifier. To predict the location of the
replication origin and terminus, we use the method of
Lobry (1996), based on the existence of asymmetric
substitution patterns between the two strands of the
chromosome in eubacteria.
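Strand asymmetry of this kind is often visualized through the GC skew, the excess of G over C along one strand. The sketch below is not the exact procedure used for EMGLib: it uses the cumulative GC skew, a common simplification, and it assumes that the leading strand is enriched in G, so that the cumulative curve is lowest near the origin and highest near the terminus. The function names are hypothetical.

```python
def cumulative_gc_skew(sequence):
    """Return the running total of (G count - C count) after each base."""
    total = 0
    curve = []
    for base in sequence.upper():
        if base == "G":
            total += 1
        elif base == "C":
            total -= 1
        curve.append(total)
    return curve


def predict_origin_terminus(sequence):
    """Return (origin_index, terminus_index) as 0-based positions.

    Assumption: the minimum of the cumulative skew marks the origin
    and the maximum marks the terminus (G-rich leading strand).
    """
    curve = cumulative_gc_skew(sequence)
    if not curve:
        raise ValueError("empty sequence")
    positions = range(len(curve))
    origin = min(positions, key=curve.__getitem__)
    terminus = max(positions, key=curve.__getitem__)
    return origin, terminus


# Toy sequence: the skew changes sign after base 30 and again after base 80.
toy = "C" * 30 + "G" * 50 + "C" * 20
print(predict_origin_terminus(toy))
```

On a real genome the curve is noisy, so the skew is usually computed over windows and the predicted positions are approximate.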
Data on codon usage bias are introduced through the
Codon Adaptation Index (CAI). Although CAI reference tables had already been published
for some of the organisms in EMGLib, we decided to establish our
own tables. The values computed for each CDS are
added under a /CAI qualifier.
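As a reminder of how a CAI value is obtained, here is a minimal sketch following the usual definition: each codon receives a weight equal to its count in a reference set divided by the count of the most frequent synonymous codon, and the CAI of a CDS is the geometric mean of the weights of its codons. The function names, the use of the standard genetic code, and the 0.5 pseudocount for codons absent from the reference set are assumptions of this sketch, not details of the EMGLib tables.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(cds):
    """Split a CDS into complete codons, ignoring a trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def relative_adaptiveness(reference_cds):
    """Compute codon weights from a list of reference CDS sequences.

    Met, Trp and stop codons are excluded, as they carry no usage bias.
    Assumption: unseen codons get a pseudocount of 0.5.
    """
    counts = Counter(c for cds in reference_cds for c in codons(cds) if c in CODE)
    weights = {}
    for aa in set(AMINO) - set("*MW"):
        family = [c for c, a in CODE.items() if a == aa]
        best = max(max(counts[c] for c in family), 0.5)
        for c in family:
            weights[c] = max(counts[c], 0.5) / best
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the codons of a CDS."""
    logs = [math.log(weights[c]) for c in codons(cds) if c in weights]
    if not logs:
        raise ValueError("no informative codon in sequence")
    return math.exp(sum(logs) / len(logs))


# Toy reference set: GCT is used three times as often as GCC for alanine.
w = relative_adaptiveness(["GCTGCTGCTGCC"])
print(cai("GCTGCT", w), cai("GCTGCC", w))
```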
Cross-references to other sequence databases (nucleotide or protein) are added
under a /db_xref qualifier. The content of the /product
qualifier is corrected or completed using data from SWISS-PROT. When an encoded
protein is an enzyme, we add its EC number, taken from the ENZYME database, under
an /EC_number qualifier. Finally, when a gene is known to
belong to a family defined in the HOBACGEN database, we add the accession
number of this family under a /gene_family qualifier.
/protein_bind, /promoter and /operon qualifiers.