EMGLib documentation


The sequences in EMGLib are taken from the genome division of GenBank, except for that of B. subtilis, which is taken from the NRSub database. We make many corrections and additions to the original GenBank genome entries; these modifications are summarized here.

Names and accession numbers

First, each genome is given a new identifier (LOCUS field). The new names follow the format xxxxxCG, where xxxxx is an abbreviation of the systematic name of the organism, based on the reference established in SWISS-PROT (e.g., BACSUCG is the name of the B. subtilis genome entry). When an organism has more than one chromosome, we replace CG with Cn, where n is the number of the chromosome. We also replace the GenBank accession numbers (ACCESSION field) with our own, which follow the format CGXXXX (e.g., CG0001).
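The naming scheme can be illustrated with a short Python sketch. The function names are hypothetical and are not part of EMGLib. The sketch assumes that the abbreviation is exactly five alphanumeric characters (as the xxxxx pattern suggests) and that the numeric part of the accession is zero-padded to four digits (as in CG0001).

```python
import re


def locus_name(organism_code, chromosome=None):
    """Build a LOCUS name such as BACSUCG (single chromosome) or xxxxxC2.

    organism_code: five-character abbreviation (assumed length).
    chromosome: chromosome number, or None for a single-chromosome genome.
    """
    code = organism_code.upper()
    if not re.fullmatch(r"[A-Z0-9]{5}", code):
        raise ValueError(f"expected a five-character code, got {organism_code!r}")
    suffix = "CG" if chromosome is None else f"C{chromosome}"
    return code + suffix


def accession(serial):
    """Build an accession number such as CG0001 (zero padding is assumed)."""
    return f"CG{serial:04d}"


print(locus_name("BACSU"))   # single-chromosome genome
print(accession(1))
```

With these assumptions, the first call produces the BACSUCG name quoted above and the second produces CG0001.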

CDS features

Features for coding DNA sequences (CDS) are supplemented with additional information. If the locations of the replication origin and terminus are known or can be predicted, we add the orientation of the CDS on the chromosome (leading or lagging) under a /strand qualifier. To predict the locations of the replication origin and terminus, we use the method of Lobry (1996), which relies on the asymmetric substitution patterns between the two strands of the chromosome in eubacteria.
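Once the origin and terminus are known, assigning a CDS to the leading or lagging strand is a matter of coordinates. The Python sketch below is only an illustration, not the EMGLib procedure. It assumes a single circular chromosome, a CDS that does not span the origin or terminus, and the usual convention that a CDS is "leading" when it is transcribed in the same direction as the replication fork moves. All function and parameter names are hypothetical.

```python
def in_forward_replichore(pos, origin, terminus, genome_length):
    """True if pos lies on the arc that runs from the origin to the terminus
    in the direction of increasing coordinates (wrapping around the end)."""
    span = (terminus - origin) % genome_length
    offset = (pos - origin) % genome_length
    return offset < span


def cds_orientation(start, end, strand, origin, terminus, genome_length):
    """Return 'leading' or 'lagging' for a CDS.

    strand is +1 for a CDS on the direct strand and -1 for a CDS on the
    complementary strand. The CDS midpoint decides which replichore it is in.
    """
    midpoint = (start + end) // 2
    forward = in_forward_replichore(midpoint, origin, terminus, genome_length)
    # On the forward replichore the fork moves toward increasing coordinates,
    # so direct-strand genes are co-directional with it; on the other
    # replichore the complementary-strand genes are.
    co_directional = (strand == 1) == forward
    return "leading" if co_directional else "lagging"


# Toy chromosome of 1000 bp with origin at 100 and terminus at 600.
print(cds_orientation(200, 300, +1, origin=100, terminus=600, genome_length=1000))
print(cds_orientation(800, 900, +1, origin=100, terminus=600, genome_length=1000))
```

Using the midpoint is a simplification; a CDS overlapping the origin or terminus would need special handling.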

Data on codon usage bias are provided through the Codon Adaptation Index (CAI). Although CAI reference tables have already been published for some of the organisms in EMGLib, we decided to establish our own tables. The value computed for each CDS is added under a /CAI qualifier.
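This documentation does not describe how the EMGLib reference tables were built, so the following Python sketch only illustrates the standard definition of the CAI (Sharp and Li, 1987): each codon receives a weight equal to its frequency in a reference set divided by the frequency of the most used synonymous codon, and the CAI of a CDS is the geometric mean of the weights of its codons. Met, Trp and stop codons are excluded. The function names, the use of the standard genetic code, and the pseudocount of 0.5 for codons absent from the reference set are assumptions of this sketch.

```python
from collections import Counter
from itertools import product
from math import exp, log

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codons(seq):
    """Split a sequence into complete codons, ignoring a trailing partial one."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_cds):
    """Compute codon weights from a list of reference coding sequences."""
    counts = Counter(c for seq in reference_cds for c in codons(seq)
                     if c in CODON_TABLE)
    weights = {}
    for aa in set(CODON_TABLE.values()) - {"*", "M", "W"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        # Assumed pseudocount: unobserved codons count as 0.5 to avoid log(0).
        family_counts = {c: counts[c] or 0.5 for c in family}
        top = max(family_counts.values())
        for c in family:
            weights[c] = family_counts[c] / top
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the scored codons in seq."""
    logs = [log(weights[c]) for c in codons(seq) if c in weights]
    if not logs:
        raise ValueError("sequence contains no codon that can be scored")
    return exp(sum(logs) / len(logs))


# Toy reference set in which GCT is the preferred alanine codon.
w = relative_adaptiveness(["GCTGCTGCTGCC"])
print(round(cai("GCTGCC", w), 3))
```

In practice the reference set would be a collection of highly expressed genes from the organism, which is the usual choice but is not confirmed here for EMGLib.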

Cross-references to other sequence databases (nucleotide or protein) are added under a /db_xref qualifier. The content of the /product qualifier is corrected or completed using data from SWISS-PROT. When an encoded protein is an enzyme, we add its EC number, taken from the ENZYME database, under a /EC_number qualifier. Finally, when a gene is known to belong to a family defined in the HOBACGEN database, we add the accession number of this family under a /gene_family qualifier.
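The added qualifiers can be read back from an entry with a few lines of Python. The sketch below is a minimal illustration that assumes GenBank-style feature lines with one qualifier per line; it does not handle values that continue over several lines. The function name is hypothetical, and every value in the example feature is a made-up placeholder, not real EMGLib data.

```python
import re

# Matches lines such as   /strand="leading"   or   /CAI=0.512
QUALIFIER_RE = re.compile(r'^\s*/(\w+)=(?:"([^"]*)"|(\S+))\s*$')


def parse_qualifiers(feature_lines):
    """Collect qualifier values by name; a qualifier may occur several times."""
    qualifiers = {}
    for line in feature_lines:
        match = QUALIFIER_RE.match(line)
        if match:
            name = match.group(1)
            quoted, bare = match.group(2), match.group(3)
            value = quoted if quoted is not None else bare
            qualifiers.setdefault(name, []).append(value)
    return qualifiers


# Placeholder feature: identifiers and values below are invented for illustration.
example = [
    '     CDS             100..400',
    '                     /gene="geneX"',
    '                     /strand="leading"',
    '                     /CAI=0.512',
    '                     /db_xref="SWISS-PROT:P00000"',
    '                     /gene_family="HBG000000"',
]

q = parse_qualifiers(example)
print(q["strand"], q["CAI"], q["gene_family"])
```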

Operon features

For E. coli, we have added the data on regulatory sites, promoters and operon predictions provided by the Regulon database. These data are available under the /protein_bind, /promoter and /operon qualifiers, respectively.


