Codon usage in EMGLib


To estimate codon usage bias in the species included in EMGLib, we used the Codon Adaptation Index (CAI) defined by Sharp and Li (1987). The CAI is computed as:

$$\mathrm{CAI} = \exp\left( \sum_{i} f_i \ln W_i \right)$$

where fi is the relative frequency of codon i in the coding sequence, and Wi is the ratio of the frequency of codon i to the frequency of the major codon for the same amino acid, as estimated from the highly expressed genes of the species considered.
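As an illustration, the CAI of a single coding sequence can be sketched as follows. This is a minimal sketch and not the code used for EMGLib: the function name `cai`, the choice to ignore codons that have no ln(Wi) entry (for instance stop codons), and the toy weights are all assumptions made for the example.

```python
import math
from collections import Counter

def cai(sequence, ln_w):
    """Return exp(sum_i f_i * ln W_i) for one coding sequence.

    ln_w maps each codon to ln(W_i). Codons without an entry in ln_w
    are ignored, which is an assumption of this sketch.
    """
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # drop an incomplete trailing codon
    codons = [seq[k:k + 3] for k in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in ln_w)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no scorable codon in sequence")
    # f_i is the relative frequency of codon i among the scored codons.
    return math.exp(sum(n / total * ln_w[c] for c, n in counts.items()))

# Toy weights invented for illustration (not taken from any EMGLib table).
toy_ln_w = {"AAA": 0.0, "AAG": math.log(0.5)}
print(round(cai("AAAAAGAAA", toy_ln_w), 4))  # 0.7937
```

With two AAA codons (W = 1) and one AAG codon (W = 0.5), the result is the geometric mean of the three weights, 0.5 to the power 1/3.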

Method

To build the CAI tables used in EMGLib, we computed for each species a correspondence analysis (CA) on the absolute codon frequencies of all genes with a length greater than or equal to 150 nucleotides. We then located the genes coding for ribosomal proteins on the CA factor maps and used them as probes to locate the highly expressed genes on those maps.
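The input to such an analysis is a genes-by-codons table of absolute counts. A minimal sketch of how that table could be assembled is given below; the correspondence analysis itself is not performed here. The function name `codon_count_table`, the dictionary input format, and the decision to keep all 64 codons as columns are assumptions of this example.

```python
from collections import Counter
from itertools import product

# All 64 codons as column labels; whether some of them (e.g. stop codons)
# were excluded in the original analysis is not stated, so this is an assumption.
CODONS = ["".join(p) for p in product("TCAG", repeat=3)]
MIN_LENGTH = 150  # nucleotides, as stated in the text

def codon_count_table(genes):
    """genes: dict mapping gene name -> coding sequence.

    Returns a dict mapping gene name -> list of 64 absolute codon counts,
    keeping only genes at least MIN_LENGTH nucleotides long.
    """
    table = {}
    for name, seq in genes.items():
        seq = seq.upper()
        if len(seq) < MIN_LENGTH:
            continue
        usable = len(seq) - len(seq) % 3
        counts = Counter(seq[k:k + 3] for k in range(0, usable, 3))
        table[name] = [counts[c] for c in CODONS]
    return table

# Hypothetical sequences for illustration only.
example = {"long_gene": "ATG" + "AAA" * 60, "short_gene": "ATGAAATAA"}
table = codon_count_table(example)
print(sorted(table))  # ['long_gene']
```

Each row of the resulting table corresponds to one gene and can be passed to any correspondence analysis routine.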

After identifying the factor(s) separating ribosomal protein genes from the others, we used the codon frequencies of the genes with the highest score(s) on this (these) factor(s) to build the CAI reference tables. Care was taken to use the same number of codons (approximately 5000) from the leading and the lagging strands, so that our CAI values represent an expressivity index rather than a strand index. Indeed, codon usage is known to depend on the strand on which a sequence is located.
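One possible way to balance the two strands is sketched below. It is only an illustration of the idea, not the procedure actually followed: the function name `pick_reference_genes`, the tuple layout, the assumption that a higher factor score means a position closer to the ribosomal protein genes, and the stopping rule (stop adding genes from a strand once its codon total reaches the target) are all hypothetical.

```python
def pick_reference_genes(genes, target_codons=5000):
    """genes: iterable of (name, strand, score, n_codons) tuples,
    with strand equal to "leading" or "lagging".

    For each strand, genes are taken by decreasing factor score until the
    running codon total for that strand reaches target_codons.
    """
    selected = {"leading": [], "lagging": []}
    totals = {"leading": 0, "lagging": 0}
    for name, strand, score, n_codons in sorted(
        genes, key=lambda g: g[2], reverse=True
    ):
        if totals[strand] < target_codons:
            selected[strand].append(name)
            totals[strand] += n_codons
    return selected, totals

# Hypothetical genes and scores, for illustration only.
genes = [
    ("a", "leading", 2.0, 3000), ("b", "leading", 1.5, 2500),
    ("c", "leading", 1.0, 1000), ("d", "lagging", 1.8, 4000),
    ("e", "lagging", 0.9, 1500), ("f", "lagging", 0.1, 800),
]
selected, totals = pick_reference_genes(genes)
print(selected)  # {'leading': ['a', 'b'], 'lagging': ['d', 'e']}
print(totals)    # {'leading': 5500, 'lagging': 5500}
```

Because whole genes are added, the totals only approximate the target; an exact match between strands would require trimming or a different selection rule.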

Results

So far, we have used this method to compute CAI tables for the eight eubacterial species in which it was possible to separate the genes according to the strand on which they are located. In the following table, the Genes column links to the files listing the EMGLib sequence names of the genes used to build the CAI table; the Table column gives access to the CAI reference tables; and the Factor map column links to a picture (in GIF format) of the factor maps used to detect the highly expressed genes in each species.

Species            Genes         Table         Factor map
B. burgdorferi     bb-list.txt   bb-cai.txt    bb-ca.gif
B. subtilis        bs-list.txt   bs-cai.txt    bs-ca.gif
E. coli            ec-list.txt   ec-cai.txt    ec-ca.gif
H. influenzae      hi-list.txt   hi-cai.txt    hi-ca.gif
H. pylori          hp-list.txt   hp-cai.txt    hp-ca.gif
M. genitalium      mg-list.txt   mg-cai.txt    mg-ca.gif
M. pneumoniae      mp-list.txt   mp-cai.txt    mp-ca.gif
M. tuberculosis    mt-list.txt   mt-cai.txt    mt-ca.gif

In the CAI reference tables, the first column lists the amino acids in decreasing order of their number of synonymous codons (i.e., amino acids encoded by sextets are listed first, then those encoded by quartets, and so on). The second column lists the codons. The third column contains the value of ln(Wi). The last column gives the absolute frequencies of the codons in the data set. Because some codons were not found in the genes of some species, we assigned them a frequency of 0.5 so that ln(Wi) could be computed.
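The computation of ln(Wi) from the absolute frequencies, including the 0.5 replacement for unobserved codons, can be sketched as follows. The function name `ln_w_table`, its input format, and the toy counts are assumptions made for this example; only the 0.5 value comes from the text above.

```python
import math

def ln_w_table(counts, synonymous_families, pseudo=0.5):
    """counts: dict codon -> absolute frequency in the reference gene set.
    synonymous_families: dict amino acid -> list of its synonymous codons.

    Returns a dict codon -> ln(W_i), where W_i is the frequency of codon i
    divided by the frequency of the major codon for the same amino acid.
    """
    ln_w = {}
    for codons in synonymous_families.values():
        freqs = {}
        for c in codons:
            observed = counts.get(c, 0)
            # Unobserved codons receive a frequency of 0.5 so that
            # the logarithm is defined.
            freqs[c] = observed if observed > 0 else pseudo
        major = max(freqs.values())
        for c, f in freqs.items():
            ln_w[c] = math.log(f / major)
    return ln_w

# Toy counts invented for illustration; TTT is never observed.
families = {"Lys": ["AAA", "AAG"], "Phe": ["TTT", "TTC"]}
counts = {"AAA": 80, "AAG": 20, "TTC": 50}
for codon, value in ln_w_table(counts, families).items():
    print(codon, round(value, 3))
```

In this toy case the major codons AAA and TTC get ln(Wi) = 0, AAG gets ln(0.25), and the unobserved TTT gets ln(0.5/50) rather than an undefined value.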

On the factor maps, the genes coding for ribosomal proteins are shown by red circles while the other genes are shown by yellow crosses. The factors used are given in each picture.

