PBIL

Correspondence Analysis

BBE contribution to PBIL in Lyon, France


Aim of the method

Correspondence Analysis (CA) is a method suited to studying differentiation among codons or examining trends in amino-acid composition. For instance, CA can be used to detect highly expressed genes in unicellular organisms (e.g., bacteria, yeast) or in some simple multicellular organisms (e.g., nematode, fruit fly). CA computations on the PBIL server are performed through a Web implementation of R called Rweb. This page describes, with an example, the steps required to run this kind of analysis on a set of sequences selected from the server.

Example

These commands compute a CA on codon composition for all the coding DNA sequences (CDS) from Mycoplasma genitalium and display the factor map of the first two axes for the genes and for the codons. Just click on the "Submit" button below to see the result.

You can modify these parameters in order to compute CA on other species, with other data banks or with different options for the graphics.

R code explanations

Selection and retrieval of a set of sequences

The first step is to select the sequences for which we want to compute the codon usage:
    choosebank(bank = "emglib")
    req <- query("liste", "sp=mycoplasma genitalium et t=cds")
    seqs <- lapply(liste$req[], getSequence)
The first line selects the data bank from which the sequences to be analysed are retrieved (in this example, EMGLib). The complete list of accessible banks is available here. The second line runs a query that retrieves the list of all the CDS from M. genitalium. Information on how to compose queries in order to retrieve sequences through the SeqinR interface is available here. Finally, the third line retrieves the sequences themselves.
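The rest of the analysis works on seqs as a list with one element per CDS, each element being a vector of single characters. The following is a minimal sketch, in base R only, of what such a list looks like and of a quick check that each sequence can be read as whole codons. The sequences are made up for illustration, and in_frame is a hypothetical helper, not a SeqinR function.

```r
# Made-up stand-in for the retrieved sequences (not real M. genitalium data).
toy <- list(
  ok        = strsplit("atggctaaataa", "")[[1]],  # 12 bases: 4 complete codons
  truncated = strsplit("atggctaaat", "")[[1]]     # 10 bases: not a multiple of 3
)

# Hypothetical helper: TRUE when a sequence is non-empty, contains only
# a/c/g/t and has a length that is a multiple of 3.
in_frame <- function(x) {
  length(x) > 0 &&
    length(x) %% 3 == 0 &&
    all(tolower(x) %in% c("a", "c", "g", "t"))
}

keep    <- vapply(toy, in_frame, logical(1))
seqs_ok <- toy[keep]   # sequences that can be cut into complete codons
print(keep)
```

Partial CDS entries may fail such a check, so filtering them out before counting codons is one possible design choice. It is not something the original commands do.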

Analysis computation

Then the analysis is computed:
    tabco <- lapply(seqs, uco)
    tabco <- as.data.frame(lapply(tabco, as.vector), row.names = names(tabco[[1]]))    
    names(tabco) <- liste$req[]
    ca <- dudi.coa(tabco, scannf = FALSE, nf = 3)
The uco function computes the codon counts for each sequence stored in seqs. Note that this function can also compute relative frequencies (for more information, see the uco documentation page). The resulting list is then converted into a data frame, with codons as rows and sequences as columns, so that it can be used by ADE-4. The last line computes the CA itself. The options used skip the interactive eigenvalue plot and keep only the first three axes of the analysis (see the dudi.coa documentation page).
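To make the two computations above more concrete, here is a minimal sketch in base R of a codon count table followed by a correspondence analysis written as a singular value decomposition. It is not the code of uco or dudi.coa: the sequences are made up, count_codons is a hypothetical helper, and the coordinates it produces may differ from those of ADE-4 in sign or scaling.

```r
# Made-up sequences, each stored as a vector of single characters.
toy <- c(geneA = "atggctaaagcttaa",
         geneB = "atgaaaaaattttaa",
         geneC = "atggccgcgggctaa",
         geneD = "atgaaagctttttaa")
seqs <- lapply(toy, function(s) strsplit(s, "")[[1]])

# The 64 possible codons, used as factor levels so absent codons count as 0.
bases  <- c("a", "c", "g", "t")
codons <- as.vector(outer(outer(bases, bases, paste0), bases, paste0))

# Hypothetical helper: count non-overlapping triplets starting at position 1.
count_codons <- function(x) {
  s <- paste(x, collapse = "")
  starts <- seq(1, nchar(s) - 2, by = 3)
  as.vector(table(factor(substring(s, starts, starts + 2), levels = codons)))
}

tab <- sapply(seqs, count_codons)   # codons in rows, genes in columns
rownames(tab) <- codons
tab <- tab[rowSums(tab) > 0, ]      # drop codons never observed in the toy data

# Correspondence analysis as an SVD of the standardized residuals.
P  <- tab / sum(tab)                # relative frequencies over the whole table
rw <- rowSums(P)                    # codon (row) weights
cw <- colSums(P)                    # gene (column) weights
S  <- (P - outer(rw, cw)) / sqrt(outer(rw, cw))
sv <- svd(S)

eig       <- sv$d^2                                          # inertia per axis
row_coord <- sweep(sv$u, 1, sqrt(rw), "/") %*% diag(sv$d)    # codon coordinates
col_coord <- sweep(sv$v, 1, sqrt(cw), "/") %*% diag(sv$d)    # gene coordinates

print(round(eig / sum(eig), 3))     # share of total inertia carried by each axis
```

Keeping only the first few columns of row_coord and col_coord plays the same role as choosing a small number of axes in the analysis.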

Graphics plotting

To visualize the results, the following lines plot the factor maps:
    s.label(ca$co, clabel = 0, sub = "Genes F1xF2 map")
    s.label(ca$li, sub = "Codons F1xF2 map")
Here, the factor maps of the first two axes are displayed for the genes (first line) and for the codons (second line). On the codon map, each point is labelled with the corresponding codon; on the gene map, labels are suppressed (clabel = 0). We can see that the first axis separates the GC-ending codons from the AT-ending ones. The trend represented by this axis is therefore probably the GC content of the genes.
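One way to check this reading of the first axis is to group the codons by their third base. The sketch below uses base R only. third_base_class is a hypothetical helper introduced for illustration, and the four codons are arbitrary examples rather than results of the analysis.

```r
# Hypothetical helper: label each codon by the nature of its third base.
third_base_class <- function(codon) {
  third <- toupper(substr(codon, 3, 3))
  ifelse(third %in% c("G", "C"), "GC-ending", "AT-ending")
}

example_codons <- c("aaa", "gcc", "ttg", "tat")
classes <- third_base_class(example_codons)
print(data.frame(codon = example_codons, class = classes))
```

Assuming the row names of ca$li are the codons, a call such as boxplot(ca$li[, 1] ~ third_base_class(rownames(ca$li))) would then compare the first-axis coordinates of the two groups.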

