Correspondence Analysis
You can modify these parameters to compute the CA on other species, with other data banks, or with different graphics options.
choosebank(bank = "emglib")
req <- query("liste", "sp=mycoplasma genitalium et t=cds")
seqs <- lapply(liste$req[], getSequence)
The first line selects the data bank from which the sequences to be
analysed are retrieved (in this example, EMGLib). The
complete list of the accessible banks is available
here. The second line runs a query
that retrieves the list of all the CDS from M. genitalium.
Information on how to compose queries to retrieve sequences through the SeqinR
interface is available
here. Finally, the
third line retrieves the sequences themselves.
tabco <- lapply(seqs, uco)
tabco <- as.data.frame(lapply(tabco, as.vector), row.names = names(tabco[[1]]))
names(tabco) <- liste$req[]
ca <- dudi.coa(tabco, scannf = FALSE, nf = 3)
The function uco computes the codon counts for all the sequences
stored in seqs. Note that this function can also compute
relative frequencies (for more information, see the
uco
documentation page). The variable tabco containing the counts
is then converted into a data frame so that it can be used by ADE-4. The
last line computes the CA itself. The options used mean
that only the first three axes of the analysis are kept
(see the
dudi.coa
documentation page).
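To make the counting step more concrete, here is a minimal base-R sketch of the kind of table that ends up in tabco: one row per codon, one column per gene. It does not call uco. The helper count_codons() and the two toy sequences are hypothetical, and the sequences are assumed to be lower-case, in frame, and free of ambiguous bases.

```r
# Toy illustration in base R. count_codons() is a hypothetical helper,
# not a SeqinR function, and the sequences below are made up.
bases <- c("a", "c", "g", "t")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
# All 64 codons in alphabetical order: "aaa", "aac", ..., "ttt"
all_codons <- paste0(grid$first, grid$second, grid$third)

count_codons <- function(dna) {
  n <- nchar(dna) %/% 3                        # number of complete codons
  starts <- seq(1, by = 3, length.out = n)     # start position of each codon
  codons <- substring(dna, starts, starts + 2)
  # Fixed factor levels keep all 64 codons, including those with a zero count
  as.vector(table(factor(codons, levels = all_codons)))
}

toy_seqs <- c(geneA = "atgaaaaaataa", geneB = "atggcgccgtag")

# Same shape as tabco: codons in rows, genes in columns
toy_tab <- as.data.frame(lapply(toy_seqs, count_codons),
                         row.names = all_codons)

# Show only the codons that occur at least once
print(toy_tab[rowSums(toy_tab) > 0, ])
```

Keeping the zero-count codons matters because every gene must contribute a vector of the same length before the columns can be bound into a single table.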
s.label(ca$co, clabel = 0, sub = "Genes F1xF2 map")
s.label(ca$li, sub = "Codons F1xF2 map")
Here, the factor maps for the first two axes are displayed for the
genes (first line) and the codons (second line).
On the codon map, each point is labelled with the corresponding
codon. We can see that the first axis separates
the GC-ending codons from the AT-ending ones. The trend
represented by this axis is therefore probably the GC-content of the genes.
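One way to probe this interpretation is to compute, for each gene, the fraction of codons ending in G or C (GC3) and compare it with the gene coordinates on the first axis. The sketch below is self-contained and uses assumptions throughout: the count table is invented, and the first axis is obtained with a plain singular value decomposition of the standardised residuals in base R rather than with dudi.coa.

```r
# Toy codon-by-gene count table (invented numbers): two genes favour
# AT-ending codons, two favour GC-ending codons.
codons <- c("gca", "gcc", "gcg", "gct", "gga", "ggc", "ggg", "ggt")
tab <- matrix(c(
  9, 1, 1, 9, 8, 2, 1, 9,    # gene1
  8, 2, 2, 8, 9, 1, 2, 8,    # gene2
  2, 8, 9, 1, 1, 9, 8, 2,    # gene3
  1, 9, 8, 2, 2, 8, 9, 1),   # gene4
  nrow = 8, dimnames = list(codons, paste0("gene", 1:4)))

# Correspondence analysis by hand: SVD of the standardised residuals
P <- tab / sum(tab)                 # relative frequencies
r <- rowSums(P)                     # codon (row) weights
cm <- colSums(P)                    # gene (column) weights
expected <- outer(r, cm)            # frequencies expected under independence
S <- (P - expected) / sqrt(expected)
dec <- svd(S)

# Principal coordinates on the first axis (the overall sign is arbitrary)
codon_axis1 <- dec$u[, 1] * dec$d[1] / sqrt(r)
gene_axis1 <- dec$v[, 1] * dec$d[1] / sqrt(cm)

# GC3: fraction of each gene's codons that end in G or C
gc_ending <- substr(codons, 3, 3) %in% c("g", "c")
gc3 <- colSums(tab[gc_ending, ]) / colSums(tab)

print(round(gc3, 3))
print(cor(gene_axis1, gc3))         # close to +1 or -1 if axis 1 tracks GC3
print(split(round(codon_axis1, 3), ifelse(gc_ending, "GC-ending", "AT-ending")))
```

With the real objects from this page, the analogous check would compute GC3 from the columns of tabco and compare it with ca$co[, 1]. A correlation close to 1 in absolute value would support the GC-content reading of the first axis, although it would not rule out other correlated factors.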