This page allows for the on-line reproduction of some results from the paper: Perrière, G., Lobry, J.R., Thioulouse, J. (1996) Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. CABIOS, 12:519-524 (CABIOS is now Bioinformatics).
Abstract: This report describes two applications of a multivariate method for studying classes of nucleotide or protein sequences, correspondence discriminant analysis (CDA). The first example is the discrimination between Escherichia coli proteins according to their subcellular location (membrane, cytoplasm and periplasm). The high resolution of the method made it possible to predict the subcellular location of E.coli proteins for whom this information is not known. The second example is discrimination between the coding sequences of leading and lagging strands in four bacteria, Mycoplasma genitalium, Haemophilus influenzae, E.coli and Bacillus subtilis. The programs used for computing the analysis are integrated in a publicly available package that runs on MacOS 7.x or Windows 95 operating systems (http://biomserv.univ-lyon1.fr/ADE-4.html). These programs are also accessible through our World Wide Web server (http://biomserv.univ-lyon1.fr/NetMul.html).
Factorial map of the two discriminant axes of the analysis on 413 E. coli proteins. Each protein is represented by a dot linked by a line to the gravity center of the group it belongs to. The first axis discriminates Membrane Proteins (MP) from Cytoplasmic Proteins (CP) and Periplasmic Proteins (PP), while the second axis discriminates PP from CP and MP.
Factor scores for the amino acids on the two axes of the discriminant analysis on 413 E. coli proteins, and an example of protein factor score computation. Columns Ai1 and Ai2 contain the amino acid factor scores on the two discriminant axes, Ni. contains the absolute amino acid frequencies in the whole data set, and Nij (V4) contains the absolute amino acid frequencies in protein AraJ (P23910). The factor score of AraJ on the two axes of the analysis is computed using equation (2) of the paper, with N.. and N.j equal to the sums of the Ni. and Nij columns of the table, respectively. The threshold between MP and non-MP is -0.024, and the threshold between PP and non-PP is 0.617.
               Ai1          Ai2     Ni.  Nij    Ai1*Nij/Ni.    Ai2*Nij/Ni.
Arg  -0.0224444307  -0.19632825    7694   11  -3.208848e-05  -2.806876e-04
Ala  -0.0008848707  -0.07402600   16280   49  -2.663308e-06  -2.228055e-04
Gln   0.2282808512   0.49864634    6436    7   2.482856e-04   5.423437e-04
Cys  -0.1147018450  -0.21666135    1399    5  -4.099423e-04  -7.743436e-04
Leu   0.2386590780  -0.26550447   17383   57   7.825788e-04  -8.706066e-04
Gly   0.2409532279  -0.15998687   13083   44   8.103602e-04  -5.380587e-04
His  -0.4109746141  -0.06306444    3204    5  -6.413462e-04  -9.841517e-05
Phe   0.1555698783  -0.43662726    7576   29   5.955024e-04  -1.671356e-03
Ser   0.1349925385   0.56593170    9296   28   4.066040e-04   1.704614e-03
Val   0.2826027346   0.04519071   12549   28   6.305583e-04   1.008319e-04
Glu  -0.7152045402  -1.45179243    8601    7  -5.820755e-04  -1.181554e-03
Ile   0.4126975144  -0.82137277   10392   28   1.111964e-03  -2.213091e-03
Thr   0.0830666145   0.03665992    8912   15   1.398114e-04   6.170319e-05
Lys  -0.2080301637   1.50979787    7381   12  -3.382146e-04   2.454623e-03
Asp  -0.8409564915  -0.23896277    7931    4  -4.241364e-04  -1.205209e-04
Met   0.2340209938   0.07286442    5122   22   1.005166e-03   3.129670e-04
Pro  -0.2304634610   0.93564105    7066   14  -4.566216e-04   1.853803e-03
Asn   0.1506997535   0.33397438    6226   10   2.420491e-04   5.364189e-04
Tyr   0.1799142335  -0.29107730    4894   15   5.514331e-04  -8.921454e-04
Trp   0.2022029988   0.21669752    2774    4   2.915689e-04   3.124694e-04
Sum             NA           NA  164199  394   3.928794e-03  -9.838096e-04
Fjk             NA           NA      NA   NA   1.637320e+00  -4.100014e-01
Note that there is a problem here: the results are not exactly the same as in the paper. The total number of amino acids is 164,199 here versus 164,879 in the paper, so 680 amino acids are unaccounted for.
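The AraJ computation in the table above can be sketched in a few lines of Python. Equation (2) is not reproduced on this page, so the form used here, Fjk = (N.. / N.j) * sum over i of (Aik * Nij / Ni.), is an assumption inferred from the table columns; it does reproduce the Sum and Fjk rows. The names ROWS and factor_scores are illustrative only. Which side of each threshold corresponds to which class is not stated here, so the sketch only reports the comparison.

```python
# Rows copied from the table: (amino acid, Ai1, Ai2, Ni., Nij for AraJ).
ROWS = [
    ("Arg", -0.0224444307, -0.19632825, 7694, 11),
    ("Ala", -0.0008848707, -0.07402600, 16280, 49),
    ("Gln", 0.2282808512, 0.49864634, 6436, 7),
    ("Cys", -0.1147018450, -0.21666135, 1399, 5),
    ("Leu", 0.2386590780, -0.26550447, 17383, 57),
    ("Gly", 0.2409532279, -0.15998687, 13083, 44),
    ("His", -0.4109746141, -0.06306444, 3204, 5),
    ("Phe", 0.1555698783, -0.43662726, 7576, 29),
    ("Ser", 0.1349925385, 0.56593170, 9296, 28),
    ("Val", 0.2826027346, 0.04519071, 12549, 28),
    ("Glu", -0.7152045402, -1.45179243, 8601, 7),
    ("Ile", 0.4126975144, -0.82137277, 10392, 28),
    ("Thr", 0.0830666145, 0.03665992, 8912, 15),
    ("Lys", -0.2080301637, 1.50979787, 7381, 12),
    ("Asp", -0.8409564915, -0.23896277, 7931, 4),
    ("Met", 0.2340209938, 0.07286442, 5122, 22),
    ("Pro", -0.2304634610, 0.93564105, 7066, 14),
    ("Asn", 0.1506997535, 0.33397438, 6226, 10),
    ("Tyr", 0.1799142335, -0.29107730, 4894, 15),
    ("Trp", 0.2022029988, 0.21669752, 2774, 4),
]

# Thresholds quoted in the table caption.
MP_THRESHOLD = -0.024
PP_THRESHOLD = 0.617


def factor_scores(rows):
    """Return the protein's scores on both axes (assumed form of equation (2))."""
    n_total = sum(ni for _, _, _, ni, _ in rows)      # N.. (sum of the Ni. column)
    n_protein = sum(nij for _, _, _, _, nij in rows)  # N.j (sum of the Nij column)
    scale = n_total / n_protein
    f1 = scale * sum(a1 * nij / ni for _, a1, _, ni, nij in rows)
    f2 = scale * sum(a2 * nij / ni for _, _, a2, ni, nij in rows)
    return f1, f2


f1, f2 = factor_scores(ROWS)
print(f"axis 1: {f1:.6f} (above MP threshold: {f1 > MP_THRESHOLD})")
print(f"axis 2: {f2:.6f} (above PP threshold: {f2 > PP_THRESHOLD})")
```

With the values from the table, the two scores should agree with the Fjk row to within rounding of the printed factor scores.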
Distribution of the factor scores on the discriminant axis of the coding sequences belonging to the leading and lagging groups.
Discriminant power of codons. Each point represents the discriminant score of one codon; a positive value means that the codon is more frequent in leading than in lagging coding sequences. Codons are grouped by amino acid according to the one-letter code at the bottom of the figure. White dots represent codons with a keto base (G or T) in their third codon position, while red dots represent codons with an amino base (A or C).
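The keto/amino split used to color the dots can be sketched as follows. This is only an illustration of the third-position rule stated in the caption; the function name third_base_class is hypothetical, and treating U as T for RNA input is an added assumption.

```python
from itertools import product


def third_base_class(codon):
    """Classify a codon by its third base: keto (G or T) or amino (A or C)."""
    third = codon.upper().replace("U", "T")[2]  # assumption: RNA U is read as T
    if third in "GT":
        return "keto"
    if third in "AC":
        return "amino"
    raise ValueError(f"unexpected base in codon {codon!r}")


# Enumerate all 64 codons and count how many fall in each class.
codons = ["".join(p) for p in product("TCAG", repeat=3)]
counts = {"keto": 0, "amino": 0}
for codon in codons:
    counts[third_base_class(codon)] += 1

print(counts)
```

Each class covers half of the 64 codons, since two of the four possible third bases fall in each group.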