Perrière & Lobry & Thioulouse (1996) CABIOS 12:519-524

This page allows for the on-line reproduction of some results from the paper: Perrière, G., Lobry, J.R., Thioulouse, J. (1996) Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. CABIOS, 12:519-524 (CABIOS is now Bioinformatics).

Abstract: This report describes two applications of a multivariate method for studying classes of nucleotide or protein sequences, correspondence discriminant analysis (CDA). The first example is the discrimination between Escherichia coli proteins according to their subcellular location (membrane, cytoplasm and periplasm). The high resolution of the method made it possible to predict the subcellular location of E.coli proteins for whom this information is not known. The second example is discrimination between the coding sequences of leading and lagging strands in four bacteria, Mycoplasma genitalium, Haemophilus influenzae, E.coli and Bacillus subtilis. The programs used for computing the analysis are integrated in a publicly available package that runs on MacOS 7.x or Windows 95 operating systems (http://biomserv.univ-lyon1.fr/ADE-4.html). These programs are also accessible through our World Wide Web server (http://biomserv.univ-lyon1.fr/NetMul.html).

Protein data set

Figure 1

Factorial map of the two discriminant axes of the analysis on 413 E. coli proteins. Each protein is represented by a dot linked by a line to the center of gravity of the group it belongs to. The first axis discriminates Membrane Proteins (MP) from Cytoplasmic Proteins (CP) and Periplasmic Proteins (PP), while the second axis discriminates PP from CP and MP.
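
The "dot linked to its center of gravity" layout described in this legend can be sketched in base R as follows. This is only an illustration on simulated scores: the `scores` matrix, the `group` factor and the helper names `group_centroids` and `draw_star_map` are hypothetical and are not part of the original page. The ade4 package offers `s.class` for this kind of display, which is probably closer to what was used for the published figure.

```r
# Simulated factor scores on two axes for three hypothetical groups
set.seed(1)
group <- factor(rep(c("MP", "CP", "PP"), times = c(30, 30, 30)))
scores <- cbind(axis1 = rnorm(90) + rep(c(2, -1, -1), each = 30),
                axis2 = rnorm(90) + rep(c(0, -1, 2), each = 30))

# Center of gravity (mean coordinates) of each group, one row per group
group_centroids <- function(xy, fac) {
  apply(xy, 2, function(col) tapply(col, fac, mean))
}

# Draw each point and link it to the center of gravity of its group
draw_star_map <- function(xy, fac) {
  ctr <- group_centroids(xy, fac)
  plot(xy, pch = 20, col = as.integer(fac), xlab = "Axis 1", ylab = "Axis 2")
  segments(xy[, 1], xy[, 2],
           ctr[as.character(fac), 1], ctr[as.character(fac), 2],
           col = as.integer(fac))
  text(ctr, labels = rownames(ctr), cex = 1.5)
  invisible(ctr)
}

ctr <- draw_star_map(scores, group)
```

The function returns the matrix of group centers invisibly, so the same coordinates can be reused, for instance to label the groups on another plot.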

Table 1

Factor scores for the amino acids on the two axes of the discriminant analysis on 413 E. coli proteins and example of protein factor score computation. Columns A_i¹ and A_i² contain the amino acid factor scores on the two discriminant axes, N_i. contains the absolute amino acid frequencies in the whole data set, and N_ij (V4) contains the absolute amino acid frequencies in protein AraJ (P23910). The factor score of AraJ on the two axes of the analysis is computed using equation (2), with N_.. and N_.j respectively equal to the sum of the N_i. and the N_ij columns of the table. The threshold value between MP and non-MP is -0.024, and the threshold value between PP and non-PP is 0.617.

              Ai1         Ai2    Ni. Nij   Ai1*Nij/Ni.   Ai2*Nij/Ni.
Arg -0.0224444307 -0.19632825   7694  11 -3.208848e-05 -2.806876e-04
Ala -0.0008848707 -0.07402600  16280  49 -2.663308e-06 -2.228055e-04
Gln  0.2282808512  0.49864634   6436   7  2.482856e-04  5.423437e-04
Cys -0.1147018450 -0.21666135   1399   5 -4.099423e-04 -7.743436e-04
Leu  0.2386590780 -0.26550447  17383  57  7.825788e-04 -8.706066e-04
Gly  0.2409532279 -0.15998687  13083  44  8.103602e-04 -5.380587e-04
His -0.4109746141 -0.06306444   3204   5 -6.413462e-04 -9.841517e-05
Phe  0.1555698783 -0.43662726   7576  29  5.955024e-04 -1.671356e-03
Ser  0.1349925385  0.56593170   9296  28  4.066040e-04  1.704614e-03
Val  0.2826027346  0.04519071  12549  28  6.305583e-04  1.008319e-04
Glu -0.7152045402 -1.45179243   8601   7 -5.820755e-04 -1.181554e-03
Ile  0.4126975144 -0.82137277  10392  28  1.111964e-03 -2.213091e-03
Thr  0.0830666145  0.03665992   8912  15  1.398114e-04  6.170319e-05
Lys -0.2080301637  1.50979787   7381  12 -3.382146e-04  2.454623e-03
Asp -0.8409564915 -0.23896277   7931   4 -4.241364e-04 -1.205209e-04
Met  0.2340209938  0.07286442   5122  22  1.005166e-03  3.129670e-04
Pro -0.2304634610  0.93564105   7066  14 -4.566216e-04  1.853803e-03
Asn  0.1506997535  0.33397438   6226  10  2.420491e-04  5.364189e-04
Tyr  0.1799142335 -0.29107730   4894  15  5.514331e-04 -8.921454e-04
Trp  0.2022029988  0.21669752   2774   4  2.915689e-04  3.124694e-04
Sum            NA          NA 164199 394  3.928794e-03 -9.838096e-04
Fjk            NA          NA     NA  NA  1.637320e+00 -4.100014e-01
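
The last two rows of the table can be checked by hand. Judging from the script that builds Table 1 on this page, the factor score of protein j on axis k is obtained as F_jk = (N_.. / N_.j) × sum over i of (A_i^k × N_ij / N_i.). The sketch below retypes the printed columns and recomputes both scores for AraJ. The vector names are illustrative, and the formula is inferred from the script rather than copied from equation (2) of the paper.

```r
# Columns retyped from Table 1 (same row order as the table)
aa  <- c("Arg", "Ala", "Gln", "Cys", "Leu", "Gly", "His", "Phe", "Ser", "Val",
         "Glu", "Ile", "Thr", "Lys", "Asp", "Met", "Pro", "Asn", "Tyr", "Trp")
Ai1 <- c(-0.0224444307, -0.0008848707,  0.2282808512, -0.1147018450,
          0.2386590780,  0.2409532279, -0.4109746141,  0.1555698783,
          0.1349925385,  0.2826027346, -0.7152045402,  0.4126975144,
          0.0830666145, -0.2080301637, -0.8409564915,  0.2340209938,
         -0.2304634610,  0.1506997535,  0.1799142335,  0.2022029988)
Ai2 <- c(-0.19632825, -0.07402600,  0.49864634, -0.21666135, -0.26550447,
         -0.15998687, -0.06306444, -0.43662726,  0.56593170,  0.04519071,
         -1.45179243, -0.82137277,  0.03665992,  1.50979787, -0.23896277,
          0.07286442,  0.93564105,  0.33397438, -0.29107730,  0.21669752)
Ni  <- c(7694, 16280, 6436, 1399, 17383, 13083, 3204, 7576, 9296, 12549,
         8601, 10392, 8912, 7381, 7931, 5122, 7066, 6226, 4894, 2774)
Nij <- c(11, 49, 7, 5, 57, 44, 5, 29, 28, 28, 7, 28, 15, 12, 4, 22, 14, 10, 15, 4)
names(Ai1) <- names(Ai2) <- names(Ni) <- names(Nij) <- aa

# Total counts: whole data set (N..) and AraJ alone (N.j)
N_tot  <- sum(Ni)
N_araj <- sum(Nij)

# Factor scores of AraJ on the two discriminant axes
F1 <- N_tot / N_araj * sum(Ai1 * Nij / Ni)
F2 <- N_tot / N_araj * sum(Ai2 * Nij / Ni)
round(c(F1 = F1, F2 = F2), 3)
```

The two values agree with the Fjk row of the table (about 1.637 and -0.410) and can then be compared with the thresholds given in the legend, -0.024 on the first axis and 0.617 on the second.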

# Known problem with the EcMP file: entries 16:(16+14) are duplicated
# with entries 32:(32+14). An index is therefore appended to each name
# so that the row names stay unique.
EcMP <- read.table("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/Prot/EcMP.fra")
EcMPnames <- readLines("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/Prot/EcMP.lst")
rownames(EcMP) <- paste(EcMPnames, seq_along(EcMPnames), sep = "")
#
EcCP <- read.table("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/Prot/EcCP.fra")
EcCPnames <- readLines("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/Prot/EcCP.lst")
rownames(EcCP) <- EcCPnames
#
EcPP <- read.table("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/Prot/EcPP.fra")
EcPPnames <- readLines("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/Prot/EcPP.lst")
rownames(EcPP) <- EcPPnames
#
Ec <- rbind(EcMP, EcCP, EcPP)
# This is deduced from file Prot/EcAA.difa
names(Ec) <- read.table("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/Prot/EcAA.difa")$V1
#
# Need Sup data for AraJ (P23910)
#
Sup <- read.table("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/Prot/Sup.fra")
Supnames <- readLines("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/Prot/Sup.lst")
rownames(Sup) <- Supnames
names(Sup) <- read.table("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/Prot/EcAA.difa")$V1
locfac <- factor(rep(c("MP","CP","PP"),c(nrow(EcMP), nrow(EcCP), nrow(EcPP))))
#
# Run CDA
#
library(ade4)
afc <- dudi.coa(Ec, scann = FALSE, nf = 2)
cda <- discrimin(afc, locfac, scann = FALSE, nf = 2)
#
# Make table 1
#
araj <- t(Sup["P23910|ARAJ_ECOLI", ])  # amino acid counts in AraJ (Nij)
Ni <- colSums(Ec)                      # amino acid counts in the whole data set (Ni.)
table1 <- cbind(cda$fa[,1:2], Ni, araj, cda$fa[,1]*araj/Ni, cda$fa[,2]*araj/Ni)
names(table1) <- c("Ai1","Ai2","Ni.","Nij","Ai1*Nij/Ni.","Ai2*Nij/Ni.")
table1[21,] <- colSums(table1)
table1[21,1:2] <- c(NA,NA)
rownames(table1)[21] <- "Sum"
table1[22,] <- table1["Sum",]*sum(Ec)/sum(Sup["P23910|ARAJ_ECOLI", ])
table1[22,3:4] <- c(NA,NA)
rownames(table1)[22] <- "Fjk"
table1

Note that the results are not exactly the same as in the paper: the total number of amino acids is 164,199 here versus 164,879 in the paper, so 680 amino acids are unaccounted for. The origin of this difference has not been identified.
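
The script above mentions duplicated entries in the EcMP file. A check of this kind could be sketched as below, on a toy table rather than on the real data. The object names `toy` and `toy_names` are hypothetical, and nothing here establishes whether duplicates are related to the difference in amino acid totals.

```r
# Toy frequency table: rows are proteins, columns are amino acid counts
toy <- data.frame(Ala = c(10, 12, 10, 7),
                  Gly = c( 5,  9,  5, 3),
                  Leu = c( 8, 11,  8, 6))
toy_names <- c("P1", "P2", "P1", "P3")   # "P1" appears twice

# Flag entries whose name, or whose whole row of counts, was already seen
dup_name <- duplicated(toy_names)
dup_row  <- duplicated(toy)

# Total residue count with and without the repeated rows
total_all    <- sum(toy)
total_unique <- sum(toy[!dup_row, ])
c(all = total_all, unique = total_unique)
```

On the real data the same two calls would be applied to `EcMPnames` and `EcMP`, keeping in mind that two distinct proteins could in principle share identical counts.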

Codon data set

Introduction

The data in the Nucl folder are imported into R with the readNucl.r script. The correspondence discriminant analyses for the four species are then run with the runNucl.r script. Both scripts are sourced at the top of the code for Figure 2.

Figure 2

Distribution of the factor scores on the discriminant axis of the coding sequences belonging to the leading and lagging groups.

# Import data:
source("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/readNucl.r")
# Run analyses:
source("http://pbil.univ-lyon1.fr/members/lobry/repro/cabios96/runNucl.r")
# Plot result:
opar <- par(no.readonly = TRUE)
par(mfrow = c(2,2), mar = c(0.1,0.1,0.1,0.1))
#
# Mycoplasma genitalium:
#
xmin <- -3.8
xmax <- +3.7
xseq <- seq(from = xmin, to = xmax, length = 255)
plot(xseq, dnorm(xseq, mean = mean(Mgcda$li[Mgfac == "lea", 1]), 
sd = sd(Mgcda$li[Mgfac == "lea", 1])),
ylim = c(0,1), type = "l", xaxs = "i", xaxt = "n", yaxt = "n", yaxs = "i", lwd = 2)
lines(xseq, dnorm(xseq, mean = mean(Mgcda$li[Mgfac == "lag", 1]), 
sd = sd(Mgcda$li[Mgfac == "lag", 1])), col = "red", lwd = 2, lty = 2)
abline(v = 0)
text(x = xmin, y = 0.8, expression(italic("Mycoplasma\ngenitalium")), pos = 4, cex = 2)
legend("topright", inset = 0.01, c("leading", "lagging"), lty = 1:2, col = c("black", "red"))
#
# Haemophilus influenzae:
#
xmin <- -3.3
xmax <- +3.1
xseq <- seq(from = xmin, to = xmax, length = 255)
plot(xseq, dnorm(xseq, mean = mean(Hicda$li[Hifac == "lea", 1]), 
sd = sd(Hicda$li[Hifac == "lea", 1])),
ylim = c(0,1), type = "l", xaxs = "i", xaxt = "n", yaxt = "n", yaxs = "i", lwd = 2)
lines(xseq, dnorm(xseq, mean = mean(Hicda$li[Hifac == "lag", 1]), 
sd = sd(Hicda$li[Hifac == "lag", 1])), col = "red", lwd = 2, lty = 2)
abline(v = 0)
text(x = xmin, y = 0.8, expression(italic("Haemophilus\ninfluenzae")), pos = 4, cex = 2)
legend("topright", inset = 0.01, c("leading", "lagging"), lty = 1:2, col = c("black", "red"))
#
# Escherichia coli:
#
xmin <- -4.9
xmax <- +4.7
xseq <- seq(from = xmin, to = xmax, length = 255)
plot(xseq, dnorm(xseq, mean = mean(Eccda$li[Ecfac == "lea", 1]), 
sd = sd(Eccda$li[Ecfac == "lea", 1])),
ylim = c(0,1), type = "l", xaxs = "i", xaxt = "n", yaxt = "n", yaxs = "i", lwd = 2)
lines(xseq, dnorm(xseq, mean = mean(Eccda$li[Ecfac == "lag", 1]), 
sd = sd(Eccda$li[Ecfac == "lag", 1])), col = "red", lwd = 2, lty = 2)
abline(v = 0)
text(x = xmin, y = 0.8, expression(italic("Escherichia\ncoli")), pos = 4, cex = 2)
legend("topright", inset = 0.01, c("leading", "lagging"), lty = 1:2, col = c("black", "red"))
#
# Bacillus subtilis:
#
xmin <- -3.3
xmax <- +3.1
xseq <- seq(from = xmin, to = xmax, length = 255)
plot(xseq, dnorm(xseq, mean = mean(Bscda$li[Bsfac == "lea", 1]), 
sd = sd(Bscda$li[Bsfac == "lea", 1])),
ylim = c(0,1), type = "l", xaxs = "i", xaxt = "n", yaxt = "n", yaxs = "i", lwd = 2)
lines(xseq, dnorm(xseq, mean = mean(Bscda$li[Bsfac == "lag", 1]), 
sd = sd(Bscda$li[Bsfac == "lag", 1])), col = "red", lwd = 2, lty = 2)
abline(v = 0)
text(x = xmin, y = 0.8, expression(italic("Bacillus\nsubtilis")), pos = 4, cex = 2)
legend("topright", inset = 0.01, c("leading", "lagging"), lty = 1:2, col = c("black", "red"))

par(opar)
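
The four panels above repeat the same steps: fit a normal density to the factor scores of each strand group, then overlay the two curves. A helper along the following lines could factor this out. It is a sketch run on simulated scores; the function name `plot_strand_densities` and the objects `sim_scores` and `sim_strand` are hypothetical and do not appear in the original scripts.

```r
# Draw the two fitted normal densities for one species.
# `scores` are factor scores on the discriminant axis; `strand` is a factor
# with levels "lea" (leading) and "lag" (lagging), as in the script above.
plot_strand_densities <- function(scores, strand, xlim, n = 255) {
  xseq <- seq(from = xlim[1], to = xlim[2], length.out = n)
  fit <- function(level) {
    x <- scores[strand == level]
    list(mean = mean(x), sd = sd(x))
  }
  lea <- fit("lea")
  lag <- fit("lag")
  plot(xseq, dnorm(xseq, mean = lea$mean, sd = lea$sd), ylim = c(0, 1),
       type = "l", lwd = 2, xaxs = "i", yaxs = "i", xaxt = "n", yaxt = "n",
       xlab = "", ylab = "")
  lines(xseq, dnorm(xseq, mean = lag$mean, sd = lag$sd),
        col = "red", lwd = 2, lty = 2)
  abline(v = 0)
  invisible(list(lea = lea, lag = lag))
}

# Simulated scores standing in for, e.g., Eccda$li[, 1] and Ecfac
set.seed(42)
sim_strand <- factor(rep(c("lea", "lag"), times = c(200, 150)))
sim_scores <- c(rnorm(200, mean = 0.8), rnorm(150, mean = -1.1))
fits <- plot_strand_densities(sim_scores, sim_strand, xlim = c(-4, 4))
```

With the real objects, the call for E. coli would presumably be `plot_strand_densities(Eccda$li[, 1], Ecfac, xlim = c(-4.9, 4.7))`, followed by the `text` and `legend` calls of the original script.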

Figure 3

Discriminant power of codons. Each point represents the discriminant score of one codon; a positive value means that the codon is more frequent in leading than in lagging coding sequences. Codons are grouped by amino acid according to the one-letter code at the bottom of the figure. White dots represent codons with a keto base (G or T) in their third codon position, while red dots represent codons with an amino base (A or C).
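
The colour rule of this legend depends only on the third base of each codon. A vectorised version of it can be sketched in base R as follows; the function name `third_base_colour` is hypothetical and is given only as an illustration of the rule, not as the code used for the figure.

```r
# Colour of a codon's dot from its third base:
# keto bases (g, t) give "white"; amino bases (a, c) give "red"
third_base_colour <- function(codons) {
  third <- substr(tolower(codons), 3, 3)
  third[third == "u"] <- "t"          # accept the RNA spelling as well
  ifelse(third %in% c("g", "t"), "white", "red")
}

# The four alanine codons, plus one codon written with the RNA alphabet
third_base_colour(c("gct", "gcc", "gca", "gcg", "UUU"))
# → "white" "red" "red" "white" "white"
```

Because the function works on a whole character vector at once, it can be applied directly to all the synonymous codons of one amino acid.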

#
#
library(seqinr)
codons <- tolower(names(Bs))
codonsMg <- tolower(names(Mg))
# Change u to t
codons <- sapply(codons, function(x) { tmp <- s2c(x); tmp[tmp=="u"] <- "t"; c2s(tmp)})
codonsMg <- sapply(codonsMg, function(x) { tmp <- s2c(x); tmp[tmp=="u"] <- "t"; c2s(tmp)})
aaorder <- s2c("ARNDCQEGHILKFPSTYVW")
aafac <- factor(aaorder, levels = aaorder)

codfac <- sapply(codons, function(x) translate(s2c(x)))
codfac <- factor(codfac, levels = aaorder)

codfacMg <- sapply(codonsMg, function(x) translate(s2c(x), numcode = 4))
codfacMg <- factor(codfacMg, levels = aaorder)

#
# Escherichia coli:
#
plot(0, type = "n", xlim = c(1,19), ylim = c(-0.8,1.0), bty = "n", las = 1,
xaxt = "n")
legend("top",expression(italic("Escherichia coli")), bty = "n", cex = 2)
abline(h=0, lty = 3)
i <- 1
for(aa in aaorder[-19]){
  synonymous <- which(codfac == aa)
  isketo <- function(c) {
    tmp <- s2c(c)[3]
    if(tmp == "g" | tmp == "t") {return("white") }else{ return("red")}
  }
  colors <- sapply(codons[synonymous], isketo)
  segments(i, min(Eccda$fa[synonymous,]), i, max(Eccda$fa[synonymous,]))
  points(x = rep(i, length(synonymous)), y = Eccda$fa[synonymous,], cex = 1.5,
    col = "black", pch = 21, bg = colors)
  i <- i + 1
}
#
# Haemophilus influenzae:
#
plot(0, type = "n", xlim = c(1,19), ylim = c(-0.4,+0.6), bty = "n", las = 1,
xaxt = "n")
legend("topleft", expression(italic("Haemophilus influenzae")), bty = "n", cex = 2)
abline(h=0, lty = 3)
i <- 1
for(aa in aaorder[-19]){
  synonymous <- which(codfac == aa)
  isketo <- function(c) {
    tmp <- s2c(c)[3]
    if(tmp == "g" | tmp == "t") {return("white") }else{ return("red")}
  }
  colors <- sapply(codons[synonymous], isketo)
  segments(i, min(Hicda$fa[synonymous,]), i, max(Hicda$fa[synonymous,]))
  points(x = rep(i, length(synonymous)), y = Hicda$fa[synonymous,], cex = 1.5,
    col = "black", pch = 21, bg = colors)
  i <- i + 1
}
#
# Bacillus subtilis:
#
plot(0, type = "n", xlim = c(1,19), ylim = c(-0.8,+0.6), bty = "n", las = 1,
xaxt = "n")
legend("topleft", expression(italic("Bacillus subtilis")), bty = "n", cex = 2)
abline(h=0, lty = 3)
i <- 1
for(aa in aaorder[-19]){
  synonymous <- which(codfac == aa)
  isketo <- function(c) {
    tmp <- s2c(c)[3]
    if(tmp == "g" | tmp == "t") {return("white") }else{ return("red")}
  }
  colors <- sapply(codons[synonymous], isketo)
  segments(i, min(Bscda$fa[synonymous,]), i, max(Bscda$fa[synonymous,]))
  points(x = rep(i, length(synonymous)), y = Bscda$fa[synonymous,], cex = 1.5,
    col = "black", pch = 21, bg = colors)
  i <- i + 1
}
#
# Mycoplasma genitalium:
#
plot(0, type = "n", xlim = c(1,19), ylim = c(-1.1,+1), bty = "n", las = 1,
xaxt = "n")
legend("topleft", expression(italic("Mycoplasma genitalium")), bty = "n", cex = 2)
abline(h=0, lty = 3)
i <- 1
for(aa in aaorder){
  synonymous <- which(codfacMg == aa)
  isketo <- function(c) {
    tmp <- s2c(c)[3]
    if(tmp == "g" | tmp == "t") {return("white") }else{ return("red")}
  }
  colors <- sapply(codonsMg[synonymous], isketo)
  segments(i, min(Mgcda$fa[synonymous,]), i, max(Mgcda$fa[synonymous,]))
  points(x = rep(i, length(synonymous)), y = Mgcda$fa[synonymous,], cex = 1.5,
    col = "black", pch = 21, bg = colors)
  i <- i + 1
}
#
# Show aa:
#
for(i in 1:19) text(x = i, y = -1.5, labels = aaorder[i], xpd = NA, cex = 1.5)
par(opar)

If you have any problems or comments, please contact Jean Lobry.