Lobry (1995) JME 40:326

This page allows for the on-line reproduction (and some updates) of the figures in the paper: Lobry, J.R. (1995) Properties of a general model of DNA evolution under no-strand bias conditions. Journal of Molecular Evolution, 40:326-330; 41:680.

Abstract: Under the hypothesis of no-strand-bias conditions, the Watson and Crick base-pairing rule decreases the complexity of models of DNA evolution by reducing to six the maximum number of substitution rates. It was shown that intrastrand equimolarity between A and T (A*=T*) and between G and C (G*=C*) is a general asymptotic property of this class of models. This statistical prediction was observed on 60 long genomic fragments (> 50 kbp) from various kingdoms, even when the effect of the two opposite orientations for coding sequences is removed. The practical consequence of the model for estimating the expected number of substitutions per site between two homologous DNA sequences is discussed.

1. The pattern of nucleotide substitutions

Figure 1 showed the pattern of nucleotide substitution, in percent, estimated from 13 pseudogene sequences by Li et al. (1984). Connected substitution rates should be equal under the PR1 hypothesis. The column and row orders were wrong in the original figure (i.e. A T G C instead of A T C G); this is corrected here. The correction doesn't change anything for connected values.

#
# Figure 1 with Li et al. 1984 data
#
r <- 0.05
circle <- function(x, y, r)
{
  a <- seq(from = 0, to = 2*pi, length = 100)
  lines(x+r*cos(a), y+r*sin(a))
}
pos <- seq(from = 0, to = 1, length = 6)
par(mar = rep(0.5, 4))
plot(pos, -pos, asp = 1, type = "n", xaxt="n",yaxt="n",
  xlim = range(pos-1/12), ylim = range(-pos+1/12), bty ="n")
for( i in pos[-c(1,6)])
  for( j in -pos[-c(1,6)])
    if( i != -j )
    {
      circle(i, j, r)
    }
bases <- c("A", "T", "C", "G")
text(x = pos[-c(1,6)], y = -(pos[1]+pos[2])/2, labels = bases)
text(x = (pos[3]+pos[4])/2, y = -pos[1], label = "From")
text(y = -pos[-c(1,6)], x = (pos[1]+pos[2])/2, labels = bases)
text(y = -(pos[3]+pos[4])/2, x = pos[1], label = "To")
points(pos[-c(1,6)], -pos[-c(1,6)], pch = "-")

text(pos[3],-pos[2],"4.4")
text(pos[4],-pos[2],"6.5")
text(pos[5],-pos[2],"20.7")

text(pos[2],-pos[3],"4.7")
text(pos[4],-pos[3],"21")
text(pos[5],-pos[3],"7.2")

text(pos[2],-pos[4],"5.0")
text(pos[3],-pos[4],"8.2")
text(pos[5],-pos[4],"5.3")

text(pos[2],-pos[5],"9.4")
text(pos[3],-pos[5],"3.3")
text(pos[4],-pos[5],"4.2")

d <- r*sqrt(2)/2
segments(x0 = pos[3]-d, y0 = -pos[2]-d, x1 = pos[2]+d, y1 = -pos[3]+d)
segments(x0 = pos[5]-d, y0 = -pos[2]-d, x1 = pos[4]+d, y1 = -pos[3]+d)
segments(x0 = pos[3]-d, y0 = -pos[4]-d, x1 = pos[2]+d, y1 = -pos[5]+d)
segments(x0 = pos[5]-d, y0 = -pos[4]-d, x1 = pos[4]+d, y1 = -pos[5]+d)
segments(x0 = pos[4]+d, y0 = -pos[2]-d, x1 = pos[5]-d, y1 = -pos[3]+d)
segments(x0 = pos[2]+d, y0 = -pos[4]-d, x1 = pos[3]-d, y1 = -pos[5]+d)
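
As a numerical companion to the figure, the sketch below lists each substitution rate next to its complementary rate, that is, the two values joined by a segment. The matrix is transcribed from the text() calls above (rows are "To", columns are "From"); the object names li84, comp and pr1 are introduced here for illustration only.

```r
bases <- c("A", "T", "C", "G")
# Values transcribed from Figure 1 (Li et al. 1984): rows = "To", columns = "From"
li84 <- matrix(c(  NA,  4.4,  6.5, 20.7,
                  4.7,   NA, 21.0,  7.2,
                  5.0,  8.2,   NA,  5.3,
                  9.4,  3.3,  4.2,   NA),
               nrow = 4, byrow = TRUE, dimnames = list(to = bases, from = bases))
comp <- c(A = "T", T = "A", C = "G", G = "C")  # Watson-Crick complement

# All 12 substitutions, each paired with the complementary substitution
pr1 <- expand.grid(from = bases, to = bases, stringsAsFactors = FALSE)
pr1 <- pr1[pr1$from != pr1$to, ]
pr1$rate <- li84[cbind(pr1$to, pr1$from)]
pr1$partner <- paste0(unname(comp[pr1$from]), "->", unname(comp[pr1$to]))
pr1$partner_rate <- li84[cbind(unname(comp[pr1$to]), unname(comp[pr1$from]))]

# Keep one row per connected pair and compute the difference between the two rates
pr1 <- pr1[paste0(pr1$from, "->", pr1$to) < pr1$partner, ]
pr1$diff <- pr1$rate - pr1$partner_rate
print(pr1, row.names = FALSE)
```

Under PR1 the diff column would be zero; with estimated rates it gives a rough idea of how far each connected pair departs from equality.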

Since then, the estimation of substitution rates from pseudogene data has been considerably extended. The values just below are based on the work of Ron Ophir and are taken from Table 4.5 in the second edition of Graur and Li. The advantages, compared with the previous estimate, are that it is based on a much larger dataset (105 pseudogenes) and on a single taxon (human), so that taxon-specific patterns are not amalgamated.

#
# Figure 1 with Ron Ophir 2000 data
#
r <- 0.05
circle <- function(x, y, r)
{
  a <- seq(from = 0, to = 2*pi, length = 100)
  lines(x+r*cos(a), y+r*sin(a))
}
pos <- seq(from = 0, to = 1, length = 6)
par(mar = rep(0.5, 4))
plot(pos, -pos, asp = 1, type = "n", xaxt="n",yaxt="n",
  xlim = range(pos-1/12), ylim = range(-pos+1/12), bty ="n")
for( i in pos[-c(1,6)])
  for( j in -pos[-c(1,6)])
    if( i != -j )
    {
      circle(i, j, r)
    }
bases <- c("A", "T", "C", "G")
text(x = pos[-c(1,6)], y = -(pos[1]+pos[2])/2, labels = bases)
text(x = (pos[3]+pos[4])/2, y = -pos[1], label = "From")
text(y = -pos[-c(1,6)], x = (pos[1]+pos[2])/2, labels = bases)
text(y = -(pos[3]+pos[4])/2, x = pos[1], label = "To")
points(pos[-c(1,6)], -pos[-c(1,6)], pch = "-")

text(pos[3],-pos[2],"3.3")
text(pos[4],-pos[2],"4.2")
text(pos[5],-pos[2],"20.4")

text(pos[2],-pos[3],"3.4")
text(pos[4],-pos[3],"20.7")
text(pos[5],-pos[3],"4.4")

text(pos[2],-pos[4],"4.5")
text(pos[3],-pos[4],"13.8")
text(pos[5],-pos[4],"4.9")

text(pos[2],-pos[5],"12.5")
text(pos[3],-pos[5],"3.3")
text(pos[4],-pos[5],"4.6")
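
The asymptotic property stated in the abstract (A* = T* and G* = C*) can be illustrated with these values. The sketch below rests on two assumptions that are not taken from the paper: the percentages are treated as relative rates of a continuous-time substitution model, and PR1 is enforced exactly by averaging each rate with its complementary rate. The helper equilibrium() and the other object names are hypothetical.

```r
bases <- c("A", "T", "C", "G")
# Values transcribed from the updated Figure 1: rows = "To", columns = "From"
ophir <- matrix(c( 0.0,  3.3,  4.2, 20.4,
                   3.4,  0.0, 20.7,  4.4,
                   4.5, 13.8,  0.0,  4.9,
                  12.5,  3.3,  4.6,  0.0),
                nrow = 4, byrow = TRUE, dimnames = list(to = bases, from = bases))
comp <- c(A = "T", T = "A", C = "G", G = "C")

# Assumption: enforce PR1 by averaging each rate with its complementary rate
cb <- unname(comp[bases])
pr1 <- (ophir + ophir[cb, cb]) / 2

# Equilibrium base frequencies p, solving Q %*% p = 0 with sum(p) = 1
equilibrium <- function(rates) {
  Q <- rates - diag(unname(colSums(rates)))  # inflow minus total outflow per base
  M <- rbind(Q[-1, ], rep(1, 4))             # swap one redundant equation for sum(p) = 1
  p <- solve(M, c(0, 0, 0, 1))
  setNames(p, colnames(rates))
}

p_raw <- equilibrium(ophir)  # rates as estimated
p_pr1 <- equilibrium(pr1)    # rates constrained to PR1
round(rbind(raw = p_raw, PR1 = p_pr1), 3)
```

With the PR1-constrained rates the equilibrium frequencies of A and T, and of C and G, coincide up to numerical precision; with the raw estimates they are only approximately equal.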

2. Observed base composition in long DNA fragments

Figure 2 showed the base composition in long genomic fragments. At that time (15 June 1994) there were only 60 sequences longer than 50 Kbp available in GenBank release 83. For each panel, the slope, a, of the regression line y = ax + b, and the linear correlation coefficient, r, were given. The equalities between the total number of A and T and between the total number of C and G were striking. Note that these base counts are from single-stranded DNA: the observed equalities are not a trivial consequence of base-pairing rules in double-stranded DNA.

#
# Figure 2 with original data
#
atcg <- read.table("http://pbil.univ-lyon1.fr/members/lobry/repro/jme95/50kbgb83.txt", sep = "\t", header = TRUE)
attach(atcg)
opar <- par(no.readonly = TRUE)
par(mfrow = c(3, 3), mar = c(0,0,0,0) , omi = rep(0.6, 4))

my.plot <- function(x, y, kd, xlim = c(0, 100), ylim = c(0,100), 
           xaxs = "i", yaxs = "i", xlab = "", ylab = "", ...)
{
  data <- cbind(x,y)
  plot(data, type = "n", xlim = xlim, ylim = ylim, xaxs = xaxs, yaxs = yaxs, 
      xaxt = "n", yaxt = "n", ...)
  abline(coef=lm(y~x)$coef)
  text(x = 50, y = 90, paste("a =", round(lm(y~x)$coef[2], 3), "r =", 
        round(cor(x,y),3)), cex = 1.2)
  points(data[kd=="virus",], pch = 22, cex = 2)
  points(data[kd=="procaryota",], pch = 15, cex = 2)
  points(data[kd=="nematod",], pch = 1, cex = 2)
  points(data[kd=="chloroplast",], pch = 16, cex = 2)
  points(data[kd=="insecta",], pch = 5, cex = 2)
  points(data[kd=="vertebrata",], pch = 23, cex = 2, bg = "black")
  points(data[kd=="mito",], pch = 6, cex = 2)
  points(data[kd=="fungi",], pch = 25, cex = 2, bg = "black")
  tcl <- 0.4
  axis(side = 1, at = 10*(0:10), tcl = tcl, label = FALSE)
  axis(side = 2, at = 10*(0:10), tcl = tcl, label = FALSE)
  axis(side = 3, at = 10*(0:10), tcl = tcl, label = FALSE)
  axis(side = 4, at = 10*(0:10), tcl = tcl, label = FALSE)
}

titext <- function()
{
  plot.window(xlim = c(0,100), ylim = c(0,100), xaxs = "i", yaxs = "i")
  text(x = rep(0,3), y = c(5, 50, 95), labels = c("0","50","100"), pos = 4)
  text(x = c(45, 85), y = c(5, 5), labels = c("50","100"), pos = 4)
}

my.plot(A..kb., C..kb., kingdom, ylab = "C (kb)")
plot.new()
titext()
plot.new()
plot.window(xlim = c(0,1), ylim = c(0,1))
legend(0.1, 1, legend = c("virus", "procaryota", "nematoda", "chloroplast",
    "insecta", "vertebrata", "mitochondrion", "fungi"),
    pch = c(22, 15, 1, 16, 5, 23, 6, 25), pt.bg = c("white","black"), 
    cex = 1.2, pt.cex = 2)
my.plot(A..kb., G..kb., kingdom, ylab = "G (kb)")
my.plot(C..kb., G..kb., kingdom)
plot.new()
titext()
my.plot(A..kb., T..kb., kingdom)
my.plot(C..kb., T..kb., kingdom)
my.plot(G..kb., T..kb., kingdom, xlab = "G kb")

mtext("C (kb)", side = 2, outer = TRUE, at = 0.85, line = 1, las = 1)
mtext("G (kb)", side = 2, outer = TRUE, at = 0.5, line = 1, las = 1)
mtext("T (kb)", side = 2, outer = TRUE, at = 0.15, line = 1, las = 1)
mtext("A (kb)", side = 1, outer = TRUE, at = 0.15, line = 1)
mtext("C (kb)", side = 1, outer = TRUE, at = 0.5, line = 1)
mtext("G (kb)", side = 1, outer = TRUE, at = 0.85, line = 1)
par(opar)
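
The remark that single-strand equalities are not a trivial consequence of base pairing can be made concrete with a toy example. The sequence below is randomly generated with a deliberately biased composition (it is not real data), and count_bases() is a helper introduced only for this sketch.

```r
set.seed(1)
comp <- c(a = "t", c = "g", g = "c", t = "a")
# Hypothetical single strand with a deliberately biased composition
strand <- sample(names(comp), 10000, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))
revcomp <- rev(unname(comp[strand]))  # the complementary strand, read 5' to 3'

count_bases <- function(s) table(factor(s, levels = names(comp)))
single <- count_bases(strand)            # counts on one strand only
double <- single + count_bases(revcomp)  # counts over both strands
rbind(single, double)
```

Counting over both strands gives A = T and C = G by construction, whereas nothing forces these equalities on the single strand.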

There are now many more data available in GenBank. In release 144, including daily updates up to 24-NOV-2004, I found 117,692 fragments longer than 50 Kbp. One interesting thing is that since August 2004 the maximum size of 350 Kbp for a GenBank entry has been relaxed. On the other hand, the "BASE COUNT" line is no longer present in GenBank flat files, so you can't just grep them to get the base counts. Anyway, if you want the base counts in these long fragments, I have computed them for you here. In the following, only 80,590 long sequences are represented because I have removed sequences with more than 1% of undetermined bases.

The slopes are very close to 1 for A vs T and G vs C, which means that we are very close to A = T and C = G (that is, to the PR2 state) on average. Not too bad: the data set is now 1000 times bigger and the prediction of the model still holds (the total number of bases in this dataset is 13,264,842,944). Because the distribution of the size of fragments is highly skewed, it is also interesting to look at a log-log representation. The outliers in red below correspond to synthetic fragments, that is, artificial sequences that are not the result of a long evolutionary history. They are interesting because they demonstrate that the PR2 state is not an intrinsic property of long ssDNA fragments: there is a priori no reason to have A = T and C = G in ssDNA.

#
# Figure 2 updated in log scale :
#
big50 <- read.table("http://pbil.univ-lyon1.fr/members/lobry/repro/jme95/big50.acgt", sep = "\t", header = TRUE)
# Remove sequences with more than 1% of ambiguous bases
big50 <- big50[ big50$O/big50$bp < 0.01, ]

synthetic <- big50$mnemo %in% readLines("http://pbil.univ-lyon1.fr/members/lobry/repro/jme95/synthetic.mne")

panel.corslope <- function(x, y, ...)
{
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  r <- cor(x, y)
  text(0.5, 0.4, paste("r =", round(r, 5)))
  a <- lm(y~x)$coef[2]
  text(0.5, 0.6, paste("a =", round(a, 5)))
}

panel.lm <- function(x, y, ...)
{
  points(x, y, ...)
  abline(coef = lm(y~x)$coef, col = "blue")
  points(x[synthetic], y[synthetic], col = "red", ...)
}

pairs(log10(big50[ , 3:6]), cex = 0.5, las = 1,
  main = paste("Base counts in", nrow(big50), "sequences (log10 scale)"),
  lower.panel = panel.lm, upper.panel = panel.corslope, 
  labels = c("log10(A bp)", "log10(C bp)", "log10(G bp)", "log10(T bp)"))
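
Since the "BASE COUNT" line is gone from the flat files, the counts have to be computed from the sequences themselves. The sketch below shows one way to do this and to apply the 1% filter on undetermined bases. The column names bp and O follow those used in the code above; base_counts() and the two toy sequences are made up for illustration and say nothing about how the big50.acgt file was actually produced.

```r
# Count A, C, G, T and "other" (O) symbols in one sequence given as a string
base_counts <- function(s) {
  chars <- strsplit(toupper(s), "")[[1]]
  n <- table(factor(chars, levels = c("A", "C", "G", "T")))
  data.frame(bp = length(chars),
             A = n[["A"]], C = n[["C"]], G = n[["G"]], T = n[["T"]],
             O = length(chars) - sum(n))  # anything that is not A, C, G or T
}

# Hypothetical toy sequences: the first one contains 20% of undetermined bases
seqs <- c(toy1 = "ACGTACGTNN",
          toy2 = paste(rep("ACGT", 50), collapse = ""))
counts <- do.call(rbind, lapply(seqs, base_counts))

# Same filter as above: drop sequences with 1% or more of ambiguous bases
kept <- counts[counts$O / counts$bp < 0.01, ]
kept
```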

3. Early GC-skew and AT-skew representations

I think this is the first skew diagram I have ever published:

#
# Figure 3
#
library(seqinr)
opar <- par(no.readonly = TRUE)
par(mar = c(2.1, 3.1, 0.4, 2.1))
eco <- read.fasta("http://pbil.univ-lyon1.fr/members/lobry/repro/jme95/ECO110K.fasta")[[1]]
wsize <- 1000
wstep <- 100
gc <- GC(eco)
x <- seq(from = 1, to = length(eco) - wsize, by = wstep)
data <- data.frame(matrix(nrow = length(x), ncol = 4))
names(data) <- c("a","t","c","g")
rownames(data) <- x
i <- 1
for( pos in x)
{
  data[i, "a"] <- sum(eco[pos:(pos+wsize-1)] == "a", na.rm = TRUE)
  data[i, "t"] <- sum(eco[pos:(pos+wsize-1)] == "t", na.rm = TRUE)
  data[i, "c"] <- sum(eco[pos:(pos+wsize-1)] == "c", na.rm = TRUE)
  data[i, "g"] <- sum(eco[pos:(pos+wsize-1)] == "g", na.rm = TRUE)
  i <- i + 1
}
par(mfrow=c(2,1), omi=c(0.5,0,0.2,0))
atskew <- 100*(data[,"a"]-data[,"t"])/(data[,"a"]+data[,"t"])
plot(x = x/1000, y = atskew, 
  type = "l", las = 1, xlab = "", ylab = "", ylim = c(-40,40),
  xlim = c(0,120), xaxs = "i", yaxs = "i", tcl = 0.5)
text(0, 33, "% (A-T)/(A+T)", pos = 4)
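# Expected A and T counts under A = T, for a window with the average GC content
W <- (1-gc)*wsize
nA <- W/2
nT <- W/2  # not called T, which would mask R's shorthand for TRUE
abline(h = 100*1.96*(2/W)*sqrt(nA*nT/W), lty = 2)
abline(h = -100*1.96*(2/W)*sqrt(nA*nT/W), lty = 2)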
abline(h = mean(atskew), col = "red")

gcskew <- 100*(data[,"c"]-data[,"g"])/(data[,"c"]+data[,"g"])
plot(x = x/1000, y = gcskew, 
  type = "l", las = 1, xlab = "", ylab = "", ylim = c(-25,20),
  xlim = c(0,120), xaxs = "i", yaxs = "i", tcl = 0.5)
text(0, 15, "% (C-G)/(C+G)", pos = 4)
# Expected C and G counts under C = G, for a window with the average GC content
S <- gc*wsize
nC <- S/2
nG <- S/2
abline(h = 100*1.96*(2/S)*sqrt(nC*nG/S), lty = 2)
abline(h = -100*1.96*(2/S)*sqrt(nC*nG/S), lty = 2)
abline(h = mean(gcskew), col = "red")

mtext("Position (Kbp)", side = 1, outer = TRUE, adj = 0.5)
par(opar)
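
The window counts in the loop above can also be obtained without an explicit loop, using cumulative sums. The sketch below uses a randomly generated toy sequence in place of eco so that it runs without the seqinr package or a network connection; seq_toy, starts and window_count() are names introduced here only.

```r
set.seed(42)
# Hypothetical toy sequence standing in for `eco`
seq_toy <- sample(c("a", "c", "g", "t"), 5000, replace = TRUE)
wsize <- 1000
wstep <- 100
starts <- seq(from = 1, to = length(seq_toy) - wsize, by = wstep)

# Number of occurrences of `base` in each window start:(start + wsize - 1)
window_count <- function(s, base) {
  cs <- c(0, cumsum(s == base))   # cs[k + 1] = occurrences in positions 1..k
  cs[starts + wsize] - cs[starts]
}

counts <- sapply(c(a = "a", t = "t", c = "c", g = "g"),
                 function(b) window_count(seq_toy, b))
atskew <- 100 * (counts[, "a"] - counts[, "t"]) / (counts[, "a"] + counts[, "t"])
gcskew <- 100 * (counts[, "c"] - counts[, "g"]) / (counts[, "c"] + counts[, "g"])
head(cbind(start = starts, counts, atskew, gcskew))
```

Each base is scanned once whatever the window overlap, which matters mostly for sequences much longer than the one used here.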

I had some problems reproducing this figure. First, I have lost the original sequence file (ECO110K). There is still an ECO110K entry in GenBank, but it was updated on 23-JAN-2004, so I am not sure that I am working with exactly the same data. The version used here is the 23-JAN-2004 version (D10483.2 GI:21321891). Second, there is a small difference in the location of the confidence lines, and I am unable to trace the source of the discrepancy: at that time I was not working with R but with commercial software that is no longer maintained, so I cannot repeat what I did to produce the original figure. I'm afraid that this is just one more example of what Jonathan Buckheit and David Donoho called à la recherche des paramètres perdus in their famous paper about reproducible research. In the present version of the figure the confidence lines are based on the formulas given here, and the small simulation just below should convince you that they are exact. Last, but not least, the horizontal red lines at the mean value were not drawn in the original figure.

The fragment ECO110K is located just to the right of the origin of replication of the chromosome (0.0 - 2.4 min region). We now know that this region, corresponding to the leading strand for replication, is systematically enriched in G over C. I didn't notice it at that time...
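
The confidence lines of Figure 3 can be checked with a few lines of R. What follows is only a sketch, not necessarily the simulation originally linked from this page. It assumes that the number of A + T positions in a window is a fixed integer W (the value 500 is arbitrary) and that, under A = T, each of these positions is independently A or T with probability 1/2. The skew then has standard deviation 100/sqrt(W), which is what the expression used for the dashed lines reduces to before the factor 1.96 is applied.

```r
set.seed(123)
W <- 500       # hypothetical number of A + T bases in a window (integer for rbinom)
nsim <- 10000  # number of simulated windows

# Under A = T, the number of A among the W positions is Binomial(W, 1/2)
nA <- rbinom(nsim, size = W, prob = 0.5)
nT <- W - nA
atskew <- 100 * (nA - nT) / (nA + nT)

# Same bound as in the Figure 3 code, with expected counts W/2 and W/2
bound <- 100 * 1.96 * (2 / W) * sqrt((W / 2) * (W / 2) / W)
coverage <- mean(abs(atskew) <= bound)  # should be close to 0.95

c(bound = bound, sd_simulated = sd(atskew),
  sd_formula = 100 / sqrt(W), coverage = coverage)
```

The simulated standard deviation and the fraction of windows inside the bounds can be compared with 100/sqrt(W) and 95%; small departures are expected from sampling noise and from the discreteness of the counts.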

4. Base counts in coding sequences

Figure 4 showed that PR2 state was also found when considering only the sense strand of coding sequences. For an update see there.

#
# Figure 4 with original data
# (reuses titext() defined in the Figure 2 code above)
#
atcg <- read.table("http://pbil.univ-lyon1.fr/members/lobry/repro/jme95/50kbgb83.txt", sep = "\t", header = TRUE)
atcg <- atcg[ atcg$CDS.Bp > 50000, ] # keep only entries with more than 50 kb of coding sequences
attach(atcg)
opar <- par(no.readonly = TRUE)
par(mfrow = c(3, 3), mar = c(0,0,0,0) , omi = rep(0.6, 4))

my.plot <- function(x, y, kd, xlim = c(0, 100), ylim = c(0,100), 
           xaxs = "i", yaxs = "i", xlab = "", ylab = "", ...)
{
  data <- cbind(x,y)
  plot(data, type = "n", xlim = xlim, ylim = ylim, xaxs = xaxs, yaxs = yaxs, 
      xaxt = "n", yaxt = "n", ...)
  abline(coef=lm(y~x)$coef)
  text(x = 50, y = 90, paste("a =", round(lm(y~x)$coef[2], 3), "r =", 
        round(cor(x,y),3)), cex = 1.2)
  points(data[kd=="virus",], pch = 22, cex = 2)
  points(data[kd=="procaryota",], pch = 15, cex = 2)
  #points(data[kd=="nematod",], pch = 1, cex = 2)
  points(data[kd=="chloroplast",], pch = 16, cex = 2)
  #points(data[kd=="insecta",], pch = 5, cex = 2)
  #points(data[kd=="vertebrata",], pch = 23, cex = 2, bg = "black")
  points(data[kd=="mito",], pch = 6, cex = 2)
  points(data[kd=="fungi", 1], data[kd=="fungi", 2], pch = 25, cex = 2, bg = "black")

  tcl <- 0.4
  axis(side = 1, at = 10*(0:10), tcl = tcl, label = FALSE)
  axis(side = 2, at = 10*(0:10), tcl = tcl, label = FALSE)
  axis(side = 3, at = 10*(0:10), tcl = tcl, label = FALSE)
  axis(side = 4, at = 10*(0:10), tcl = tcl, label = FALSE)
}

my.plot(CDS.A..kb., CDS.C..kb., kingdom, ylab = "C (kb)")
plot.new()
titext()
plot.new()
plot.window(xlim = c(0,1), ylim = c(0,1))
legend(0.1, 1, legend = c("virus", "procaryota", "chloroplast",
    "mitochondrion", "fungi"),
    pch = c(22, 15, 16, 6, 25), 
    pt.bg = c("white","black", "black", "white","black"), 
    cex = 1.2, pt.cex = 2)
my.plot(CDS.A..kb., CDS.G..kb., kingdom, ylab = "G (kb)")
my.plot(CDS.C..kb., CDS.G..kb., kingdom)
plot.new()
titext()
my.plot(CDS.A..kb., CDS.T..kb., kingdom)
my.plot(CDS.C..kb., CDS.T..kb., kingdom)
my.plot(CDS.G..kb., CDS.T..kb., kingdom, xlab = "G kb")

par(opar)
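
Counting on the sense strand means that a coding sequence annotated on the complementary strand is reverse-complemented before its bases are counted. The sketch below illustrates this with a made-up 26-base "genome" and two made-up annotations; it is not the procedure used to build the CDS columns of the data file, and genome, cds and sense_strand() are hypothetical names.

```r
comp <- c(a = "t", c = "g", g = "c", t = "a")
# Hypothetical toy sequence: the same 12-base coding sequence on each strand
genome <- strsplit("atgaaacccgggttcccgggtttcat", "")[[1]]
cds <- data.frame(start = c(1, 15), end = c(12, 26), strand = c("+", "-"),
                  stringsAsFactors = FALSE)

# Return a coding sequence as read on its sense strand
sense_strand <- function(start, end, strand) {
  s <- genome[start:end]
  if (strand == "-") s <- rev(unname(comp[s]))  # reverse complement
  s
}

sense <- unlist(Map(sense_strand, cds$start, cds$end, cds$strand))
counts <- table(factor(sense, levels = names(comp)))
counts
```

Counted on the published strand only, the two toy genes would compensate each other's composition; counted on the sense strand they do not, which is why Figure 4 is a separate check.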

If you have any problems or comments, please contact Jean Lobry.