To compare dN/dS across species, we prepared a dataset of 862 orthologous genes, from 128 species. For each set of orthologous genes, we estimated model parameters (and notably branch lengths) with bppml, and then mapped substitution with mapNH along each branch of the species tree, and we recorded counts along terminal branches to compute dN/dS for the corresponding species.
Thus, for each species, we have 862 genes on which we can estimate dN/dS based on the raw counts provided by MapNH. There are two ways to compute the average value of dN/dS over each set of genes:
Here, we compare these two ways to compute the averages.
For each branch of each gene tree, we have the following parameters:
# Compute DN, DS and dN/DS on each gene along each terminal branch (species):
D$DS=D$K_S/D$O_S
D$DN=D$K_N/D$O_N
D$dNdS=D$DN/D$DS
NB: here the definition of DS and DN differs from the classical definition of dS and dN (expressed as number of subst. per site). Roughly, assuming that there are 1 syn and 2 non-syn sites per codon, this gives dS=K_S/L (where L is the number of codons) and dN=K_N/(2L).
Formula to compute the average dN/dS over all genes, weighted by gene length:
# Compute weighted mean of dN/dS:
get_wm_dNdS<-function(D) {
wmean_dNdS=sum(D$dNdS*D$sequence_length)/sum(D$sequence_length)
return(wmean_dNdS)
}
The first way to compute the cumulated dN/dS over all genes, consists simply in summing counts:
# Compute cumulated mean of dN/dS (v1): simple count sums
get_cum_dNdS_v1<-function(D) {
# Compute the cumulated number of substitutions over all genes
cum_KS=sum(D$K_S)
cum_KN=sum(D$K_N)
cum_OS=sum(D$O_S)
cum_ON=sum(D$O_N)
# Compute cumulated dN, dS and dN/dS
cum_dS=cum_KS/cum_OS
cum_dN=cum_KN/cum_ON
cum_dNdS=cum_dN/cum_dS
return(cum_dNdS)
}
For a given branch of the species tree, the length of the corresponding branch in the gene tree (branch_size) varies according to genes. For instance here is the distribution of the length of the human branch (estimated by bppml) across the 862 genes:
It should be noted that in some cases, bppml sets the branch length to 1e-12.
With the weighted mean, all genes contribute equally to the global dN/dS, whatever their branch length. But in the cumulated counts, genes with a long branch contribute more to the global sums. Hence, the cumulated dN/dS (v1) can be quite different from the weighted mean dN/dS.
Thus, the 2nd way to compute the cumulated dN/dS over all genes, consists in summing counts, after having normalized branch lengths:
# Compute cumulated mean of dN/dS (v2): counts normalized by branch_size
get_cum_dNdS_v2<-function(D) {
mean_B=mean(D$branch_size)
# Compute the cumulated number of substitutions over all genes, normalized by branch_size
cum_KS=sum(D$K_S*mean_B/D$branch_size)
cum_KN=sum(D$K_N*mean_B/D$branch_size)
cum_OS=sum(D$O_S*mean_B/D$branch_size)
cum_ON=sum(D$O_N*mean_B/D$branch_size)
# Compute cumulated dN, dS and dN/dS
cum_dS=cum_KS/cum_OS
cum_dN=cum_KN/cum_ON
cum_dNdS=cum_dN/cum_dS
return(cum_dNdS)
}
One issue is that some genes have a null branch length (for which bppml reports branch_size=1e-12). At the normalization step of the previous option (v2), these few genes are given a huge weight, and hence totally dominate the global estimate of the cumulated dN/dS. To avoid this bias, I re-set the weight of these genes by assigning them the mean branch length:
# Compute cumulated mean of dN/dS (v3): counts normalized by branch_size (after re-setting branches with a null length (1e-12))
get_cum_dNdS_v3<-function(D) {
# re-set length of branches with a null length (branch_size==1e-12)
mean_B=mean(D$branch_size)
D$branch_size[which(D$branch_size==1e-12)]=mean_B
# Compute the cumulated number of substitutions over all genes (normalized by branch_size)
cum_KS=sum(D$K_S*mean_B/D$branch_size)
cum_KN=sum(D$K_N*mean_B/D$branch_size)
cum_OS=sum(D$O_S*mean_B/D$branch_size)
cum_ON=sum(D$O_N*mean_B/D$branch_size)
# Compute cumulated dN, dS and dN/dS
cum_dS=cum_KS/cum_OS
cum_dN=cum_KN/cum_ON
cum_dNdS=cum_dN/cum_dS
return(cum_dNdS)
}
Another option consists in computing the branch length in terms of number of substitution opportunities per site:
ONS_per_bp = (O_S + O_N)/sequence_length
And then normalize counts by the mean number of substitution opportunities per site:
# Compute cumulated mean of dN/dS (v4): counts normalized by the mean number of subst. opportunities (O_S+O_N) per site
get_cum_dNdS_v4<-function(D) {
# Number of subst. opportunities (O_S+O_N) per bp
D$ONS_per_bp=(D$O_S+D$O_N)/D$sequence_length
mean_ONS=mean(D$ONS_per_bp)
# Compute the cumulated number of substitutions over all genes (normalized by mean_ONS)
cum_KS=sum(D$K_S*mean_ONS/D$ONS_per_bp)
cum_KN=sum(D$K_N*mean_ONS/D$ONS_per_bp)
cum_OS=sum(D$O_S*mean_ONS/D$ONS_per_bp)
cum_ON=sum(D$O_N*mean_ONS/D$ONS_per_bp)
# Compute cumulated dN, dS and dN/dS
cum_dS=cum_KS/cum_OS
cum_dN=cum_KN/cum_ON
cum_dNdS=cum_dN/cum_dS
return(cum_dNdS)
}
For each of the 128 species, we computed dN/dS averaged over the 862 genes, based on the weighted mean, or using the different ways to compute the cumulated mean.
The weighted mean dN/dS and the cumulated dN/dS (v1) are poorly correlated. This is because in the weighted mean, all genes contribute equally, whatever their rate of evolution. Whereas in the cumulated mean, genes with a long branch contribute more to the cumulated counts. Hence, cumularted counts leads to higher dN/dS values:
After normalizing cumulated counts by branch length, the correlation is strongly improved (black points), except for species for which there is at least one gene with branch_size=1e-12 (color points):
After normalizing by branch length, and re-setting the weight of genes with branch_size=1e-12, we obtain a good correlation between the weighted mean dN/dS and the cumulated dN/dS (v3) :
The two estimate are well correlated, but overall, the cumulated mean (v3) is 8% higher than the weighted mean.
Using the number of substitution opportunities (O_N+O_S) per site to normalize by branch length, we obtain a very good correlation between the weighted mean and the cumulated mean (v4):
The two estimate are very well correlated, but overall, the cumulated mean (v4) is 8% lower than the weighted mean.
For the calculation of dN/dS, the cumulated mean (v1) gives less weight to genes that have a short branch, whereas the weighted mean gives the same weight to all genes, and hence these two estimate are quite different.
The normalization by branch length, estimated by bppml (v3) or based on the number of substitution opportunities per site (v4), leads to a good agreement between the two estimates (but with small systematic deviations).