Introduction
Parameter notation
Computing DN, DS and dN/dS
Branch length
Average dN/dS
Results
- Cumulated dN/dS (CM) vs. weighted mean (WM)
- Branch-normalized cumulated mean dN/dS (BNCM) vs weighted mean (WM)
Conclusion

Introduction

To compare dN/dS across species, we prepared a dataset of 862 orthologous genes, from 128 species. For each set of orthologous genes, we estimated global model parameters with bppml, and then mapped substitution with mapNH along each branch of the species tree. We then recorded counts along terminal branches to compute dN/dS for the corresponding species.

Thus, for each species, we have 862 genes on which we can estimate dN/dS based on the raw counts provided by MapNH. There are two ways to compute the average value of dN/dS over each set of genes:

weighted means: compute dN/dS on each gene, and then compute the average, over all genes, weighted by their length.
cumulated means: compute the sums of counts of substitutions (syn or non-syn) and substitution opportunities across all genes to compute ‘global’ dN and dS, from which a ‘global’ dN/dS can be computed.

Here, we compare these two ways to compute the averages.

Parameter notation

For each branch of each gene tree, we have the following parameters (provided by MapNH):

K_S: count of syn substitutions (per gene per branch)
K_N: count of non-syn substitutions (per gene per branch)
O_S: count of syn substitutions ‘opportunities’ (per gene per branch)
O_N: count of non-syn substitutions ‘opportunities’ (per gene per branch)

I also computed O=O_S+O_N:

O: total number of substitutions ‘opportunities’ (per gene per branch)

Computing DN, DS and dN/dS

For each gene, we computed DN, DS and dN/DS for each terminal branch (species):

D$DS=D$K_S/D$O_S
D$DN=D$K_N/D$O_N

D$dNdS=D$DN/D$DS

NB: here the definition of DS and DN (expressed as number of subst. per ‘opportunity’) differs from the classical definition of dS and dN (expressed as number of subst. per site). Roughly, assuming that there are 1 syn and 2 non-syn sites per codon, this gives dS=K_S/L (where L is the number of codons) and dN=K_N/(2L).

Branch length

The length of a branch (B) in a given gene tree is defined here as the number of substitution opportunities (syn or non-syn) in that gene during this period of time. To account for variation in gene length, B is expressed in number of substitution opportunities per site.

B=(O_S+O_N)/sequence_length

D$O=D$O_S+D$O_N
D$B=D$O/D$sequence_length

In a given branch of the species tree, if all genes were subject to the same mutation rate, one would a priori expect that they should all have the branch length (B). In reality, we observe variation in B, which can be substantial in some species. Here are a few examples:

In fact, B depends on two parameters:

the duration time of the branch in the species tree (T)
the mutation rate in that gene (U)

For a given branch of the species tree, the duration time (T) is expected to be approximately the same for all genes (although introgressions and variation in coalescent time among genes may cause some variation in T, in particular for short branches; and conversely, hidden paralogy can lead to much longer branches).

The mutation rate (U) may vary along chromosomes (notably depending on the G+C content), and hence may also cause variation in B among genes.

Some variation in B may also be due to the model used to estimate these parameters. Indeed, the number of substitution opportunities is estimated based on the assumption that synonymous substitutions are neutral (Laurent G: correct??). But in some species, synonymous sites are under selection, and this selective pressure varies among genes. This may therefore induce variation in the estimates of B among genes.

Finally, for short genes in short branches, it is not possible to estimate B accurately. The first mode of the bimodal distributions observed for Homo sapiens and Equus caballus, correspond to genes for which (O_S+O_N)~0 (in bppml, branch length cannot be 0; so the minimal branch length is set to 1e-12):

In any case, whatever the cause of this variation in B, in a given branch of the species tree, the number of substitution opportunities per site differs among genes.

In some species, like Aedes aegypti or Mus musculus, the variation in B is relatively limited (10 to 30 fold). But in others, like Homo sapiens or Equus caballus, there is a 1e5 to 1e6 fold variation in B.

Average dN/dS

I computed the average dN/dS of each species using three different approaches:

weighted mean (WM)
cumulated mean (CM)
branch-normalized cumulated mean (BNCM)

Average dN/dS (1): weighted mean (WM)

Formula to compute the average dN/dS over all genes, weighted by gene length:

# Compute weighted mean of dN/dS:
get_wm_dNdS<-function(D) {
  wmean_dNdS=sum(D$dNdS*D$sequence_length)/sum(D$sequence_length)
  return(wmean_dNdS)
}

Average dN/dS (2): cumulated mean (CM)

The first way to compute the cumulated dN/dS over all genes, consists simply in summing counts:

# Compute cumulated mean of dN/dS (v1): simple count sums
get_cum_dNdS_v1<-function(D) {
  # Compute the cumulated number of substitutions over all genes
  cum_KS=sum(D$K_S)
  cum_KN=sum(D$K_N)
  cum_OS=sum(D$O_S)
  cum_ON=sum(D$O_N)
  
  # Compute cumulated DN, DS 
  cum_dS=cum_KS/cum_OS
  cum_dN=cum_KN/cum_ON
  
  # Compute cumulated dN/dS
  cum_dNdS=cum_dN/cum_dS
 
  return(cum_dNdS)
}

Average dN/dS (3): branch-normalized cumulated mean (BNCM)

With the weighted mean (WM), all genes contribute equally to the global dN/dS, whatever their branch length. But in the cumulated counts, genes with a short branches (B) contribute little to the global sums. Typically, consider a human gene, H1, with B=1e-6 opportunities per site, and another one H2, with B=1e-1 opportunities per site. H1 and H2 contribute equally to the WM dN/dS; but H2 contributes 1e5 times more than H1 to the CM dN/dS.

Hence, the cumulated mean dN/dS (CM) can be quite different from the weighted mean dN/dS (WM).

To avoid this effect, another way to compute the cumulated dN/dS over all genes, consists in summing counts, normalized so that all genes have the same branch lengths (B):

# Compute cumulated mean of dN/dS (v4): counts normalized by the mean number of subst. opportunities (O_S+O_N) per site

get_cum_dNdS_v4<-function(D) {
  # Number of subst. opportunities (O_S+O_N) per bp
  mean_B=mean(D$B)
  # Compute the cumulated number of substitutions over all genes (normalized by mean_B)
  cum_KS=sum(D$K_S*mean_B/D$B)
  cum_KN=sum(D$K_N*mean_B/D$B)
  cum_OS=sum(D$O_S*mean_B/D$B)
  cum_ON=sum(D$O_N*mean_B/D$B)
  
  # Compute cumulated DN, DS and dN/dS
  cum_dS=cum_KS/cum_OS
  cum_dN=cum_KN/cum_ON
  
  cum_dNdS=cum_dN/cum_dS
 
  return(cum_dNdS)
}

Results

For each of the 128 species, we computed dN/dS averaged over the 862 genes, based on the weighted mean, or using the different ways to compute the cumulated mean.

Cumulated dN/dS (CM) vs. weighted mean (WM)

The weighted mean dN/dS and the cumulated dN/dS (CM) are poorly correlated. This is because in the weighted mean, all genes contribute equally, whatever their branch length. Whereas in the cumulated mean, genes with a long branch contribute more to the cumulated counts. Hence, cumulated counts leads to higher dN/dS values:

Branch-normalized cumulated mean dN/dS (BNCM) vs weighted mean (WM)

Using the number of substitution opportunities (O_N+O_S) per site to normalize by branch length, we obtain a very good correlation between the weighted mean and the branch-normalized cumulated mean (BNCM):

The two estimate are very well correlated, but overall, the branch-normalized cumulated mean (BNCM) is 8% lower than the weighted mean (probably because BNCM gives more weight to genes with small B, which may tend to have smaller dN/dS.

Conclusion

For the calculation of dN/dS, the cumulated mean (CM) gives less weight to genes that have a short branch, whereas the weighted mean gives the same weight to all genes. Given the variation in B among gene, these two estimate can be quite different.

The branch-normalized cumulated dN/dS (BNCM), is in good agreement between with the weighted mean dN/dS (but with small systematic deviations).

The CM version is probably not a good option because it leads to essentially ignore all genes with a very small B. The reasons for the strong variation in B observed in real data are not totally clear. But there is a priori no reason why genes with small B should be ignored from the estimate of the genome-wide average dN/dS.

The BNCM version is better because it takes all genes into account (except those for which B=0, and hence the branch-normalized counts are undefined (division by 0)).

Thus, in the end CM should be avoided, whereas BNCM and WM give very similar results. Given that the protocol to compute WM is much simpler to explain than that of BNCM, I suggest that we simply use WM.

dN/dS MapNH: cumulated mean or weighted mean??