Introduction: impact of sampling variance on the observed number of substitutions

We want to analyze variation in the genome-wide dN/dS across metazoan taxa. For this, we have a set of orthologous genes (~900 BUSCO genes) from a set of species (~200 to 800 species). For each alignment of orthologous genes, we estimated model parameters with bppml, and then mapped substitutions with mapNH along each branch of the species tree. We then recorded counts along terminal branches to compute dN/dS for the corresponding species (j). We thus obtain the following counts for each gene (i) and species (j):

  - K_Nij: observed number of non-synonymous substitutions
  - K_Sij: observed number of synonymous substitutions
  - O_Nij: number of non-synonymous substitution opportunities
  - O_Sij: number of synonymous substitution opportunities

Based on these counts, there are two ways to compute the average value of dN/dS over the whole set of genes:

  1. per-gene mean (PGM): compute dN/dS on each gene, and then compute the average over all genes.
dNij = K_Nij / O_Nij
dSij = K_Sij / O_Sij
dNdSij = dNij / dSij
PGMj = mean_i(dNdSij)
  2. cumulated mean (CM): compute the sums of counts of substitutions (syn or non-syn) and of substitution opportunities across all genes to obtain a ‘cumulated’ dN and dS, from which a ‘cumulated’ dN/dS can be computed.
dNj = sum_i(K_Nij) / sum_i(O_Nij)
dSj = sum_i(K_Sij) / sum_i(O_Sij)
CMj = dNj / dSj

In general, both approaches are expected to give very similar results. However, when branches are short (i.e. when the expected number of synonymous substitutions is low), the PGM method tends to overestimate dN/dS, because the observed number of synonymous substitutions (and hence dS) is strongly affected by sampling variance.
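The direction of this bias follows from Jensen's inequality: the per-gene mean averages ratios, and for a noisy positive count X, E[1/X] > 1/E[X]. A minimal numerical sketch (the parameter values, 400 codons and a 1% neutral rate, match the short-branch scenario used throughout; the seed is illustrative):

```r
set.seed(1)  # illustrative seed

# Noisy per-gene syn substitution counts (expectation ~4)
Ks <- rbinom(n = 1e5, size = 400, prob = 1e-2)
Ks <- Ks[Ks > 0]  # genes with zero syn substitutions must be discarded

mean(1 / Ks)  # what enters the per-gene mean: ~0.33
1 / mean(Ks)  # what enters the cumulated mean: ~0.25
```

The average of the reciprocals is markedly larger than the reciprocal of the average, so any quantity divided by a noisy dS is inflated when averaged per gene.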

Typically, the average length of BUSCO proteins is ~400 aa. Thus, if we consider a substitution rate of 1% at neutral sites (which roughly corresponds to the branch length since the human/gorilla split), the expected number of synonymous substitutions per gene is ~4. For such a low count, the sampling variance is expected to be quite strong:

Ngene=900
NbCodons=400
U=1e-2

# Randomly draw synonymous substitutions for each gene (each codon has a probability U of
# being subject to a syn substitution)
NbSynSubst=rbinom(n=Ngene, size=NbCodons, prob=U)

For longer proteins (say 4000 aa), the expected number of synonymous substitutions is higher, and hence there is much less variance in the observed dS:

Ngene=900
NbCodons=4000
U=1e-2

# Randomly draw synonymous substitutions for each gene (each codon has a probability U of
# being subject to a syn substitution)
NbSynSubst=rbinom(n=Ngene, size=NbCodons, prob=U)
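The contrast between the two gene lengths can be quantified with the coefficient of variation (sd/mean) of the simulated counts; `cv` below is my own small helper, and the values in the comments are approximate:

```r
set.seed(42)  # illustrative seed
cv <- function(x) sd(x) / mean(x)  # coefficient of variation

shortGenes <- rbinom(n = 900, size = 400,  prob = 1e-2)  # ~4 syn subst expected
longGenes  <- rbinom(n = 900, size = 4000, prob = 1e-2)  # ~40 syn subst expected

cv(shortGenes)  # ~0.5: observed dS varies strongly across genes
cv(longGenes)   # ~0.16: about sqrt(10) times smaller
```

As expected for binomial counts, the relative dispersion shrinks with the square root of the expected count.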

Simulation

Both dN and dS can be affected by this sampling variance. To illustrate how this can bias the PGM estimate of dN/dS, I considered a simple model: all mutations affecting the third codon position are considered as synonymous (and neutral), whereas all mutations at the 1st and 2nd codon positions are non-synonymous. The fixation probability of non-synonymous mutations, relative to that of synonymous mutations, is given by omega (with omega < 1).

Thus, for a given gene in a given species, I randomly draw synonymous substitutions (with a probability per codon given by the mutation rate along the terminal branch of that species: U), and I randomly draw non-synonymous mutations. Each non-synonymous mutation can then get fixed with a probability given by omega. Finally, I record the observed number of synonymous and non-synonymous substitutions in that gene.

I repeat this process on a set of 900 genes (with some distribution of omega values), and at the end I compute the average dN/dS on the whole gene set, with the per-gene mean (PGM) and cumulated mean (CM) methods:

# L: protein length (number of codons)
# U: mutation rate (per site) along the branch (i.e. µ x time) = neutral substitution rate
# Ngene: number of genes in the gene set
# omega = ratio (fixation probability non-syn mutation)/(fixation probability syn mutation)

run_sim1 <- function(L=400, U=1e-2, Ngene=900) {

# Draw values of omega for each gene (log-uniform distribution from 0.01 to ~0.32)
omega = 10^(runif(Ngene, min=-2, max=-0.5))

# Draw the number of syn substitutions (1 mutational opportunity per codon) for each gene.
# Given that syn mutations are neutral, the syn substitution rate equals the mutation rate:
Ks = rbinom(n=Ngene, size=L, prob=U)

# Draw the number of non-syn mutations (2 mutational opportunities per codon) for each gene.
muN = rbinom(n=Ngene, size=L*2, prob=U)

# Draw the number of non-syn substitutions per gene given the number of non-syn mutations
# and their fixation probability (omega)
Kn = mapply(FUN=rbinom, size=muN, prob=omega, MoreArgs=list(n=1))

# Compute the 'realized' dN and dS (number of substitutions per site) and the dN/dS
# ratio for each gene
dN = Kn/(2*L)
dS = Ks/L
dNdS = dN/dS

# Flag genes for which dN/dS cannot be computed (because dS=0), and count them
dNdS[dS == 0] = NA
NbGenes_dNdS_NA = sum(is.na(dNdS))

# Compute the mean per-gene dN/dS (on genes for which dN/dS could be computed)
Avg_dNdS = mean(dNdS, na.rm=TRUE)

# Compute cumulated dN and dS
cumdN = sum(Kn)/(2*L*Ngene)
cumdS = sum(Ks)/(L*Ngene)

# Compute the cumulated dN/dS
cum_dNdS = cumdN/cumdS

# Record the average number of syn substitutions per gene
AvgSyn = mean(Ks)

return(list(omega=mean(omega), cum_dNdS=cum_dNdS, Avg_dNdS=Avg_dNdS,
            NbGenes_dNdS_NA=NbGenes_dNdS_NA, AvgSyn=AvgSyn))
}

Short branch, 400 codons per gene

For genes with 400 codons, on a short branch (U=1e-2), the PGM method overestimates dN/dS, while the CM method provides an unbiased estimate:

Short branch, 1000 codons per gene

For longer genes (1000 codons), on a short branch (U=1e-2), the PGM method still overestimates dN/dS, although less strongly (the CM method remains unbiased):

Long branch, 400 codons per gene

For longer branches (U=1e-1), the bias of the PGM method becomes negligible:
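These three comparisons can be reproduced with a condensed, self-contained version of the simulation. Note that `pgm_vs_cm` is my own compact re-implementation of `run_sim1`, written here only so the block runs on its own; it exploits the fact that `rbinom` recycles vector arguments, instead of using `mapply`:

```r
set.seed(123)  # illustrative seed; results vary between runs

# Compact re-implementation of the simulation: returns the two dN/dS estimates
pgm_vs_cm <- function(L, U, Ngene = 900) {
  omega <- 10^runif(Ngene, min = -2, max = -0.5)        # log-uniform omega per gene
  Ks <- rbinom(Ngene, size = L, prob = U)                # syn substitutions
  muN <- rbinom(Ngene, size = 2 * L, prob = U)           # non-syn mutations
  Kn <- rbinom(Ngene, size = muN, prob = omega)          # non-syn substitutions
  dNdS <- (Kn / (2 * L)) / (Ks / L)
  dNdS[Ks == 0] <- NA                                    # dN/dS undefined when dS = 0
  c(PGM = mean(dNdS, na.rm = TRUE),
    CM  = (sum(Kn) / (2 * L * Ngene)) / (sum(Ks) / (L * Ngene)))
}

pgm_vs_cm(L = 400,  U = 1e-2)  # short branch, 400 codons: PGM markedly above CM
pgm_vs_cm(L = 1000, U = 1e-2)  # short branch, 1000 codons: gap reduced
pgm_vs_cm(L = 400,  U = 1e-1)  # long branch: PGM and CM nearly coincide
```

On a typical run, PGM exceeds CM substantially in the first scenario, while the gap nearly vanishes in the third.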

Conclusion

The CM approach should be preferred to the PGM approach: the two give essentially the same result on long branches, but only CM is robust to sampling variance on short branches.

Acknowledgments

Thanks to Carina Mugal, who explained all these points to me.

Mon Jan 16 13:36:23 2023