Wed Mar 9 11:22:07 2022

1. Introduction

Sylvain Glémin proposed a new method to estimate the population-scaled gBGC coefficient (B), from the analysis of site frequency spectra (SFS) and accounting for SNP polarization errors. The principle of the method is similar to that described in Glémin et al. (2015), but it is much more rapid, because it is uses regression instead of the maximum-likelihood framework.

Method principle

In brief, this method takes as input the SFSs of WS mutations (yWS) and SW mutations (ySW). From these SFSs, it computes the inversed SFSs:

  • zWS = rev(yWS)
  • zSW = rev(ySW)

which are used to create two variable:

  • u = zWS / ySW
  • v = zSW / yWS

for more precision, additional variables can be computed u^2, u^3, u^4 … v^2, v^3, v^4

Then one computes the multiple regression:

log(yWS/ySW) ~ x + u + v + u^2 + v^2 + u^3 +v^3

where x is the derived allele frequency (i/n).

The coefficient for x corresponds to B (like in the standard approach) and terms u and v account for polarization errors.

Method evaluation

To evaluate this method, I simulated datasets subject to different levels of gBGC (B ranging from 0 to 2), with levels of polarisation errors similar to those reported in humans. I then estimated B using multiple regression (MethodSG) with different numbers of factors (nf):

  • MethodSG nf=1: log(yWS/ySW) ~ x + u + v
  • MethodSG nf=2: log(yWS/ySW) ~ x + u + v + u^2 + v^2
  • MethodSG nf=3: log(yWS/ySW) ~ x + u + v + u^2 + v^2 + u^3 +v^3

For a comparison, I also estimated B by simple regression:

  • Method0: log(yWS/ySW) ~ x (i.e. without accounting for polarization errors)
  • Method1: log(yWS/ySW) ~ x (computed on SNPs with a DAF < 0.9, to exclude the subset of SNPs that are the most strongly affected by polarization errors)

I also applied these methods on human popgen data (from the 1000 genomes project) to compare with the results obtained with the ML approach (Glémin et al. 2015).

2. Human polymorphism dataset

I analyzed autosomal SNPs from the 1000 genomes project (phase3 release, 2504 individuals). The entire dataset contains 74.6 million autosomal SNPs that can be polarized, among which 98.92% are in non-coding regions, and 0.42% correspond to synonymous mutations (N=313448 synonymous SNPs).

The proportion of different types of mutations in the entire dataset is:

##   SS   SW   WS   WW 
## 0.09 0.49 0.35 0.07

These proportions are quite different for synonymous SNPs (owing to the high GC-content at synonymous site):

##   SS   SW   WS   WW 
## 0.06 0.70 0.22 0.02

In particular, the proportion CpG SNPs (i.e. SNPs for which the derived or ancestral allele is in a CpG context) is quite different in the two subsets. CpG SNPs represent 28.0% of non-coding SNPs, and 49.8% of synonymous SNPs.

I computed SFSs using 100 bins of DAF. The SFSs are strongly skewed towards very low DAFs. To better visualize differences between WS, WS and SS/WW SFSs, I plot the Y axis in log scale:

3. Evaluate methods by simulations

To evaluate Sylvain’s method, I performed simulations.

I take as input a neutral DAF spectrum, to which I apply gBGC (population-scaled gBGC coefficient B) on WS and SW mutations (see equation (4) from Glémin et al. 2015). For WW+SS mutation, I consider a neutral DAF spectrum:

DAF = (1:n)/(n+1)
Hws = 2 * (1-exp(-B*(1-DAF)))/(DAF*(1-DAF)*(1-exp(-B)))
Hsw = 2 * (1-exp(B*(1-DAF)))/(DAF*(1-DAF)*(1-exp(B)))
Hssww = 1/(1:n)

Here n corresponds to the number of bins of DAF (NbClassDAF).

To get more realistic SFS, corresponding to human demography, I used correction parameters for each frequency category, inferred from the DAF spectrum of WW/SS SNPs observed in human pops (see above).

I then normalize these SFS, taking into account the base composition of the human genome, and the relative rates of SW, WS or SS/WW mutations (rSW_WS, rWS_SSWW), to match with the observed proportions of WS, SW and WW/SS SNPs in human pops.

I then randomly draw the number of SNPs for each class of mutation and each bin of DAF.

Finally, based on these true SFS, I compute the observed SFS, accounting for polarization errors.

I then used three methods to estimate B based on the observed SFS:

I performed these simulations for values of B ranging from B=0 to B=2. For each value of B, I perform 10 replicates (to explore the impact of sampling noise on the estimate of B).

The accuracy of the method is expected to depend on the number of SNPs and on the number of bins of DAF. A higher number of bins may provide more information on the shape of the SFS, but may lead to more sampling noise, due to limited number of SNPs per DAF bin.

Simulation parameters

I first ran simulations with polarization error rates corresponding to CpG sites (i.e. high polarization error rate for SW mutations), and a GC-content corresponding to synonymous codon position (i.e. GC-rich sequences).

# Polarization error rates for WS and SW mutations:
eSW=0.04 # matches observed error rate at CpG sites in humans
eWS=0.01 # matches observed error rate at other sites
# Ratio of mutation rates: SW/WS  and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.65
# Range of B values
lB=c(0.001, 0.25, 0.5, 1, 2)
# Number of simulation replicates per set of parameters
nbReplicate=10

NbClassDAF=20

number of SNPs~3M

I first run simulations for a large dataset (~3 M SNPs), and estimate B using 20 bins of DAF. I show below the simulated SFSs for each value of B, and the corresponding plots [log(WS/SW) vs DAF] on which B estimates are based. To get an idea of the number of SNPs on which log(WS/SW) is computed, I report the median and minimal number of WS+SW SNPs per DAF bin.

I then plot the inferred B vs the true B:

For all methods, the inferred B is strongly correlated with the true B. But both Method0 and Method1 overestimate B, while MethodSG provides an unbiased estimate of B. Higher values of nf lead to more noise in the estimate of B.

number of SNPs~300k

With ~300k SNPs (which corresponds to the total number of synonymous SNPs in the human dataset), there is more dispersal in the estimate of B, particularly of MethodSG with nf=2 or nf=3. But MethodSG with nf=1 still provides a relatively accurate and unbiased estimates of B.

number of SNPs~100k

With ~100k SNPs (e.g. if we wanted to estimate B on three susbsets of synonymous SNPs), the median number of SNPs per bin of DAf is ~300. The estimates of B with MethodSG becomes quite noisy, even with nf=1.

number of SNPs~50k

With ~50k SNPs, the median number of SNPs per bin of DAf is ~150. MethodSG (with nf=1) tends to overestimate B (it is as biased as Method1).

NbClassDAF=100

If we increase the number of bins of DAF, this does not improve the estimates of B. On the contrary when the number of SNPs is limited, then MethodSG becomes less accurate and more biased than Method1.

number of SNPs~3M

number of SNPs~300k

number of SNPs~100k

NbClassDAF=10

If we consider only 10 bins of DAF, MethodSG with nf=1 provides estimates of B that are more noisy than with 20 bins of DAF, but that remain unbiased even with limited SNP sample size.

number of SNPs~3M

number of SNPs~300k

number of SNPs~100k

number of SNPs~50k

Summary of simulations: step 1: CpG sites, GC_content=0.65

I provide below a summary of the accuracy of the different methods for this set of parameters (mimicking CpG sites at synonymous codon positions). I evaluated the accuracy according to two criteria:

  • Correlation between the estimated and the true B => measure to what extent the variation in the estimated B reflects true variation
  • Mean difference between the estimated and the true B => measure to what extent the estimates of B are biased

As expected, methodSG is sensitive to the number of bins of DAFs (NbClassDAF) and also to the median number of SNPs per DAF bin. The lower the number of SNPs, the higher the level of noise in methodSG estimates. Increasing the number of factors (from nf=1 to nf=3) leads to increase the level of noise.

For a given number of SNPs per DAF bin, increasing the number of DAF bins leads to stronger correlations between the estimated and true B. However, increasing the number of DAF bins implies that the number of SNPs per DAF bin is decreased. There is therefore a trade-off between increasing the number of DAF bins (to better capture the shape of the SFS) and increasing the number of SNPs per DAF bin (to reduce sampling noise).

Both Method0 and Method1 provide B estimates that are strongly correlated to the true B, even for small SNP datasets. Thus, eventhough these methods provide biased estimates (see below), they may be useful to capture variation of B along the genome (but see below for more details regarding biases linked to variation in GC-content).

As expected, Method0 overestimates B, and this overestimate increases as the number of bins of DAFs decreases.

Method1 is less biased than Method0, but still leads to overestimates B.

MethodSG is able to provide unbiased estimates of B, but only when the number of SNPs per DAF bin is sufficiently large. For a median number SNPs per DAF bin below 500, MethodSG tends to overestimate B (even more than Method1 for small SNP datasets). Overall MethodSG with nf=1 tends to perform better than with nf=2 or 3.

For this set of parameters, increasing the number of factors (from nf=1 to nf=3) does not reduce the bias in the estimate of B (on the contrary!).

Summary of simulations: step 2: non-CpG sites, GC_content=0.65

I then ran simulations with polarization error rates corresponding to non-CpG sites (i.e. with eSW=eWS=0.01), and a GC-content corresponding to synonymous codon position (i.e. GC-rich sequences).

# Polarization error rates for WS and SW mutations:
eSW=0.01 # matches observed error rate at non-CpG sites in humans
eWS=0.01 # matches observed error rate at non-CpG sites in humans

# Ratio of mutation rates: SW/WS  and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.65

In terms of measuring variation in B, the results are similar to previous simulations: MethodSG is more noisy than Method0 or Method1 when the number of SNPs is limited.

For non-CpG sites, polarization errors are weaker, and hence B estimates are less biased. In particular, Method1 performs relatively well. MethodSG outperforms Method1 only when the number of SNPs is high

Summary of simulations: step 3: CpG sites (GC_content=0.4)

I then ran simulations with polarization error rates corresponding to CpG sites (i.e. with eSW=0.04 and eWS=0.01), and a GC-content corresponding to the genome-wide average (40% G+C).

# Polarization error rates for WS and SW mutations:
eSW=0.04 # matches observed error rate at CpG sites in humans
eWS=0.01 # matches observed error rate at other sites
# Ratio of mutation rates: SW/WS  and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.4

Again, in terms of measuring variation in B, the results are similar to previous simulations: MethodSG is more noisy than Method0 or Method1 when the number of SNPs is limited.

Again, MethodSG is less biased than Method0 or Method1 when the number of SNPs is sufficiently large

Summary of simulations: step 4: non-CpG sites (GC_content=0.4)

Finally, I ran simulations with polarization error rates corresponding to non-CpG sites (i.e. with eSW=eWS=0.01), and a GC-content corresponding to the genome-wide average (40% G+C).

# Polarization error rates for WS and SW mutations:
eSW=0.01 # matches observed error rate at non-CpG sites in humans
eWS=0.01 # matches observed error rate at non-CpG sites in humans

# Ratio of mutation rates: SW/WS  and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.4

Performances are unchanges in terms of accuracy.

But for this set of parameters, all methods are unbiased. Given that MethodSG is more noisy, Method0 and Method1 are better than MethodSG.

Impact of GC-content

The impact of polarization errors on the SFS (and hence on B estimates) depends on the difference between the number of WS vs SW mispolarized SNPs. It therefore depends on eWS, eSW, rSW_WS and on the GC-content of sequences. Given that the GC-content varies widely along the genome, this might induce variation in the estimate of B.

To assess how the different methods are affected by GC-content, I performed simulation for sequences of different GC-content (from 0.3 to 0.8, to mimick variation in GC3 at synonymous codons positions), with 3M, 300k or 100k SNPs. I set NbClassDAF=20.

I considered a genome composed of 5 compartments (each corresponding to 20% of of the genome) with B=0.001, B=0.25, B=0.5, B=1, B=2. I computed the difference between the true B and the inferred B in each compartment, and then measured the mean of these differences.

I first considered a situation where eSW=eWS=0.01 and where rSW_WS=1 (i.e. no mutational bias, same error rate for WS and SW mutations):

All methods tends to be biased when the GC-content of sequences is far from the mutational equilibrium (50%) => this induces an artefactual correlation between B and GC-content.

MethodSG is able to provide unbiased estimates, but only if the number of SNPs is very large. With smaller SNP sample sizes, Method1 is more robust than MethodSG.

I then considered a more realistic case, with rWS_SSWW=2, and with error rates corresponding to that at CpG sites (eWS=0.01, eSW=0.04).

Both Method0 and Method1 overestimate B, and this bias increases with sequence GC-content => this induces an artefactual correlation between B and GC-content.

MethodSG with nf=1 provides (relatively) unbiased estimates of B for a wide range of GC-content, but only when the median number of SNPs per DAF bin is sufficiently large. When the number of SNPs is limited (~300 SNPs per DAF bin), MethodSG with nf=1 tends to overestimate B.

Finally, I considered a case with rWS_SSWW=2, and with error rates corresponding to that at non-CpG sites (eWS=0.01, eSW=0.01).

Again, with a large number of SNPs, MethodSG with nf=1 provides unbiased estimates, whatever the GC-content. But for smaller SNP dataset, Method1 performs as well as MethodSG (except for very GC-rich sequences, where MethodSG is still better than Method1).

4. Estimate B on empirical data

Genome-wide B, NbClassDAF=20

I first estimated the genome-wide B, using either the entire dataset of non-coding SNPs, or CpG and non-CpG SNPs separately. I set the number of DAF bins (NbClassDAF) to 20.

I show below the observed SFSs and, and the corresponding plots [log(WS/SW) vs DAF] from which B is inferred:

NbClassDAF=20

All non-coding SNPs

Non-CpG SNPs

CpG SNPs

Genome-wide B, NbClassDAF=100

I repeat the analysis with NbClassDAF=100.

NbClassDAF=100

All non-coding SNPs

Non-CpG SNPs

CpG SNPs

Comparison with Glémin et al. 2015 (Table 1)

The table below provides estimates of B obtained by maximum likelihood (M1e = M1* = model M1 with polarization errors) or by Method1 or MethodSG (with nf=1, 2 or 3), for NbClassDAF=20:

Sites M1e Method1 MethodSG_nf1 MethodSG_nf2 MethodSG_nf3
All 0.38 0.42 ± 0.01 0.39 ± 0.01 0.35 ± 0.01 0.32 ± 0.03
Non-CpG 0.33 0.27 ± 0.01 0.24 ± 0.02 0.29 ± 0.03 0.32 ± 0.07
CpG 0.50 0.74 ± 0.02 0.59 ± 0.03 0.48 ± 0.05 0.32 ± 0.10

Same table for NbClassDAF=100:

Sites M1e Method1 MethodSG_nf1 MethodSG_nf2 MethodSG_nf3
All 0.38 0.42 ± 0.00 0.36 ± 0.01 0.36 ± 0.01 0.35 ± 0.01
Non-CpG 0.33 0.26 ± 0.01 0.24 ± 0.00 0.24 ± 0.01 0.26 ± 0.01
CpG 0.50 0.76 ± 0.01 0.58 ± 0.01 0.57 ± 0.02 0.47 ± 0.03

Overall, genome-wide estimates of B given by method SG (with nf=1 or nf=2) are pretty close to ML estimates, notably at CpG sites (which are the most affected by polarization errors). Increasing NbClassDAF from 20 to 100 does not improve much the match between ML and method SG estimates.

Estimate B per genomic window

rec="hapmap"
#rec="bherer"

Impact of the number of SNPs on the estimate of B on CpG sites

We are often interested in measuring variation in B along the genome. For this, we consider groups of SNPs (e.g. binned according to recombination rate or by windows along the genome), and we estimate B on each group. For example, in Glémin et al. (2015), we analyzed B in 1Mb-long windows along the genome (~2600 windows).

The above simulations indicate that method SG is biased when the number of SNPs is limited. Hence, the accuracy of the method is expectd to depend on the size of genomic windows on which we infer B.

To evaluate this point, I divided the SNP dataset into 1, 10, 50, 100, 200, 500 or 2000 subsets, and inferred B on each subset. I then measured the genome-wide B by taking the average over all subsets.

At CpG sites (for which the polarization error is the highest), estimates of B are sensitive to the number of SNPs: below 1000 SNPs per DAF bin (i.e. when the entire dataset is split into ~100 windows), the estimate of B tends to increase, well above the ML estimate, and become more biased than Method1. This suggests that, when we split the dataset in a large number of small genomic window, B is overestimated at CpG sites (as observed in simulations). For non-CpG sites, methodSG and method1 provide very similar estimates, even for small window sizes (again, consistent with simulations run with GC-content=0.4).

Bins of recombination rate

NbClassDAF=20
nbinsNC=500

Here I classified non-coding SNPs into 500 bins, according to their recombination rate. I computed B on each bin, and I plotted B vs recombination rates. For a comparison, I classified synonymous SNPs into three bins of recombination rate.

nbinsNC=50

With 50 bins:

As expected, I observed strong correlations between estimates of B and recombination rates. Note however that the correlation is not linear (here I plotted the Log of recombination rate). The reason for this non-linear relationship is unclear (but might be due to the heterogeneity caused by recombination hotspots). Interestingly, the correlations are generally stronger for Method1 than for method SG, even for nf=1. This is consitent with simulations, which show that method SG provides estimates that are less affected by polarization errors, but that are more noisy. It is also possible that the correlation is artefactually enhanced by Method1, because regions of high recombination have a higher GC-content.

We see again that the genome-wide estimates obtained with method SG is inflated when the dataset is split into subsets with limited number of SNPs.

Bins of contiguous SNPs: CpG vs non-CpG

In their Fig. 5, Glémin et al (2015) compared the values of B at CpG or non-CpG sites, estimated with model M1 (i.e. without accounting for polarization errors, Fig5B) or M1* (i.e. accounting for polarization errors, Fig5D). With M1*, they observed that for a same recombination rate, CpG and non-CpG sites have similar values of B (contrarily to model M1, where B is over-estimated for CpG sites):

Fig. 5BD, Glémin et al (2015)

Fig. 5BD, Glémin et al (2015)

I checked whether we could retrieve this result with method SG. For this, I classified non-coding SNPs into 50 or 500 bins of contiguous SNPs. I computed B on each bin, and I plotted B vs recombination rates.

nbinsNC=50

nbinsNC=50

=> Method1 clearly overestimate B at CpG sites (relative to the ML estimate). MethodSG gives results that are closer to ML estimates (notably with nf=2).

nbinsNC=500

nbinsNC=500

=> with nbinsNC=500, MethodSG with nf=1 seems to be as biased as Method1. With nf=2 or 3, estimates appear to be very noisy.

5. Conclusion

For CpG sites, MethodSG is able to provide unbiased estimates of B, but only when the number of SNPs is very large. Hence, if one wants to analyze variation in B along the genome at the Mb scale (i.e. split the dataset into ~2500 bins), then MethodSG is not appropriate. The number of SNPs at synonymous sites appears to be at the limit where MethodSG is still accurate. But it is not possible to further split this subset (e.g. to compare B at synonymous sites for genes in regions of high vs low recombination rate).

For non-CpG sites, MethodSG is able to provide unbiased estimates of B, even in relatively small genomic windows, and is generally better than Method1.

Wed Mar 9 11:50:02 2022