Tue Aug 9 12:17:54 2022
Sylvain Glémin proposed a new method to estimate the population-scaled gBGC coefficient (B), from the analysis of site frequency spectra (SFS) and accounting for SNP polarization errors. The principle of the method is similar to that described in Glémin et al. (2015), but it is much more rapid, because it is based on least squares instead of the maximum-likelihood framework.
To evaluate this method, I simulated datasets subject to different levels of gBGC (B ranging from 0 to 2), with levels of polarisation errors similar to those reported in humans. I then estimated B using least squares (gBGC_ELS).
For a comparison, I also estimated B by simple regression:
I also applied these methods on human popgen data (from the 1000 genomes project) to compare with the results obtained with the ML approach (Glémin et al. 2015).
I analyzed autosomal SNPs from the 1000 genomes project (phase3 release, 2504 individuals). The entire dataset contains 74.6 million autosomal SNPs that can be polarized, among which 98.92% are in non-coding regions, and 0.42% correspond to synonymous mutations (N=313448 synonymous SNPs).
To polarize SNPs, I used two approaches:
In the results presented below, SNPs were polarized using data from: 1KG
The proportion of different types of mutations in the entire dataset is:
## SS SW WS WW
## 0.09 0.49 0.35 0.07
These proportions are quite different for synonymous SNPs (owing to the high GC-content at synonymous site):
## SS SW WS WW
## 0.06 0.70 0.22 0.02
In particular, the proportion CpG SNPs (i.e. SNPs for which the derived or ancestral allele is in a CpG context) is quite different in the two subsets. CpG SNPs represent 28.0% of non-coding SNPs, and 49.8% of synonymous SNPs.
I computed SFSs using 100 bins of DAF. The SFSs are strongly skewed towards very low DAFs. To better visualize differences between WS, WS and SS/WW SFSs, I plot the Y axis in log scale:
This excess of ‘high-DAF’ variants reflect polarization errors, which are particularly abundant at CpG sites:
The shape of the SFS for non-CpG sites also shows an excess of high-DAF variants, due to polarization errors:
To evaluate Sylvain’s method, I performed simulations.
I take as input a neutral DAF spectrum, to which I apply gBGC (population-scaled gBGC coefficient B) on WS and SW mutations (see equation (4) from Glémin et al. 2015). For WW+SS mutation, I consider a neutral DAF spectrum:
DAF = (1:n)/(n+1)
Hws = 2 * (1-exp(-B*(1-DAF)))/(DAF*(1-DAF)*(1-exp(-B)))
Hsw = 2 * (1-exp(B*(1-DAF)))/(DAF*(1-DAF)*(1-exp(B)))
Hssww = 1/(1:n)
Here n corresponds to the number of bins of DAF (NbClassDAF).
To get more realistic SFS, corresponding to human demography, I used correction parameters for each frequency category, inferred from the DAF spectrum of WW/SS SNPs observed in human pops (see above).
I then normalize these SFS, taking into account the base composition of the human genome, and the relative rates of SW, WS or SS/WW mutations (rSW_WS, rWS_SSWW), to match with the observed proportions of WS, SW and WW/SS SNPs in human pops.
I then randomly draw the number of SNPs for each class of mutation and each bin of DAF.
Finally, based on these true SFS, I compute the observed SFS, accounting for polarization errors.
I then used three methods to estimate B based on the observed SFS:
I performed these simulations for values of B ranging from B=0 to B=2. For each value of B, I perform 10 replicates (to explore the impact of sampling noise on the estimate of B). NB: I repeated 3 times this set of simulation for gBGC_ELS [this is just because this script was initially prepared to evaluate a former version of the method (weighted regression) that I had tested with 3 different levels of precision].
The accuracy of the method is expected to depend on the number of SNPs and on the number of bins of DAF. A higher number of bins may provide more information on the shape of the SFS, but may lead to more sampling noise, due to limited number of SNPs per DAF bin.
I first ran simulations with polarization error rates corresponding to CpG sites (i.e. high polarization error rate for SW mutations), and a GC-content corresponding to synonymous codon position (i.e. GC-rich sequences).
# Polarization error rates for WS and SW mutations:
eSW=0.04 # matches observed error rate at CpG sites in humans
eWS=0.01 # matches observed error rate at other sites
# Ratio of mutation rates: SW/WS and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.65
# Range of B values
lB=c(0.001, 0.25, 0.5, 1, 2)
# Number of simulation replicates per set of parameters
nbReplicate=10
I first run simulations for a large dataset (~3 M SNPs), and estimate B using 20 bins of DAF. I show below the simulated SFSs for each value of B, and the corresponding plots [log(WS/SW) vs DAF] on which B estimates are based. To get an idea of the number of SNPs on which log(WS/SW) is computed, I report the median and minimal number of WS+SW SNPs per DAF bin.
I then plot the inferred B vs the true B:
For all methods, the inferred B is strongly correlated with the true B. But both Method0 and Method1 overestimate B, while gBGC_ELS provides an unbiased estimate of B.
With ~300k SNPs (which corresponds to the total number of synonymous SNPs in the human dataset), there is more dispersal in the estimate of B, but estimates remain very accurate.
With ~100k SNPs (e.g. if we wanted to estimate B on three subsets of synonymous SNPs), the median number of SNPs per bin of DAf is ~300. The estimates of B with gBGC_ELS is still very accurate.
With ~50k SNPs, the median number of SNPs per bin of DAf is ~150. gBGC_ELS estimates are slightly more noisy (but remains quite good and unbiased - contrarily to Method1).
With ~20k SNPs, the median number of SNPs per bin of DAf is ~70. gBGC_ELS estimates are slightly more noisy (but remains quite good and unbiased - contrarily to Method1).
If we increase the number of bins of DAF, this does not affect the estimates of B.
If we consider only 10 bins of DAF, gBGC_ELS provides estimates of B that are quite accurate and that remain unbiased even with limited SNP sample size.
I provide below a summary of the accuracy of the different methods for this set of parameters (mimicking CpG sites at synonymous codon positions). I evaluated the accuracy according to two criteria:
As expected, gBGC_ELS is sensitive to the number of SNPs : the lower the number of SNPs, the higher the level of noise in gBGC_ELS estimates. Interestingly, gBGC_ELS is more accurate than Method1.
For a high number of SNPs, changing the number of DAF bins (NbClassDAF) does not affect much the estimate of B. When the number of SNPs is more limited, NbClassDAF=20 seems to perform slightly better than 10 bins of DAF.
NB: when the number of SNPs is limited, gBGC_ELS generally cannot estimate B with NbClassDAF=100 because some DAF bins have 0 count (hence gBGC_ELS reports NA). However, if, by chance, all DAF bins have >0 count, then gBGC_ELS would report an estimate of B that can be very inaccurate (this sometimes appears in some replicates). This problem would not necessarily be detected by bootstraping (because only the bootstraps with non-0 bins would report a result). Maybe it would be useful to set a lower limit on the number of SNP per DAF bin (or to put a warning).
Both Method0 and Method1 provide B estimates that are strongly correlated to the true B (but not as well as gBGC_ELS). However, these estimates are very sensitive to variation in GC-content (see below for more details).
As expected, Method0 overestimates B, and this overestimate increases as the number of bins of DAFs decreases.
Method1 is less biased than Method0, but still leads to overestimates B.
gBGC_ELS is able to provide unbiased estimates of B, even when the number of SNPs per DAF bin is low.
I then ran simulations with polarization error rates corresponding to non-CpG sites (i.e. with eSW=eWS=0.01), and a GC-content corresponding to synonymous codon position (i.e. GC-rich sequences).
# Polarization error rates for WS and SW mutations:
eSW=0.01 # matches observed error rate at non-CpG sites in humans
eWS=0.01 # matches observed error rate at non-CpG sites in humans
# Ratio of mutation rates: SW/WS and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.65
In terms of measuring variation in B, the results are similar to previous simulations.
For non-CpG sites, polarization errors are weaker, and hence B estimates are less biased. In particular, Method1 performs relatively well. However, gBGC_ELS clearly outperforms Method1.
I then ran simulations with polarization error rates corresponding to CpG sites (i.e. with eSW=0.04 and eWS=0.01), and a GC-content corresponding to the genome-wide average (40% G+C).
# Polarization error rates for WS and SW mutations:
eSW=0.04 # matches observed error rate at CpG sites in humans
eWS=0.01 # matches observed error rate at other sites
# Ratio of mutation rates: SW/WS and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.4
Again, in terms of measuring variation in B, the results are similar to previous simulations.
Again, gBGC_ELS is unbiased (contrarily to Method0 and Method1).
Finally, I ran simulations with polarization error rates corresponding to non-CpG sites (i.e. with eSW=eWS=0.01), and a GC-content corresponding to the genome-wide average (40% G+C).
# Polarization error rates for WS and SW mutations:
eSW=0.01 # matches observed error rate at non-CpG sites in humans
eWS=0.01 # matches observed error rate at non-CpG sites in humans
# Ratio of mutation rates: SW/WS and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.4
Performances are unchanged in terms of accuracy.
But for this set of parameters, all methods are unbiased.
The impact of polarization errors on the SFS (and hence on B estimates) depends on the difference between the number of WS vs SW mispolarized SNPs. It therefore depends on eWS, eSW, rSW_WS and on the GC-content of sequences. Given that the GC-content varies widely along the genome, this might induce variation in the estimate of B.
To assess how the different methods are affected by GC-content, I performed simulation for sequences of different GC-content (from 0.3 to 0.8, to mimick variation in GC3 at synonymous codons positions), with 1.5M, 150k or 50k SNPs. I set NbClassDAF=20.
I considered a genome composed of 5 compartments (each corresponding to 20% of of the genome) with B=0.001, B=0.25, B=0.5, B=1, B=2. I computed the difference between the true B and the inferred B in each compartment, and then measured the mean of these differences.
I first considered a situation where eSW=eWS=0.01 and where rSW_WS=1 (i.e. no mutational bias, same error rate for WS and SW mutations):
Method0 and Method1 tend to be biased when the GC-content of sequences is far from the mutational equilibrium (50%) => this induces an artefactual correlation between B and GC-content.
gBGC_ELS is able to provide unbiased estimates, whatever the GC-content.
I then considered a more realistic case, with rWS_SSWW=2, and with error rates corresponding to that at CpG sites (eWS=0.01, eSW=0.04).
Both Method0 and Method1 overestimate B, and this bias increases with sequence GC-content => this induces an artefactual correlation between B and GC-content.
gBGC_ELS provides unbiased estimates of B for a wide range of GC-content.
Finally, I considered a case with rWS_SSWW=2, and with error rates corresponding to that at non-CpG sites (eWS=0.01, eSW=0.01).
Again, with a large number of SNPs, gBGC_ELS provides unbiased estimates, whatever the GC-content.
I first estimated the genome-wide B, using either the entire dataset of non-coding SNPs, or CpG and non-CpG SNPs separately. I set the number of DAF bins (NbClassDAF) to 20.
NB: it is possible to estimate the uncertainty around gBGC_ELS estimates by bootstrapping the SNP dataset. But I have not yet implemented this in my script. So here I only provide one single estimate per SNP dataset.
I show below the observed SFSs and, and the corresponding plots [log(WS/SW) vs DAF] from which B is inferred:
NbClassDAF=20
I repeat the analysis with NbClassDAF=100.
NbClassDAF=100
The table below provides estimates of B obtained by maximum likelihood (M1e = M1* = model M1 with polarization errors) or by Method1 or gBGC_ELS, for NbClassDAF=20.
For gBGC_ELS, I performed 30 bootstrap replicates. I show the mean (and the standard deviation of the mean) for the estimate of B (gBGC_ELS). I also show the mean estimates of the mutation bias (mutational equilibrium GC), and of the two polarization error rates (e1=rate of WS mutations mispolarized, e2=SW).
| Sites | Method1 | M1e | gBGC_ELS | GCeqMut | e1 | e2 |
|---|---|---|---|---|---|---|
| All | 0.42 ± 0.01 | 0.38 | 0.394 ± 0.002 | 0.407 | 0.0050 | 0.0136 |
| Non-CpG | 0.27 ± 0.01 | 0.33 | 0.339 ± 0.003 | 0.425 | 0.0067 | 0.0049 |
| CpG | 0.74 ± 0.02 | 0.50 | 0.536 ± 0.006 | 0.364 | 0.0044 | 0.0344 |
Same table for NbClassDAF=100:
| Sites | Method1 | M1e | gBGC_ELS | GCeqMut | e1 | e2 |
|---|---|---|---|---|---|---|
| All | 0.42 ± 0.00 | 0.38 | 0.365 ± 0.002 | 0.410 | 0.0032 | 0.0124 |
| Non-CpG | 0.26 ± 0.01 | 0.33 | 0.310 ± 0.002 | 0.428 | 0.0044 | 0.0033 |
| CpG | 0.76 ± 0.01 | 0.50 | 0.491 ± 0.004 | 0.369 | 0.0037 | 0.0341 |
Overall, genome-wide estimates of B given by method gBGC_ELS are pretty close to ML estimates, both at non-CpG and at CpG sites (which are the most affected by polarization errors). Increasing NbClassDAF from 20 to 100 does not change much the match between ML and method gBGC_ELS estimates.
#rec="hapmap"
rec="bherer"
We are often interested in measuring variation in B along the genome. For this, we consider groups of SNPs (e.g. binned according to recombination rate or by windows along the genome), and we estimate B on each group. For example, in Glémin et al. (2015), we analyzed B in 1Mb-long windows along the genome (~2600 windows).
The above simulations indicate that gBGC_ELS is biased when the number of SNPs is limited. Hence, the accuracy of the method is expectd to depend on the size of genomic windows on which we infer B.
To evaluate this point, I divided the SNP dataset into 1, 10, 50, 100, 200, 500 or 2000 subsets, and inferred B on each subset. I then measured the genome-wide B by taking the average over all subsets.
=> gBGC_ELS appear to be robust: whatever the window size, the average estimate remains unchanged (contrarily to what we had with the weighted regression method).
NbClassDAF=20
nbinsNC=500
Here I classified non-coding SNPs into 500 bins, according to their recombination rate. I computed B on each bin, and I plotted B vs recombination rates. For a comparison, I classified synonymous SNPs into three bins of recombination rate.
NB: recombination data from a high-resolution pedigree-based recombination map: Bhérer C, Campbell CL and Auton A 2017 Refined genetic maps reveal sexual dimorphism in human meiotic recombination at multiple scales. Nature Communications 8: 14994. https://doi.org/10.1038/ncomms14994
nbinsNC=50
With 50 bins:
As expected, I observed strong correlations between estimates of B and recombination rates. Note however that the correlation is not linear (here I plotted the Log of recombination rate). The reason for this non-linear relationship is unclear (but might be due to the heterogeneity caused by recombination hotspots).
In their Fig. 5, Glémin et al (2015) compared the values of B at CpG or non-CpG sites, estimated with model M1 (i.e. without accounting for polarization errors, Fig5B) or M1* (i.e. accounting for polarization errors, Fig5D). With M1*, they observed that for a same recombination rate, CpG and non-CpG sites have similar values of B (contrarily to model M1, where B is over-estimated for CpG sites):
Fig. 5BD, Glémin et al (2015)
I checked whether we could retrieve this result with gBGC_ELS. For this, I classified non-coding SNPs into 50 or 500 bins of contiguous SNPs. I computed B on each bin, and I plotted B vs recombination rates.
nbinsNC=50
=> Method1 clearly overestimate B at CpG sites (relative to the ML estimate). gBGC_ELS gives results that are closer to ML estimates.
nbinsNC=500
=> gBGC_ELS is clearly less biased than Method1: even with nbinsNC=500, it provides estimates that are pretty similar to those obtained by ML.
gBGC_ELS is able to provide unbiased estimates of B, both for non-CpG and CpG sites, even when the number of SNPs is quite limited. It is robust to variation in GC-content.
=> very nice improvement compared to the previous version (weighted regression)!!
Tue Aug 9 14:00:48 2022