Tue Aug 9 12:17:54 2022

1. Introduction: gBGC_ELS

Sylvain Glémin proposed a new method to estimate the population-scaled gBGC coefficient (B), from the analysis of site frequency spectra (SFS) and accounting for SNP polarization errors. The principle of the method is similar to that described in Glémin et al. (2015), but it is much more rapid, because it is based on least squares instead of the maximum-likelihood framework.

Method evaluation

To evaluate this method, I simulated datasets subject to different levels of gBGC (B ranging from 0 to 2), with levels of polarisation errors similar to those reported in humans. I then estimated B using least squares (gBGC_ELS).

For a comparison, I also estimated B by simple regression:

Method0: log(yWS/ySW) ~ x (i.e. without accounting for polarization errors)
Method1: log(yWS/ySW) ~ x (computed on SNPs with a DAF < 0.9, to exclude the subset of SNPs that are the most strongly affected by polarization errors)

I also applied these methods on human popgen data (from the 1000 genomes project) to compare with the results obtained with the ML approach (Glémin et al. 2015).

2. Human polymorphism dataset

I analyzed autosomal SNPs from the 1000 genomes project (phase3 release, 2504 individuals). The entire dataset contains 74.6 million autosomal SNPs that can be polarized, among which 98.92% are in non-coding regions, and 0.42% correspond to synonymous mutations (N=313448 synonymous SNPs).

To polarize SNPs, I used two approaches:

1- Polarization based on outgroups from the EPO-6primates alignment (=data provided by 1KG)
2- Polarization estimated by est-sfs (using chimp and gorilla as outgroups, with model=kimura 2 parameters)

In the results presented below, SNPs were polarized using data from: 1KG

The proportion of different types of mutations in the entire dataset is:

##   SS   SW   WS   WW 
## 0.09 0.49 0.35 0.07

These proportions are quite different for synonymous SNPs (owing to the high GC-content at synonymous site):

##   SS   SW   WS   WW 
## 0.06 0.70 0.22 0.02

In particular, the proportion CpG SNPs (i.e. SNPs for which the derived or ancestral allele is in a CpG context) is quite different in the two subsets. CpG SNPs represent 28.0% of non-coding SNPs, and 49.8% of synonymous SNPs.

I computed SFSs using 100 bins of DAF. The SFSs are strongly skewed towards very low DAFs. To better visualize differences between WS, WS and SS/WW SFSs, I plot the Y axis in log scale:

This excess of ‘high-DAF’ variants reflect polarization errors, which are particularly abundant at CpG sites:

The shape of the SFS for non-CpG sites also shows an excess of high-DAF variants, due to polarization errors:

3. Evaluate methods by simulations

To evaluate Sylvain’s method, I performed simulations.

I take as input a neutral DAF spectrum, to which I apply gBGC (population-scaled gBGC coefficient B) on WS and SW mutations (see equation (4) from Glémin et al. 2015). For WW+SS mutation, I consider a neutral DAF spectrum:

DAF = (1:n)/(n+1)
Hws = 2 * (1-exp(-B*(1-DAF)))/(DAF*(1-DAF)*(1-exp(-B)))
Hsw = 2 * (1-exp(B*(1-DAF)))/(DAF*(1-DAF)*(1-exp(B)))
Hssww = 1/(1:n)

Here n corresponds to the number of bins of DAF (NbClassDAF).

To get more realistic SFS, corresponding to human demography, I used correction parameters for each frequency category, inferred from the DAF spectrum of WW/SS SNPs observed in human pops (see above).

I then normalize these SFS, taking into account the base composition of the human genome, and the relative rates of SW, WS or SS/WW mutations (rSW_WS, rWS_SSWW), to match with the observed proportions of WS, SW and WW/SS SNPs in human pops.

I then randomly draw the number of SNPs for each class of mutation and each bin of DAF.

Finally, based on these true SFS, I compute the observed SFS, accounting for polarization errors.

I then used three methods to estimate B based on the observed SFS:

Method0: slope of the regression [log(nbWS/nbSW)~DAF] computed on all SNPs
Method1: slope of the regression [log(nbWS/nbSW)~DAF] computed on SNPs with a DAF < 0.9 (to exclude the last bins of DAF, which are strongly affected by polarization errors)
gBGC_ELS: Sylvain’s least squares method

I performed these simulations for values of B ranging from B=0 to B=2. For each value of B, I perform 10 replicates (to explore the impact of sampling noise on the estimate of B). NB: I repeated 3 times this set of simulation for gBGC_ELS [this is just because this script was initially prepared to evaluate a former version of the method (weighted regression) that I had tested with 3 different levels of precision].

The accuracy of the method is expected to depend on the number of SNPs and on the number of bins of DAF. A higher number of bins may provide more information on the shape of the SFS, but may lead to more sampling noise, due to limited number of SNPs per DAF bin.

Simulation parameters

I first ran simulations with polarization error rates corresponding to CpG sites (i.e. high polarization error rate for SW mutations), and a GC-content corresponding to synonymous codon position (i.e. GC-rich sequences).

# Polarization error rates for WS and SW mutations:
eSW=0.04 # matches observed error rate at CpG sites in humans
eWS=0.01 # matches observed error rate at other sites
# Ratio of mutation rates: SW/WS  and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.65
# Range of B values
lB=c(0.001, 0.25, 0.5, 1, 2)
# Number of simulation replicates per set of parameters
nbReplicate=10

NbClassDAF=20

number of SNPs~3M

I first run simulations for a large dataset (~3 M SNPs), and estimate B using 20 bins of DAF. I show below the simulated SFSs for each value of B, and the corresponding plots [log(WS/SW) vs DAF] on which B estimates are based. To get an idea of the number of SNPs on which log(WS/SW) is computed, I report the median and minimal number of WS+SW SNPs per DAF bin.

I then plot the inferred B vs the true B:

For all methods, the inferred B is strongly correlated with the true B. But both Method0 and Method1 overestimate B, while gBGC_ELS provides an unbiased estimate of B.

number of SNPs~300k

With ~300k SNPs (which corresponds to the total number of synonymous SNPs in the human dataset), there is more dispersal in the estimate of B, but estimates remain very accurate.

number of SNPs~100k

With ~100k SNPs (e.g. if we wanted to estimate B on three subsets of synonymous SNPs), the median number of SNPs per bin of DAf is ~300. The estimates of B with gBGC_ELS is still very accurate.

number of SNPs~50k

With ~50k SNPs, the median number of SNPs per bin of DAf is ~150. gBGC_ELS estimates are slightly more noisy (but remains quite good and unbiased - contrarily to Method1).

number of SNPs~20k

With ~20k SNPs, the median number of SNPs per bin of DAf is ~70. gBGC_ELS estimates are slightly more noisy (but remains quite good and unbiased - contrarily to Method1).

NbClassDAF=100

If we increase the number of bins of DAF, this does not affect the estimates of B.

number of SNPs~3M

number of SNPs~300k

number of SNPs~100k

NbClassDAF=10

If we consider only 10 bins of DAF, gBGC_ELS provides estimates of B that are quite accurate and that remain unbiased even with limited SNP sample size.

number of SNPs~3M

number of SNPs~300k

number of SNPs~100k

number of SNPs~50k

number of SNPs~10k

Summary of simulations: step 1: CpG sites, GC_content=0.65

I provide below a summary of the accuracy of the different methods for this set of parameters (mimicking CpG sites at synonymous codon positions). I evaluated the accuracy according to two criteria:

Correlation between the estimated and the true B => measure to what extent the variation in the estimated B reflects true variation
Mean difference between the estimated and the true B => measure to what extent the estimates of B are biased

As expected, gBGC_ELS is sensitive to the number of SNPs : the lower the number of SNPs, the higher the level of noise in gBGC_ELS estimates. Interestingly, gBGC_ELS is more accurate than Method1.

For a high number of SNPs, changing the number of DAF bins (NbClassDAF) does not affect much the estimate of B. When the number of SNPs is more limited, NbClassDAF=20 seems to perform slightly better than 10 bins of DAF.

NB: when the number of SNPs is limited, gBGC_ELS generally cannot estimate B with NbClassDAF=100 because some DAF bins have 0 count (hence gBGC_ELS reports NA). However, if, by chance, all DAF bins have >0 count, then gBGC_ELS would report an estimate of B that can be very inaccurate (this sometimes appears in some replicates). This problem would not necessarily be detected by bootstraping (because only the bootstraps with non-0 bins would report a result). Maybe it would be useful to set a lower limit on the number of SNP per DAF bin (or to put a warning).

Both Method0 and Method1 provide B estimates that are strongly correlated to the true B (but not as well as gBGC_ELS). However, these estimates are very sensitive to variation in GC-content (see below for more details).

As expected, Method0 overestimates B, and this overestimate increases as the number of bins of DAFs decreases.

Method1 is less biased than Method0, but still leads to overestimates B.

gBGC_ELS is able to provide unbiased estimates of B, even when the number of SNPs per DAF bin is low.

Summary of simulations: step 2: non-CpG sites, GC_content=0.65

I then ran simulations with polarization error rates corresponding to non-CpG sites (i.e. with eSW=eWS=0.01), and a GC-content corresponding to synonymous codon position (i.e. GC-rich sequences).

# Polarization error rates for WS and SW mutations:
eSW=0.01 # matches observed error rate at non-CpG sites in humans
eWS=0.01 # matches observed error rate at non-CpG sites in humans

# Ratio of mutation rates: SW/WS  and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.65

In terms of measuring variation in B, the results are similar to previous simulations.

For non-CpG sites, polarization errors are weaker, and hence B estimates are less biased. In particular, Method1 performs relatively well. However, gBGC_ELS clearly outperforms Method1.

Summary of simulations: step 3: CpG sites (GC_content=0.4)

I then ran simulations with polarization error rates corresponding to CpG sites (i.e. with eSW=0.04 and eWS=0.01), and a GC-content corresponding to the genome-wide average (40% G+C).

# Polarization error rates for WS and SW mutations:
eSW=0.04 # matches observed error rate at CpG sites in humans
eWS=0.01 # matches observed error rate at other sites
# Ratio of mutation rates: SW/WS  and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.4

Again, in terms of measuring variation in B, the results are similar to previous simulations.

Again, gBGC_ELS is unbiased (contrarily to Method0 and Method1).

Summary of simulations: step 4: non-CpG sites (GC_content=0.4)

Finally, I ran simulations with polarization error rates corresponding to non-CpG sites (i.e. with eSW=eWS=0.01), and a GC-content corresponding to the genome-wide average (40% G+C).

# Polarization error rates for WS and SW mutations:
eSW=0.01 # matches observed error rate at non-CpG sites in humans
eWS=0.01 # matches observed error rate at non-CpG sites in humans

# Ratio of mutation rates: SW/WS  and WS/(SS+WW)
rSW_WS=2
rWS_SSWW=2
# GC-content (match synonymous sites)
GC_content=0.4

Performances are unchanged in terms of accuracy.

But for this set of parameters, all methods are unbiased.

Impact of GC-content

The impact of polarization errors on the SFS (and hence on B estimates) depends on the difference between the number of WS vs SW mispolarized SNPs. It therefore depends on eWS, eSW, rSW_WS and on the GC-content of sequences. Given that the GC-content varies widely along the genome, this might induce variation in the estimate of B.

To assess how the different methods are affected by GC-content, I performed simulation for sequences of different GC-content (from 0.3 to 0.8, to mimick variation in GC3 at synonymous codons positions), with 1.5M, 150k or 50k SNPs. I set NbClassDAF=20.

I considered a genome composed of 5 compartments (each corresponding to 20% of of the genome) with B=0.001, B=0.25, B=0.5, B=1, B=2. I computed the difference between the true B and the inferred B in each compartment, and then measured the mean of these differences.

I first considered a situation where eSW=eWS=0.01 and where rSW_WS=1 (i.e. no mutational bias, same error rate for WS and SW mutations):

Method0 and Method1 tend to be biased when the GC-content of sequences is far from the mutational equilibrium (50%) => this induces an artefactual correlation between B and GC-content.

gBGC_ELS is able to provide unbiased estimates, whatever the GC-content.

I then considered a more realistic case, with rWS_SSWW=2, and with error rates corresponding to that at CpG sites (eWS=0.01, eSW=0.04).

Both Method0 and Method1 overestimate B, and this bias increases with sequence GC-content => this induces an artefactual correlation between B and GC-content.

gBGC_ELS provides unbiased estimates of B for a wide range of GC-content.

Finally, I considered a case with rWS_SSWW=2, and with error rates corresponding to that at non-CpG sites (eWS=0.01, eSW=0.01).

Again, with a large number of SNPs, gBGC_ELS provides unbiased estimates, whatever the GC-content.

4. Estimate B on empirical data

Genome-wide B, NbClassDAF=20

I first estimated the genome-wide B, using either the entire dataset of non-coding SNPs, or CpG and non-CpG SNPs separately. I set the number of DAF bins (NbClassDAF) to 20.

NB: it is possible to estimate the uncertainty around gBGC_ELS estimates by bootstrapping the SNP dataset. But I have not yet implemented this in my script. So here I only provide one single estimate per SNP dataset.

I show below the observed SFSs and, and the corresponding plots [log(WS/SW) vs DAF] from which B is inferred:

NbClassDAF=20

All non-coding SNPs

Non-CpG SNPs

CpG SNPs

Genome-wide B, NbClassDAF=100

I repeat the analysis with NbClassDAF=100.

NbClassDAF=100

All non-coding SNPs

Non-CpG SNPs

CpG SNPs

Comparison with Glémin et al. 2015 (Table 1)

The table below provides estimates of B obtained by maximum likelihood (M1e = M1* = model M1 with polarization errors) or by Method1 or gBGC_ELS, for NbClassDAF=20.

For gBGC_ELS, I performed 30 bootstrap replicates. I show the mean (and the standard deviation of the mean) for the estimate of B (gBGC_ELS). I also show the mean estimates of the mutation bias (mutational equilibrium GC), and of the two polarization error rates (e1=rate of WS mutations mispolarized, e2=SW).

Sites	Method1	M1e	gBGC_ELS	GCeqMut	e1	e2
All	0.42 ± 0.01	0.38	0.394 ± 0.002	0.407	0.0050	0.0136
Non-CpG	0.27 ± 0.01	0.33	0.339 ± 0.003	0.425	0.0067	0.0049
CpG	0.74 ± 0.02	0.50	0.536 ± 0.006	0.364	0.0044	0.0344

Same table for NbClassDAF=100:

Sites	Method1	M1e	gBGC_ELS	GCeqMut	e1	e2
All	0.42 ± 0.00	0.38	0.365 ± 0.002	0.410	0.0032	0.0124
Non-CpG	0.26 ± 0.01	0.33	0.310 ± 0.002	0.428	0.0044	0.0033
CpG	0.76 ± 0.01	0.50	0.491 ± 0.004	0.369	0.0037	0.0341

Overall, genome-wide estimates of B given by method gBGC_ELS are pretty close to ML estimates, both at non-CpG and at CpG sites (which are the most affected by polarization errors). Increasing NbClassDAF from 20 to 100 does not change much the match between ML and method gBGC_ELS estimates.

Estimate B per genomic window

#rec="hapmap"
rec="bherer"

Impact of the number of SNPs on the estimate of B on CpG sites

We are often interested in measuring variation in B along the genome. For this, we consider groups of SNPs (e.g. binned according to recombination rate or by windows along the genome), and we estimate B on each group. For example, in Glémin et al. (2015), we analyzed B in 1Mb-long windows along the genome (~2600 windows).

The above simulations indicate that gBGC_ELS is biased when the number of SNPs is limited. Hence, the accuracy of the method is expectd to depend on the size of genomic windows on which we infer B.

To evaluate this point, I divided the SNP dataset into 1, 10, 50, 100, 200, 500 or 2000 subsets, and inferred B on each subset. I then measured the genome-wide B by taking the average over all subsets.

=> gBGC_ELS appear to be robust: whatever the window size, the average estimate remains unchanged (contrarily to what we had with the weighted regression method).

Bins of recombination rate

NbClassDAF=20
nbinsNC=500

Here I classified non-coding SNPs into 500 bins, according to their recombination rate. I computed B on each bin, and I plotted B vs recombination rates. For a comparison, I classified synonymous SNPs into three bins of recombination rate.

NB: recombination data from a high-resolution pedigree-based recombination map: Bhérer C, Campbell CL and Auton A 2017 Refined genetic maps reveal sexual dimorphism in human meiotic recombination at multiple scales. Nature Communications 8: 14994. https://doi.org/10.1038/ncomms14994

nbinsNC=50

With 50 bins:

As expected, I observed strong correlations between estimates of B and recombination rates. Note however that the correlation is not linear (here I plotted the Log of recombination rate). The reason for this non-linear relationship is unclear (but might be due to the heterogeneity caused by recombination hotspots).

Bins of contiguous SNPs: CpG vs non-CpG

In their Fig. 5, Glémin et al (2015) compared the values of B at CpG or non-CpG sites, estimated with model M1 (i.e. without accounting for polarization errors, Fig5B) or M1* (i.e. accounting for polarization errors, Fig5D). With M1*, they observed that for a same recombination rate, CpG and non-CpG sites have similar values of B (contrarily to model M1, where B is over-estimated for CpG sites):

Fig. 5BD, Glémin et al (2015)

I checked whether we could retrieve this result with gBGC_ELS. For this, I classified non-coding SNPs into 50 or 500 bins of contiguous SNPs. I computed B on each bin, and I plotted B vs recombination rates.

nbinsNC=50

nbinsNC=50

=> Method1 clearly overestimate B at CpG sites (relative to the ML estimate). gBGC_ELS gives results that are closer to ML estimates.

nbinsNC=500

nbinsNC=500

=> gBGC_ELS is clearly less biased than Method1: even with nbinsNC=500, it provides estimates that are pretty similar to those obtained by ML.

5. Conclusion

gBGC_ELS is able to provide unbiased estimates of B, both for non-CpG and CpG sites, even when the number of SNPs is quite limited. It is robust to variation in GC-content.

=> very nice improvement compared to the previous version (weighted regression)!!

Tue Aug 9 14:00:48 2022

Estimating B from site frequency spectra

1. Introduction: gBGC_ELS

Method evaluation

2. Human polymorphism dataset

3. Evaluate methods by simulations

Simulation parameters

NbClassDAF=20

number of SNPs~3M

number of SNPs~300k

number of SNPs~100k

number of SNPs~50k

number of SNPs~20k

NbClassDAF=100

number of SNPs~3M

number of SNPs~300k

number of SNPs~100k

NbClassDAF=10

number of SNPs~3M

number of SNPs~300k

number of SNPs~100k

number of SNPs~50k

number of SNPs~10k

Summary of simulations: step 1: CpG sites, GC_content=0.65

Summary of simulations: step 2: non-CpG sites, GC_content=0.65

Summary of simulations: step 3: CpG sites (GC_content=0.4)

Summary of simulations: step 4: non-CpG sites (GC_content=0.4)

Impact of GC-content

4. Estimate B on empirical data

Genome-wide B, NbClassDAF=20

All non-coding SNPs

Non-CpG SNPs

CpG SNPs

Genome-wide B, NbClassDAF=100

All non-coding SNPs

Non-CpG SNPs

CpG SNPs

Comparison with Glémin et al. 2015 (Table 1)

Estimate B per genomic window

Impact of the number of SNPs on the estimate of B on CpG sites

Bins of recombination rate

Bins of contiguous SNPs: CpG vs non-CpG

nbinsNC=50

nbinsNC=500

5. Conclusion