Thu May 26 08:28:23 2022

Introduction

I plan to use the program est-sfs (Keightley & Jackson 2018) to polarize human SNPs from the 1000 Genome Project (1KG). Given that this program is quite cpu intensive, I first conducted some tests on a subset of SNPs from human chromosome 22, using different combinations of parameters, to try and find the best tradeoff between computation time and quality of inferrence.

Input datasets: SNPs and outgroup sequences, CpG classification

I prepared a dataset corresponding to human chromosome 22 (51 Mb).

For each position in this chromosome, I recovered SNP polymorphism data from 1KG (phase3, 2504 individuals): http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

I retrieved the corresponding sequence in chimp (P) and gorilla (G) from the EPO Compara 6 primate block (NB: the information of EPO alignments was extracted from the VCF of the Altai-Neandertal: http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/). About 35 Mb of chromosome 22 (67.9% of the chromosome) are included in the EPO alignment. At this step, sites that are reported in the EPO data as having in-paralogs in any of the 3 species were excluded (~ 1.8 Mb).

CpG sites have an ~10-fold higher mutation rate than non-CpG sites in humans. Hence, est-sfs has to be run separately for CpG and non-CpG sites.

Sites were classified as ‘CpG’ if any of the human alleles (REF, ALT), or chimp or gorilla sequence was in a CpG context. Otherwise they were classified as ‘non-CpG’.

est-sfs was run on all CpG sites from chr22 (1.5 Mb), and on a subset of non-CpG sites (3 Mb, 9% of all non-CpG sites).

Parameters affecting the speed of est-sfs

The speed of est-sfs depends on the model used:

  • Jukes-Cantor (JC)
  • Kimura 2 parameters (kimura)
  • Rate-6

The speed of est-sfs also depends on the number of iterations of the ML search algorithm (parameter nrandom). In the case of the Rate-6 model, it is recommended that at least 10 random starting value runs are carried out to ensure convergence. For model JC and kimura, I used respectively nrandom=2 and nrandom=5 (but not sure if it is useful to have nrandom>1 for these two models).

The speed of est-sfs also depends on the number of outgroups used (one or two).

Finally, the speed of est-sfs also depends on the number of individuals that are used to build the SFS from which the ancestral allelic state is inferred. The 1KG dataset was obtained from 2504 individuals (5008 chromosomes = 1KG_full dataset). To speed up the process, for each SNP, I randomly subsampled 199 chromosomes from the 1KG_full dataset (= 1KG_sampled199 dataset). Based on this subset, I inferred the ancestral allelic state with est-sfs. Then, once the ancestral/derived states are inferred, I compute the derived allele frequency (DAF) in the full 1KG dataset (5008 chromosomes).

Results

I ran all combination of parameters (3 models, 1 or 2 outgroups) on CpG and non-CpG sites, using either the 1KG_full or the 1KG_sampled199 datasets.

est-sfs provides two output files:

For each site, I draw the ancestral state randomly, based on the probability of the major allele to be ancestral. Then, knowing which allele is ancestral or derived, I classified SNPs as WS, SW, or WW/SS, and computed the frequency of the derived allele in the full 1KG dataset (5008 chromosomes). For a comparison, I also considered the ancestral state inferred by parsimony, based on the EPO alignment (data provided by 1KG).

The plots below show the SFSs of each SNP category (WS, SW, or WW/SS), obtained with different methods/set of parameters. SFSs are computed with 50 bins of DAF.

1. CpG sites: SFS per SNP category (WS, SW, or WW/SS)

1.1 CpG SNPs polarized by parsimony (1KG ancestral allele)

When CpG SNPs polarized by parsimony, the inferred SFS shows a typical signature of polarization errors, with a strong excess of SNPs at very high DAF (6.1% of SNPs in the last DAF bin):

NB: in these plots, the dashed line correspond to the SFS obtained after exclusion of SNPs for which the chimp or gorilla outgroup sequence is missing. When SNPs are polarized by parsimony, at least one outgroup is required. But when using est-sfs, SNPs can be polarized even when the two outgroups are missing.

1.2 CpG : est-sfs (model=JC)

In agreement with Keightley & Jackson (2018, Fig 6B) we observed that using est-sfs with the JC model does not improve much the quality of ancestral states inferrences at CpG sites: there is still an excess of SNPs at very high DAF (1.5% to 5.2% of SNPs in the last DAF bin). Using one or two outgroups does not change much the results.

Using the subsample (1KG_sampled199, 1.2.1) or the entire dataset (1KG_full, 1.2.2), does not change much the results. But using the entire dataset strongly increases the computation time.

1.2.1 1KG_sampled199

  • 1 outgroup: 5’ (2 replicates)
  • 2 outgroups: 45’ (2 replicates)

1.2.2 1KG_full

  • 1 outgroup: 1h (2 replicates)
  • 2 outgroups: 5h20 (2 replicates)

1.3 CpG : est-sfs (model=kimura)

With the kimura model, est-sfs seems to correctly get rid of polarization errors (~0.3% of SNPs in the last DAF bin). Using one or two outgroups or the full dataset does not change much the results (at least visually):

1.3.1 1KG_sampled199

  • 1 outgroup: 50’ (5 replicates)
  • 2 outgroups: 1h20 (5 replicates)

1.3.2 1KG_full

  • 1 outgroup: 2h40 (5 replicates)
  • 2 outgroups: 17h (5 replicates)

1.4 CpG : est-sfs (model=rate6)

Results obtained with the rate6 model are similar to those obtained the kimura model (at least visually). I could not infer ancestral states on the full dataset with two outgroups because est-sfs takes too much time (not finished after 5 days).

1.4.1 1KG_sampled199

  • 1 outgroup: 3h40 (10 replicates)
  • 2 outgroups: 10h40 (10 replicates)

1.4.2 1KG_full

  • 1 outgroup: 43h (10 replicates)
  • 2 outgroups: >72h (10 replicates) -> not finished

2. non-CpG sites: SFS per SNP category (WS, SW, or WW/SS)

The issue of polarization errors is less important at non-CpG sites compared to CpG sites (compare 2.1 to 1.1). There is however still an excess of SNPs at very high DAF (1.2% of SNPs in the last DAF bin).

With the subsampling strategy (1KG_sampled199), est-sfs reduces the fraction of high DAF SNPs (~0.5-0.6% of SNPs in the last DAF bin), whatever the model used (JC, kimura or rate6).

But curiously, with the entire dataset (1KG_full), the fraction of high DAF SNPs raises to ~1% (close to what is obtained when SNPs are polarized by parsimony).

2.1 Non-CpG SNPs polarized by parsimony (1KG ancestral allele)

2.2 Non-CpG : est-sfs (model=JC)

2.2.1 1KG_sampled199

  • 1 outgroup: 15’ (2 replicates)
  • 2 outgroups: 2h30 (2 replicates)

2.2.2 1KG_full

  • 1 outgroup: 9’ (2 replicates)
  • 2 outgroups: 55’ (2 replicates)

2.3 Non-CpG : est-sfs (model=kimura)

2.3.1 1KG_sampled199

  • 1 outgroup: 25’ (5 replicates)
  • 2 outgroups: 4h10 (5 replicates)

2.3.2 1KG_full

  • 1 outgroup: 3h (5 replicates)
  • 2 outgroups: 62h (5 replicates)

2.4 Non-CpG : est-sfs (model=rate6)

2.4.1 1KG_sampled199

  • 1 outgroup: 54’ (10 replicates)
  • 2 outgroups: 12h20 (10 replicates)

2.4.2 1KG_full

  • 1 outgroup: 20h (10 replicates)
  • 2 outgroups: >96h (10 replicates) => not finished

3. Summary

For each dataset (1KG_sampled199 or 1KG_full, CpG or non-CpG) and outgroup setting (1 or 2 outgroups), I compare the ML obtained with different methods (JC, kimura, rate6). I also estimated the computing time (in hours) that would be required to run est-sfs on the entire genome (on 1 cpu).

CpG sites, 1KG_sampled199

outgroup config delta_ML Rank TimeGenome_1cpu
1outgroup JC -74297 3 8 h
1outgroup kimura -8877 2 36 h
1outgroup rate6 0 1 367 h
outgroup config delta_ML Rank TimeGenome_1cpu
2outgroups JC -97807 3 75 h
2outgroups kimura 0 1 131 h
2outgroups rate6 -23465 2 1094 h

noCpG sites, 1KG_sampled199

outgroup config delta_ML Rank TimeGenome_1cpu
1outgroup JC -5704 2 255 h
1outgroup kimura 0 1 408 h
1outgroup rate6 -265410 3 951 h
outgroup config delta_ML Rank TimeGenome_1cpu
2outgroups JC -6623 3 2616 h
2outgroups kimura -180 2 4280 h
2outgroups rate6 0 1 12518 h

CpG sites, 1KG_full

outgroup config delta_ML Rank TimeGenome_1cpu
1outgroup JC -64028 2 100 h
1outgroup kimura 0 1 284 h
1outgroup rate6 -155956 3 4438 h
outgroup config delta_ML Rank TimeGenome_1cpu
2outgroups JC -99763 2 538 h
2outgroups kimura 0 1 1709 h
2outgroups rate6 NA NA >12229 h

noCpG sites, 1KG_full

outgroup config delta_ML Rank TimeGenome_1cpu
1outgroup JC -5935 3 1155 h
1outgroup kimura -385 2 3142 h
1outgroup rate6 0 1 20569 h
outgroup config delta_ML Rank TimeGenome_1cpu
2outgroups JC -6349 2 13146 h
2outgroups kimura 0 1 63948 h
2outgroups rate6 NA NA >146750 h

Conclusion

For a given dataset (1KG_sampled199 or 1KG_full), in terms of ML, the kimura model always performs better than JC. The rate6 model is sometimes better than kimura, but not systematically, and it is even sometimes worse than JC, possibly because of convergence issues. It should be possible to improve results of the rate6 model by increasing the number of replicates (I used nrandom=10, as recommended), but this would further increase computation time. Thus, the kimura model appears to be a good tradeoff between computation time and inferrence quality.

The subsampling strategy (1KG_sampled199) strongly reduces the computation time, and gives results that seem at least as good as with the entire dataset. On non-CpG sites, it even seems that the subsampling strategy is more efficient to get rid of polarization errors (we observe a lower fraction of high DAF SNPs in the inferred SFSs). I do not understand the reasons for this effect.

The fact of using 2 outgroups instead of only 1, strongly increases computation time (x10 on non-CpG sites). Based on visual inspection of SFSs, it is difficult to say whether results are improved by using 2 outgroups. But the analyses by Keightley & Jackson (2018) suggest that it is the case.

=> the good option seems to use the subsampling strategy with kimura model, and 2 outgroups.

By extrapolation, we can estimate that this strategy would require ~4400h of cpu (i.e. less than 2 days if we run it in parallel on 100 cpus).

Question: for the kimura model, I used nrandom=5, but it seems that all iterations give almost identical results => could I use simply nrandom=1 to save computation time (divided by 5) ?

Thu May 26 08:35:24 2022