Using SNPknock with Genetic Data

Matteo Sesia (msesia@stanford.edu)

2018-05-30

This vignette illustrates the usage of the SNPknock package in combination with the imputation software fastPHASE to create knockoff copies of unphased genotypes or phased haplotypes (Sesia, Sabatti, and Candès 2017). Since fastPHASE is not available as an R package, this particular functionality of SNPknock requires the user to first obtain a copy of fastPHASE.

Obtaining fastPHASE

fastPHASE is a program whose purpose is to estimate missing genotypes and unobserved haplotypes. Its underlying algorithm is based on the hidden Markov model described in (Scheet and Stephens 2006).

Binary executables for Linux and Mac OS are available from http://scheet.org/software.html.

Before continuing with this tutorial, download and extract the fastPHASE tarball from the above link and move the fastPHASE executable file into a convenient directory (e.g. “~/bin/”).

Knockoffs for unphased genotypes

Fitting the hidden Markov model on genotype data

A small synthetic dataset of 1454 unphased genotype SNPs from 100 individuals can be found in the package installation directory. We can load it with:

library(SNPknock)
X_file = system.file("extdata", "genotypes.RData", package = "SNPknock")
load(X_file)
table(X)
## X
##     0     1     2 
## 77522 52963 14915

Below, we show how to fit a hidden Markov model to this data, with the help of fastPHASE. Since fastPHASE takes as input genotype sequences in “.inp” format, we must first convert the X matrix by calling SNPknock.fp.writeX. By default, this function will write onto a temporary file in the R temporary directory.

# Convert X into the suitable fastPHASE input format, write it into a temporary file
# and return the path to that file.
Xinp_file = SNPknock.fp.writeX(X)

Assuming that we have already downloaded fastPHASE, we can call it to fit the hidden Markov model to X.

fp_path  = "~/bin/fastPHASE" # Path to the fastPHASE executable
# Call fastPHASE and return the path to the parameter estimate files
fp_outPath = SNPknock.fp.runFastPhase(fp_path, Xinp_file, K = 12, numit = 15)

Above, the SNPknock package could not find fastPHASE because we did not provide the correct path (we cannot include third-party executable files within this package). However, if you install fastPHASE separately and provide SNPknock with the correct path, this will work.

If the previous step worked for you, you can find the parameter estimates produced by fastPHASE in the following files:

r_file = paste(fp_outPath, "_rhat.txt", sep="")
alpha_file = paste(fp_outPath, "_alphahat.txt", sep="")
theta_file = paste(fp_outPath, "_thetahat.txt", sep="")
char_file  = paste(fp_outPath, "_origchars", sep="")

Otherwise, for the sake of this tutorial, you can use the example parameter files provided in the package installation directory:

r_file = system.file("extdata", "genotypes_rhat.txt", package = "SNPknock")
alpha_file = system.file("extdata", "genotypes_alphahat.txt", package = "SNPknock")
theta_file = system.file("extdata", "genotypes_thetahat.txt", package = "SNPknock")
char_file  = system.file("extdata", "genotypes_origchars", package = "SNPknock")

Then, we can construct the hidden Markov model with:

hmm = SNPknock.fp.loadFit(r_file, alpha_file, theta_file, char_file)

Generating knockoff genotypes

Finally, we can use the hidden Markov model created above to generate knockoffs.

Xk = SNPknock.knockoffGenotypes(X, hmm$r, hmm$alpha, hmm$theta)
table(Xk)
## Xk
##     0     1     2 
## 77226 53625 14549

Knockoffs for phased haplotypes

Fitting the hidden Markov model on haplotype data

A small synthetic dataset of 1454 phased haplotype SNPs from 100 individuals can be found in the package installation directory. We can load it with:

library(SNPknock)
H_file = system.file("extdata", "haplotypes.RData", package = "SNPknock")
load(H_file)
table(H)
## H
##      0      1 
## 208007  82793

Below, we show how to fit a hidden Markov model to this data, with the help of fastPHASE. Since fastPHASE takes as input haplotype sequences in “.inp” format, we must first convert the H matrix by calling SNPknock.fp.writeX. By default, this function will write onto a temporary file in the R temporary directory.

# Convert X into the suitable fastPHASE input format, write it into a temporary file
# and return the path to that file.
Hinp_file = SNPknock.fp.writeX(H, phased = TRUE)

Assuming that we have already downloaded fastPHASE, we can call it to fit the hidden Markov model to X.

fp_path  = "~/bin/fastPHASE" # Path to the fastPHASE executable
# Call fastPHASE and return the path to the parameter estimate files
fp_outPath = SNPknock.fp.runFastPhase(fp_path, Hinp_file, K = 12, numit = 15, phased = TRUE)

Above, the SNPknock package could not find fastPHASE because we did not provide the correct path (we cannot include third-party executable files within this package). However, if you install fastPHASE separately and provide SNPknock with the correct path, this will work.

If the previous step worked for you, you can find the parameter estimates produced by fastPHASE in the following files:

r_file = paste(fp_outPath, "_rhat.txt", sep="")
alpha_file = paste(fp_outPath, "_alphahat.txt", sep="")
theta_file = paste(fp_outPath, "_thetahat.txt", sep="")
char_file  = paste(fp_outPath, "_origchars", sep="")

Otherwise, for the sake of this tutorial, you can use the example parameter files provided in the package installation directory:

r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock")
alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock")
theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock")
char_file  = system.file("extdata", "haplotypes_origchars", package = "SNPknock")

Then, we can construct the hidden Markov model with:

hmm = SNPknock.fp.loadFit(r_file, alpha_file, theta_file, char_file)

Generating knockoff haplotypes

Finally, we can use the hidden Markov model created above to generate knockoffs.

Hk = SNPknock.knockoffHaplotypes(H, hmm$r, hmm$alpha, hmm$theta)
table(Hk)
## Hk
##      0      1 
## 207906  82894

See also

If you want to see some basic usage of SNPknock, see the introductory vignette.

Scheet, Paul, and Matthew Stephens. 2006. “A Fast and Flexible Statistical Model for Large-Scale Population Genotype Data: Applications to Inferring Missing Genotypes and Haplotypic Phase.” Am J Hum Genet 78 (4). The American Society of Human Genetics: 629–44. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1424677/.

Sesia, M., C. Sabatti, and E. J. Candès. 2017. “Gene Hunting with Knockoffs for Hidden Markov Models.” ArXiv E-Prints, June. https://arxiv.org/abs/1706.04677.