```
# this vignette is not created if snpStats is not installed
if (!require("snpStats")) {
knitr::opts_chunk$set(eval = FALSE)
}
```

`## Loading required package: snpStats`

`## Loading required package: survival`

`## Loading required package: Matrix`

```
##
## Attaching package: 'Matrix'
```

```
## The following object is masked from 'package:S4Vectors':
##
## expand
```

In this vignette we demonstrate the use of `snpClust`

function in the `adjclust`

package. `snpClust`

performs adjacency-constrained hierarchical clustering of single nucleotide polymorphisms (SNPs), where the similarity between SNPs is defined by linkage disequilibrium (LD).

This function implements the algorithm described in the third chapter of [2]. It is an extension of the algorithm described in [3]. Denoting by \(p\) the number of SNPs to cluster and assuming that the similarity between SNPs whose indices are more distant than \(h\), its time complexity is \(O(p (\log(p) + h))\), and its space complexity is \(O(hp)\).

`library("adjclust")`

The begining of this vignette closely follows the “LD vignette” of the SnpStats package [1]. First, we load genotype data.

```
library("matrixStats")
data("ld.example", package = "snpStats")
```

We focus on the `ceph.1mb`

data.

```
geno <- ceph.1mb[, -316] ## drop one SNP leading to one missing LD value
p <- ncol(geno)
nSamples <- nrow(geno)
geno
```

```
## A SnpMatrix with 90 rows and 602 columns
## Row names: NA06985 ... NA12892
## Col names: rs5993821 ... rs5747302
```

These data are drawn from the International HapMap Project and concern 602 SNPs^{1} over a 1Mb region of chromosome 22 in sample of 90 Europeans.

We can compute and display the LD between these SNPs.

```
ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = p-1)
image(ld.ceph, lwd = 0)
```

The `snpClust`

function can handle genotype data as an input:

`fit <- snpClust(geno, stats = "R.squared")`

```
## Warning in run.snpClust(x, h = h, stats = stats): Forcing the LD similarity
## to be smaller than or equal to 1
```

`## Note: 135 merges with non increasing heights.`

Note that due to numerical errors in the LD estimation, some of the estimated LD values may be slightly larger than 1. These values are rounded to 1 internally.

The above figure suggests that the LD signal is concentrated close to the diagonal. We can focus on a diagonal band with the bandwidth parameter `h`

:

`fitH <- snpClust(geno, h = 100, stats = "R.squared")`

```
## Warning in run.snpClust(x, h = h, stats = stats): Forcing the LD similarity
## to be smaller than or equal to 1
```

`## Note: 133 merges with non increasing heights.`

`fitH`

```
##
## Call:
## run.adjclust(mat = mat, type = type, h = h)
##
## Cluster method : snpClust
## Number of objects: 602
```

The output of the `snpClust`

is of class `chac`

. In particular, it can be plotted as a dendrogram silently relying on the function `plot.dendrogram`

:

`plot(fitH, type = "rectangle", leaflab = "perpendicular")`

```
## Warning in plot.chac(fitH, type = "rectangle", leaflab = "perpendicular"):
## Detected reversals in dendrogram: mode = 'corrected', 'within-disp' or 'total-disp' might be more relevant.
```

Moreover, the output contains an element named `merge`

which describes the successive merges of the clustering, and an element `gains`

which gives the improvement in the criterion optimized by the clustering at each successive merge.

`head(cbind(fitH$merge, fitH$gains))`

```
## [,1] [,2]
## [1,] -1 -2
## [2,] -255 -256
## [3,] -488 -489
## [4,] -487 3
## [5,] -486 4
## [6,] -234 -235
```

In this section we show how the `snpClust`

function can also be applied directly to LD values.

```
h <- 100
ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = h, symmetric = TRUE)
image(ld.ceph, lwd = 0)
```

Note that we have forced the `snpStats::ld`

function to return a symmetric matrix. We can apply `snpClust`

directly to this LD matrix (of class `Matrix::dsCMatrix`

):

`fitL <- snpClust(ld.ceph, h)`

`## Note: forcing the diagonal of the LD similarity matrix to be 1`

```
## Warning in run.snpClust(x, h = h, stats = stats): Forcing the LD similarity
## to be smaller than or equal to 1
```

`## Note: 133 merges with non increasing heights.`

`snpClust`

also handles inputs of class `base::matrix`

:

```
gmat <- as(geno, "matrix")
fitM <- snpClust(geno, h, stats = "R.squared")
```

`## Note: 133 merges with non increasing heights.`

[1] Clayton D. (2015). snpStats: SnpMatrix and XSnpMatrix classes and methods. R package version 1.20.0

[2] Dehman A. (2015). Spatial clustering of linkage disequilibrium blocks for genome-wide association studies. Phd Thesis, Université Paris Saclay.

[3] Dehman A., Ambroise C., Neuvial P. (2015). Performance of a blockwise approach in variable selection using linkage disequilibrium information. *BMC Bioinformatics*, **16**, 148.

`sessionInfo()`

```
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] matrixStats_0.54.0 snpStats_1.36.0 Matrix_1.2-18
## [4] survival_3.1-12 adjclust_0.5.99 HiTC_1.30.0
## [7] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0 IRanges_2.20.0
## [10] S4Vectors_0.24.0 BiocGenerics_0.32.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4.6 compiler_3.6.3
## [3] RColorBrewer_1.1-2 XVector_0.26.0
## [5] bitops_1.0-6 tools_3.6.3
## [7] zlibbioc_1.32.0 digest_0.6.20
## [9] evaluate_0.14 lattice_0.20-41
## [11] DelayedArray_0.12.0 yaml_2.2.0
## [13] xfun_0.13 GenomeInfoDbData_1.2.2
## [15] rtracklayer_1.46.0 stringr_1.4.0
## [17] knitr_1.23 Biostrings_2.54.0
## [19] grid_3.6.3 Biobase_2.46.0
## [21] XML_3.98-1.20 BiocParallel_1.20.0
## [23] rmarkdown_1.14 magrittr_1.5
## [25] splines_3.6.3 Rsamtools_2.2.1
## [27] htmltools_0.3.6 GenomicAlignments_1.22.1
## [29] MASS_7.3-51.6 SummarizedExperiment_1.16.0
## [31] capushe_1.1.1 stringi_1.4.3
## [33] RCurl_1.95-4.12
```

We have dropped SNP rs2401075 because it produced a missing value due to the lack of genetic diversity in the considered sample.↩