The R package {bigstatsr} provides functions for fast statistical
analysis of large-scale data encoded as matrices. The package can handle
matrices that are too large to fit in memory by memory-mapping them to
binary files on disk. This is very similar to the `big.matrix` format
provided by the R package {bigmemory}, which is **no longer used** by this
package (see the corresponding vignette). As inputs, {bigstatsr} uses
Filebacked Big Matrices (FBMs).
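To make the idea of a file-backed matrix concrete, here is a minimal sketch that converts a tiny in-memory matrix to an FBM and reads part of it back. The toy data and the use of a temporary backing file are assumptions made for illustration only.

```r
library(bigstatsr)

# Tiny in-memory matrix, converted to an FBM backed by a temporary file
x <- matrix(as.numeric(1:20), nrow = 4, ncol = 5)
X <- as_FBM(x, backingfile = tempfile())

dim(X)                      # dimensions are stored with the object
file.exists(X$backingfile)  # the ".bk" binary file that holds the data
X[1:2, 1:3]                 # reads only the requested block into memory

# Writes go to the memory-mapped file, not to an in-memory copy
X[, 1] <- 0

# X[] loads the whole matrix in memory; acceptable here because it is tiny
fbm_col_sums <- colSums(X[])
```

With real data, you would typically avoid `X[]` and access the matrix by blocks of columns instead.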

**Note that most of the algorithms of this package don’t handle
missing values.**
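Because of this, it can help to check an FBM for missing values before running an algorithm on it. The following is one possible sketch that counts `NA`s per column, block by block, with `big_apply()`; the toy matrix and the block size of 2 are arbitrary choices for illustration.

```r
library(bigstatsr)

# Toy matrix with a few missing values (illustrative data only)
x <- matrix(rnorm(50), nrow = 10, ncol = 5)
x[3, 2] <- NA
x[7, 2] <- NA
x[1, 5] <- NA
X <- as_FBM(x, backingfile = tempfile())

# Count missing values per column, reading only 2 columns at a time
na_per_col <- big_apply(X, a.FUN = function(X, ind) {
  colSums(is.na(X[, ind, drop = FALSE]))
}, a.combine = 'c', block.size = 2)

na_per_col              # one count per column
which(na_per_col > 0)   # columns that would need imputation or removal
```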

```
# For the CRAN version
install.packages("bigstatsr")
# For the latest version
remotes::install_github("privefl/bigstatsr")
```

```
library(bigstatsr)

# Create the data on disk
X <- FBM(5e3, 10e3, backingfile = "test")$save()
# If you open a new session you can do
X <- big_attach("test.rds")

# Fill it by chunks with random values
U <- matrix(0, nrow(X), 5); U[] <- rnorm(length(U))
V <- matrix(0, ncol(X), 5); V[] <- rnorm(length(V))
NCORES <- nb_cores()
# X = U V^T + E
big_apply(X, a.FUN = function(X, ind, U, V) {
  X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind))
  NULL  ## you don't want to return anything here
}, a.combine = 'c', ncores = NCORES, U = U, V = V)
# Check some values
X[1:5, 1:5]

# Compute first 10 PCs
obj.svd <- big_randomSVD(X, fun.scaling = big_scale(),
                         k = 10, ncores = NCORES)
plot(obj.svd)
# Cleanup
unlink(paste0("test", c(".bk", ".rds")))
```
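As a follow-up to the example above, the sketch below shows one way to inspect the result of `big_randomSVD()` on a much smaller matrix, and compares the singular values with base R's `svd()` on the centered and scaled data. The toy dimensions, the seed, and `k = 3` are assumptions for illustration.

```r
library(bigstatsr)

set.seed(1)
x <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)
X <- as_FBM(x, backingfile = tempfile())

# Partial SVD of the centered and scaled matrix (first 3 components)
obj.svd <- big_randomSVD(X, fun.scaling = big_scale(), k = 3)

# Singular values from {bigstatsr} and from base R, for comparison
d_fbm  <- obj.svd$d
d_base <- svd(scale(x), nu = 0, nv = 0)$d[1:3]
rbind(d_fbm, d_base)

# PC scores, i.e. U %*% diag(d), one row per sample
scores <- predict(obj.svd)
dim(scores)
```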

Learn more with this introduction to package {bigstatsr}.

If you want to use Rcpp code, look at this tutorial.

Package {bigstatsr} uses package {foreach} for its parallelization tasks. Learn more on parallelism with {foreach} with this tutorial.
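As a rough illustration of this pattern, the sketch below uses {foreach} directly to compute column sums of an FBM on two workers. It assumes that package {doParallel} is installed to provide the parallel backend; the toy matrix and the split into two column blocks are arbitrary.

```r
library(bigstatsr)
library(foreach)

x <- matrix(as.numeric(1:40), nrow = 4, ncol = 10)
X <- as_FBM(x, backingfile = tempfile())

# Register a small cluster as the {foreach} backend (assumes {doParallel})
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

# Each worker loads {bigstatsr}, accesses X through its backing file,
# and reads only its own block of columns
col_sums <- foreach(ind = list(1:5, 6:10), .combine = 'c',
                    .packages = "bigstatsr") %dopar% {
  colSums(X[, ind, drop = FALSE])
}

parallel::stopCluster(cl)

col_sums
```

Only the small FBM object is sent to the workers, not the data itself, which is why this pattern scales to matrices that do not fit in memory.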

Example use case: computing the null space of a big matrix (works if one dimension is not too large).
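One possible way to do this when the number of columns is small is sketched below; it is not necessarily the approach used in the linked post. It relies on the fact that X and XᵀX have the same null space, so it is enough to eigendecompose the small cross-product matrix. The toy data and the tolerance are assumptions for illustration.

```r
library(bigstatsr)

set.seed(1)
x <- matrix(rnorm(100 * 4), nrow = 100, ncol = 4)
x <- cbind(x, x[, 1] + x[, 2])   # 5th column is linearly dependent
X <- as_FBM(x, backingfile = tempfile())

# K = X^T X is only 5 x 5, so it fits in memory even when X has many rows
K <- big_crossprodSelf(X, fun.scaling = big_scale(center = FALSE, scale = FALSE))
eig <- eigen(K[], symmetric = TRUE)

# Eigenvectors with (numerically) zero eigenvalues span the null space of X
tol <- max(eig$values) * 1e-10   # illustrative relative tolerance
null_basis <- eig$vectors[, eig$values < tol, drop = FALSE]

ncol(null_basis)             # dimension of the null space
max(abs(x %*% null_basis))   # should be numerically close to zero
```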

How to make a great R reproducible example?

Please open an issue if you find a bug.

If you want help using {bigstatsr}, please open an issue as well or
post on Stack Overflow with the tag *bigstatsr*.

I will always redirect you to GitHub issues if you email me, so that others can benefit from our discussion.

Privé, Florian, et al. “Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.” Bioinformatics 34.16 (2018): 2781-2787.

Privé, Florian, Hugues Aschard, and Michael GB Blum. “Efficient implementation of penalized regression for genetic risk prediction.” Genetics 212.1 (2019): 65-74.