We provide several functions for Monte Carlo simulations to assess
the performance of the outlier detection algorithm
`outlier_detection()`

and the various statistical tools such
as `outliers_prop()`

. The simulations can be executed in
parallel using various backends.

Monte Carlo simulations involve the following steps:

- create or choose a
*true*2SLS model (including parameters) - specify the outlier detection algorithm to be analysed
*optionally*choose which simulation parameters to vary, such as the sample size- choose whether to execute the simulations sequentially or in parallel and run the simulations

See the vignette *Introduction to the robust2sls Package* for
more details on the model setup (step 1) and the different algorithms
(step 2) that are implemented.

`::vignette("overview", package = "robust2sls") utils`

We conceptualise data as being generated by some *true* model,
the so-called data-generating process (DGP). Specifying a DGP ourselves
in simulations, allows us to check whether the theory works in practice.
For example, we could use Monte Carlo simulations to check whether the
2SLS estimator recovers the true parameters; whether the proportion of
detected outliers corresponds to the expected proportion; or whether a
statistical test has expected size even in finite samples.

First of all, we need to specify a valid 2SLS model and its
parameters. The function `generate_param()`

can be used to
generate random parameters of a 2SLS model that fulfill the 2SLS
conditions. For instance, the parameters are created such that the
structural error is uncorrelated to the instruments. Instead of random
parameters, they can also - partly or fully - be specified by the
user.

```
library(robust2sls)
<- generate_param(dx1 = 3, dx2 = 2, dz2 = 3, intercept = TRUE, seed = 10) p
```

Here, we create parameters for a model with 3 exogenous and 2 endogenous regressors, and 3 outside instruments. The model includes an intercept, so one of the exogenous instruments is simply a constant. The parameters are stored in a list.

Structural equation: \(y_{i} = \beta_{1} x_{1,i} + \beta_{2} x_{2,i} + \beta_{3} x_{3,i} + \beta_{4} x_{4,i} + \beta_{5} x_{5,i} + u_{i}\)

First stage: \(x_{i} = \Pi^{\prime} z_{i} + r_{i}\),

where the vector \(x_{i}\) contains all the regressors and the vector of instruments \(z_{i}\) contains the 3 exogenous regressors and the two excluded instruments. \(\Pi\) is the matrix of first stage coefficients.

The workhorse command for different types of trimmed 2SLS algorithms
in the *robust2sls* package is `outlier_detection()`

.
The main decisions are

- which initial estimator to use
- how the sample is trimmed, which is governed by
- the reference distribution against which the errors are judged to be outliers or not
- the cut-off value \(c\) that determines the size of the standardised errors beyond which observations are labelled as outliers and subsequently removed

- how often the algorithm is iterated, which is represented by the parameter \(m\).

To keep things simple and the run-time of the simulations low, we do
not iterate the algorithm in this example. We use the Robustified 2SLS
algorithm, which uses the full sample for the initial estimates. As is
commonly done, we use the normal distribution as the reference
distribution. To target a *false outlier detection rate* of
approximately 5%, we choose a cut-off value of approximately 1.96,
meaning that observations with an absolute standardised residual larger
than 1.96 are classified as outliers. This is set using the
`sign_level`

argument of the function, which together with
the reference distribution, `ref_dist`

, automatically
determines the cut-off value.

The simulation function `mc_grid()`

also takes these
arguments and internally uses them to call the
`outlier_detection()`

function repeatedly across
replications.

Again to keep the run-time low, we only vary the sample size. We choose small sample sizes of 50 and 100, respectively.

The functions `mc()`

and `mc_grid()`

are
designed to be used either sequentially or in parallel. They are
implemented using the foreach package.
To ensure that the results are reproducible across different ways of
executing the simulations (sequentially or parallel; within the latter
as multisession, multicore, cluster etc.), the package doRNG is used to
execute the loops.

The Monte Carlo functions leave the registration of the foreach adaptor to the end-user. For example, both the packages doParallel and doFuture can be used.

We first consider running the Monte Carlo simulation in parallel. We
set the number of cores and create the cluster. Note that CRAN only
allows for at most two cores, so the code limits the number of cores.
For `registerDoParallel()`

, we need to export the functions
that are used within `mc_grid()`

explicitly. With
`registerDoFuture()`

, it should not be necessary to
explicitly export variables or packages because it identifies them
automatically via static code inspection.

```
library(parallel)
<- 2
ncores <- makeCluster(ncores)
cl # export libraries to all workers in the cluster
invisible(clusterCall(cl = cl, function(x) .libPaths(x), .libPaths()))
```

First, we use the doParallel package to run the simulations in parallel.

```
library(doParallel)
#> Loading required package: foreach
#> Loading required package: iterators
registerDoParallel(cl)
<- mc_grid(M = 100, n = c(100, 1000), seed = 42, parameters = p,
sim1 formula = p$setting$formula, ref_dist = "normal",
sign_level = 0.05, initial_est = "robustified", iterations = 0,
shuffle = FALSE, shuffle_seed = 42, split = 0.5)
```

Next, we use the doFuture package for the parallel loop. Both implementations yield the same result.

```
library(doFuture)
#> Loading required package: future
registerDoFuture()
plan(cluster, workers = cl)
<- mc_grid(M = 100, n = c(100, 1000), seed = 42, parameters = p,
sim2 formula = p$setting$formula, ref_dist = "normal",
sign_level = 0.05, initial_est = "robustified", iterations = 0,
shuffle = FALSE, shuffle_seed = 42, split = 0.5)
stopCluster(cl)
# check identical results
identical(sim1, sim2)
#> [1] TRUE
```

To run the loop sequentially, we can again use the doFuture package but this time setting a different plan. The doRNG ensures that the results are identical to those from the parallel loops.

```
library(doFuture)
registerDoFuture()
plan(sequential)
<- mc_grid(M = 100, n = c(100, 1000), seed = 42, parameters = p,
sim3 formula = p$setting$formula, ref_dist = "normal",
sign_level = 0.05, initial_est = "robustified", iterations = 0,
shuffle = FALSE, shuffle_seed = 42, split = 0.5)
#> Loading required package: rngtools
# check identical results
identical(sim1, sim3)
#> [1] TRUE
```