This vignette showcases the functions `regressionImp()` and
`rangerImpute()`, which can both be used to generate imputations for
several variables in a dataset using a formula interface.

For data, a subset of `sleep` is used. The columns have been selected
deliberately to include some interactions between the missing values.

```
library(VIM)
library(magrittr)
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)
aggr(dataset)
```

```
str(dataset)
#> 'data.frame': 62 obs. of 4 variables:
#> $ Dream : num NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
#> $ NonD : num NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
#> $ BodyWgt: num 8.803 0 1.2194 -0.0834 7.8427 ...
#> $ Span : num 3.65 1.5 2.64 NA 4.23 ...
```

To invoke the imputation methods, a formula is used to specify which
variables are to be estimated and which variables should be used as
regressors. We will start by imputing `NonD` based on `BodyWgt` and
`Span`.

```
imp_regression <- regressionImp(NonD ~ BodyWgt + Span, dataset)
#> There still missing values in variable NonD . Probably due to missing values in the regressors.
imp_ranger <- rangerImpute(NonD ~ BodyWgt + Span, dataset)
aggr(imp_regression, delimiter = "_imp")
```

We can see that for `regressionImp()` there are still missings in
`NonD` for all observations where `Span` is unobserved. This is because
the regression model could not be applied to those observations. The
same is true for the values imputed via `rangerImpute()`.
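The link between the remaining gaps in `NonD` and the missing values in `Span` can be checked directly. The following is a minimal sketch that repeats the setup from above so it runs on its own; the helper name `still_missing` is introduced here for illustration only.

```r
library(VIM)

# Same preparation as at the start of the vignette
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)

imp_regression <- regressionImp(NonD ~ BodyWgt + Span, dataset)

# Rows where NonD is still missing after imputation
still_missing <- is.na(imp_regression$NonD)

# Cross-tabulate the remaining gaps in NonD against missingness in Span
table(NonD_still_missing = still_missing,
      Span_missing = is.na(dataset$Span))
```

If the explanation above holds, every row with `NonD_still_missing = TRUE` should also have `Span_missing = TRUE`.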

As we can see in the next two plots, the correlation structure of
`NonD` and `BodyWgt` is preserved by both imputation methods. In the
case of `regressionImp()`, the imputed values lie almost on a straight
line. This suggests that the variable `Span` had little to no effect on
the model.

```
imp_regression[, c("NonD", "BodyWgt", "NonD_imp")] %>%
  marginplot(delimiter = "_imp")
```
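One way to probe the claim about `Span` is to fit the corresponding linear model by hand and look at its coefficient table. This sketch assumes that `regressionImp()` fits an ordinary linear model for a numeric target, so `lm()` is used here as a stand-in and is not necessarily the exact model VIM fits internally; the name `fit_full` is illustrative.

```r
library(VIM)

# Same preparation as at the start of the vignette
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)

# lm() drops incomplete rows by default (na.action = na.omit),
# so the fit uses only observations where all three variables are observed
fit_full <- lm(NonD ~ BodyWgt + Span, data = dataset)

# Estimates, standard errors and t statistics for each regressor
summary(fit_full)$coefficients
```

A `Span` coefficient that is small relative to its standard error would be consistent with the nearly straight line seen in the plot.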

For `rangerImpute()`, on the other hand, `Span` played an important
role in the generation of the imputed values.

```
imp_ranger[, c("NonD", "BodyWgt", "NonD_imp")] %>%
  marginplot(delimiter = "_imp")
```

```
imp_ranger[, c("NonD", "Span", "NonD_imp")] %>%
  marginplot(delimiter = "_imp")
```

To impute several variables at once, the formula in `rangerImpute()`
and `regressionImp()` can be specified with more than one column name
on the left-hand side.

```
imp_regression <- regressionImp(Dream + NonD ~ BodyWgt + Span, dataset)
#> There still missing values in variable Dream . Probably due to missing values in the regressors.
#> There still missing values in variable NonD . Probably due to missing values in the regressors.
imp_ranger <- rangerImpute(Dream + NonD ~ BodyWgt + Span, dataset)
aggr(imp_regression, delimiter = "_imp")
```

Again, there are missings left for both `Dream` and `NonD`.

To validate the performance of `regressionImp()`, the `iris` dataset is
used. First, some values are randomly set to `NA`.

```
library(reactable)
data(iris)
df <- iris
colnames(df) <- c("S.Length", "S.Width", "P.Length", "P.Width", "Species")
# randomly produce some missing values in the data
set.seed(1)
nbr_missing <- 50
# draw random (row, column) positions; the Species column is excluded
y <- data.frame(row = sample(nrow(iris), size = nbr_missing, replace = TRUE),
                col = sample(ncol(iris) - 1, size = nbr_missing, replace = TRUE))
# drop duplicated positions, then set the selected cells to NA
y <- y[!duplicated(y), ]
df[as.matrix(y)] <- NA
aggr(df)
```

```
sapply(df, function(x)sum(is.na(x)))
#> S.Length S.Width P.Length P.Width Species
#> 12 10 13 12 0
```
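The counts above add up to 47 rather than 50 because duplicated (row, column) positions are dropped before the cells are set to `NA`. The mechanism relies on indexing a data frame with a two-column matrix of positions. The following toy sketch in base R shows the idea on a small data frame; `toy` and `idx` are made-up names used only here.

```r
# Small numeric data frame standing in for the real data
toy <- data.frame(a = 1:4, b = 5:8, c = 9:12)

# Three drawn positions, one of them drawn twice: (row 3, column 1)
idx <- data.frame(row = c(1, 3, 3), col = c(2, 1, 1))

# Remove the repeated position, leaving two distinct cells
idx <- idx[!duplicated(idx), ]

# A two-column numeric matrix addresses individual cells as (row, column)
toy[as.matrix(idx)] <- NA

# Number of cells that were actually blanked
sum(is.na(toy))
```

Only the cells `toy[1, "b"]` and `toy[3, "a"]` are affected, so fewer cells end up missing than positions were drawn.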

We can see that there are missings in all numeric variables and that
some observations have missing values in several variables. In the next
step we perform a multiple-variable imputation in which `Species`
serves as the regressor.

```
imp_regression <- regressionImp(S.Length + S.Width + P.Length + P.Width ~ Species, df)
aggr(imp_regression, delimiter = "imp")
```
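With a single categorical regressor, a linear model predicts the group mean of the observed values. The base R sketch below illustrates this on `iris` under the assumption that an ordinary linear model is fitted for a numeric target; it uses `lm()` directly and its own small set of artificial missings, so it is an illustration of the idea rather than a reproduction of what `regressionImp()` does internally. The names `df_demo`, `miss`, `pred` and `rmse` are illustrative.

```r
data(iris)
df_demo <- iris

# Blank ten sepal lengths at random so the true values stay known
set.seed(1)
miss <- seq_len(nrow(df_demo)) %in% sample(nrow(df_demo), 10)
df_demo$Sepal.Length[miss] <- NA

# Fit on the observed rows (lm drops rows with NA by default)
fit <- lm(Sepal.Length ~ Species, data = df_demo)
pred <- predict(fit, newdata = df_demo[miss, ])

# Species means of the observed values
group_means <- tapply(df_demo$Sepal.Length, df_demo$Species,
                      mean, na.rm = TRUE)
expected <- group_means[as.character(df_demo$Species[miss])]

# The predictions coincide with the species means
same <- isTRUE(all.equal(as.numeric(pred), as.numeric(expected)))
same

# Root mean squared error against the true values that were blanked
rmse <- sqrt(mean((pred - iris$Sepal.Length[miss])^2))
rmse
```

Comparing the predictions with the original `iris` values in this way is one simple option for judging how well such an imputation recovers the truth.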

The plot indicates that all missing values have been imputed by the
`regressionImp()` algorithm. The following table displays the rounded
first five results of the imputation for all variables.