In some settings, we don’t have access to the full data unit for every observation in our sample. These “coarsened-data” settings (see, e.g., Van der Vaart (2000)) add a layer of complication to estimating variable importance. In particular, the efficient influence function (EIF) in the coarsened-data setting is more complex, and involves estimating an additional quantity: the projection of the full-data EIF (estimated on the fully-observed sample) onto the variables that are always observed (Chapter 25.5.3 of Van der Vaart (2000); see also Example 6 in Williamson, Gilbert, Simon, et al. (2021)).
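The structure described above is often written in the following augmented form. The notation is an assumption introduced here for illustration and is not taken from the cited sources: $W$ is the full data unit, $Z$ collects the always-observed variables, $C$ indicates a fully observed unit, $\pi(Z) = P(C = 1 \mid Z)$, and $\phi^{F}$ is the full-data EIF.

$$
\phi^{\mathrm{obs}} = \frac{C}{\pi(Z)}\,\phi^{F}(W) - \left(\frac{C}{\pi(Z)} - 1\right) E\left[\phi^{F}(W) \mid Z\right]
$$

The conditional expectation in the second term plays the role of the additional quantity. Under coarsening at random, $E[\phi^{F}(W) \mid Z] = E[\phi^{F}(W) \mid C = 1, Z]$, which is why it can be estimated by a regression among the fully observed units.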

`vimp` can handle coarsened data, with the specification of several arguments:

- `C`: a binary indicator vector, denoting which observations have been coarsened; 1 denotes fully observed, while 0 denotes coarsened.
- `ipc_weights`: inverse probability of coarsening weights, assumed to already be inverted (i.e., `ipc_weights` = 1 / [estimated probability of being fully observed, `C` = 1]).
- `ipc_est_type`: the type of procedure used for coarsened-at-random settings; options are `"ipw"` (for inverse probability weighting) or `"aipw"` (for augmented inverse probability weighting). Only used if `C` is not all equal to 1.
- `Z`: a character vector specifying the variable(s) among `Y` and `X` that are thought to play a role in the coarsening mechanism. To specify the outcome, use `"Y"`; to specify covariates, use a character number corresponding to the desired position in `X` (e.g., `"1"` or `"X1"` [the latter is case-insensitive]).
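To make the roles of `C`, `ipc_weights`, and the two `ipc_est_type` options concrete, here is a minimal base-R sketch for a deliberately simple target, the mean of a partially missing outcome. It is not how `vimp` computes variable importance: the simulated data, the names (`x1`, `pi_hat`, `m_hat`, `y_filled`), and the use of `lm` in place of `SuperLearner` are all assumptions made for illustration.

```r
# Illustrative only: IPW vs. AIPW for the mean of Y, not vimp's estimator.
set.seed(2024)
n <- 500
x1 <- rnorm(n)                       # always observed
y <- 1 + 0.5 * x1 + rnorm(n)         # true mean of Y is 1
# C = 1 means fully observed; coarsening depends only on x1
C <- rbinom(n, size = 1, prob = plogis(0.5 + 0.5 * x1))
obs_y <- ifelse(C == 1, y, NA)

# weights are already inverted: 1 / estimated P(C = 1 | x1)
pi_hat <- predict(glm(C ~ x1, family = "binomial"), type = "response")
ipc_weights <- 1 / pi_hat

# IPW: weight the fully observed outcomes (missing values contribute 0)
y_filled <- ifelse(C == 1, obs_y, 0)
ipw_est <- mean(C * ipc_weights * y_filled)

# AIPW: add a correction from a regression fit on the fully observed units
# and predicted for everyone (lm is a stand-in for a flexible learner)
dat <- data.frame(obs_y = obs_y, x1 = x1)
fit <- lm(obs_y ~ x1, data = dat[C == 1, ])
m_hat <- predict(fit, newdata = dat)
aipw_est <- mean(C * ipc_weights * y_filled - (C * ipc_weights - 1) * m_hat)

c(ipw = ipw_est, aipw = aipw_est)
```

For the mean, the full-data EIF is the outcome minus its mean, so the regression `m_hat` is, up to a constant, the projection onto the always-observed variable described in the opening paragraph.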

`Z` plays a role in the additional estimation mentioned above. Unless otherwise specified, an internal call to `SuperLearner` regresses the full-data EIF (estimated on the fully-observed data) onto a matrix that is the parsed version of `Z`. If you wish to use any covariates from `X` as part of your coarsening mechanism (and thus include them in `Z`), and they have *different names from X1, …*, then you must use character numbers (i.e., `"1"` refers to the first variable, etc.) to refer to the variables to include in `Z`. Otherwise, `vimp` will throw an error.

In this first example, the outcome `Y` is subject to missingness. We generate data as follows:

```
set.seed(1234)
p <- 2
n <- 100
x <- replicate(p, stats::rnorm(n, 0, 1))
# apply the function to the x's
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)
# indicator of observing Y
logit_g_x <- .01 * x[, 1] + .05 * x[, 2] - 2.5
g_x <- exp(logit_g_x) / (1 + exp(logit_g_x))
C <- rbinom(n, size = 1, prob = g_x)
obs_y <- y
obs_y[C == 0] <- NA
x_df <- as.data.frame(x)
full_df <- data.frame(Y = obs_y, x_df, C = C)
```
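Before fitting anything, it can help to check how much of the outcome is actually observed under this data-generating mechanism. The sketch below regenerates the data from the block above in base R and tabulates `C`; the names `marginal_prob` and `n_observed` are illustrative and not part of `vimp`.

```r
# Regenerate the missing-outcome data and inspect the coarsening indicator.
set.seed(1234)
p <- 2
n <- 100
x <- replicate(p, stats::rnorm(n, 0, 1))
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)
logit_g_x <- .01 * x[, 1] + .05 * x[, 2] - 2.5
g_x <- plogis(logit_g_x)  # plogis(u) equals exp(u) / (1 + exp(u))
C <- rbinom(n, size = 1, prob = g_x)
obs_y <- y
obs_y[C == 0] <- NA

# probability of observing Y when both covariates are 0: about 0.076
marginal_prob <- plogis(-2.5)
# number of fully observed outcomes in this sample
n_observed <- sum(C)
table(C)

# Y is missing exactly where C == 0
stopifnot(identical(is.na(obs_y), C == 0))
```

With an intercept of -2.5 on the logit scale, only a small fraction of the 100 outcomes is expected to be observed, so the regressions that follow are fit on very few complete cases. This is worth keeping in mind when reading the output.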

Next, we estimate the relevant components for `vimp`:

```
library("vimp")
library("SuperLearner")
# estimate the inverse probability of observing the outcome (C = 1)
ipc_weights <- 1 / predict(glm(C ~ V1 + V2, family = "binomial", data = full_df),
                           type = "response")
# set up the SL
learners <- c("SL.glm", "SL.mean")
V <- 2
# estimate vim for X2
set.seed(1234)
est <- vim(Y = obs_y, X = x_df, indx = 2, type = "r_squared", run_regression = TRUE,
           SL.library = learners, alpha = 0.05, delta = 0, C = C, Z = c("Y", "1"),
           ipc_weights = ipc_weights, cvControl = list(V = V))
```

`## Warning: All algorithms have zero weight`

```
## Warning: All metalearner coefficients are zero, predictions will all be equal to
## 0
## Warning: All metalearner coefficients are zero, predictions will all be equal to
## 0
```

In this second example, we observe outcome `Y` and covariate `X1` on all participants in a study. Based on the values of `Y` and `X1`, we include some participants in a second-phase sample, and further measure covariate `X2` on these participants. This is an example of a two-phase study. We generate data as follows:

```
set.seed(4747)
p <- 2
n <- 100
x <- replicate(p, stats::rnorm(n, 0, 1))
# apply the function to the x's
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)
# make this a two-phase study, assume that X2 is only measured on
# subjects in the second phase; note C = 1 is inclusion
C <- rbinom(n, size = 1, prob = exp(y + 0.1 * x[, 1]) / (1 + exp(y + 0.1 * x[, 1])))
tmp_x <- x
tmp_x[C == 0, 2] <- NA
x <- tmp_x
x_df <- as.data.frame(x)
full_df <- data.frame(Y = y, x_df, C = C)
```
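Because inclusion in the second phase depends on `Y`, the second-phase participants are not a random subsample, which is what the inverse probability weights are meant to correct. The sketch below regenerates the two-phase data from the block above and compares three estimates of the mean of `Y`. Since `Y` is observed on everyone here, the full-sample mean serves as a benchmark. The names `full_mean`, `cc_mean`, and `ipw_mean` are illustrative and not part of `vimp`.

```r
# Regenerate the two-phase data and compare unweighted and weighted summaries.
set.seed(4747)
p <- 2
n <- 100
x <- replicate(p, stats::rnorm(n, 0, 1))
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)
# plogis(u) equals exp(u) / (1 + exp(u)); C = 1 is inclusion in phase two
C <- rbinom(n, size = 1, prob = plogis(y + 0.1 * x[, 1]))
x[C == 0, 2] <- NA
full_df <- data.frame(Y = y, as.data.frame(x), C = C)

# X2 (column V2) is missing exactly when C == 0; X1 (column V1) is complete
stopifnot(identical(is.na(full_df$V2), C == 0), !anyNA(full_df$V1))

# weights: 1 / estimated P(C = 1 | Y, X1)
pi_hat <- predict(glm(C ~ Y + V1, family = "binomial", data = full_df),
                  type = "response")
ipc_weights <- 1 / pi_hat

full_mean <- mean(full_df$Y)                     # benchmark, uses everyone
cc_mean <- mean(full_df$Y[C == 1])               # second phase only, unweighted
ipw_mean <- mean(C * ipc_weights * full_df$Y)    # second phase only, weighted
c(full = full_mean, complete_case = cc_mean, ipw = ipw_mean)
```

Since the inclusion probability increases with `Y`, the unweighted second-phase mean would be expected to overshoot the benchmark, while the weighted version targets the full-sample mean.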

If we want to estimate variable importance of `X2`, we need to use the coarsened-data arguments in `vimp`. This can be accomplished in the following manner:

```
library("vimp")
library("SuperLearner")
# estimate the inverse probability of inclusion in the second-phase sample (C = 1)
ipc_weights <- 1 / predict(glm(C ~ y + V1, family = "binomial", data = full_df),
                           type = "response")
# set up the SL
learners <- c("SL.glm")
V <- 2
# estimate vim for X2
set.seed(1234)
est <- vim(Y = y, X = x_df, indx = 2, type = "r_squared", run_regression = TRUE,
           SL.library = learners, alpha = 0.05, delta = 0, C = C, Z = c("Y", "1"),
           ipc_weights = ipc_weights, cvControl = list(V = V), method = "method.CC_LS")
```

`## Loading required package: quadprog`

`## Warning: package 'quadprog' was built under R version 4.0.3`

Van der Vaart, AW. 2000. *Asymptotic Statistics*. Vol. 3.

Williamson, BD, PB Gilbert, NR Simon, et al. 2021. “A General
Framework for Inference on Algorithm-Agnostic Variable
Importance.” *Journal of the American Statistical
Association*. https://doi.org/10.1080/01621459.2021.2003200.