# Introduction to the mzipmed R package

In this document we explain the main features of the mzipmed package through examples. This package performs mediation analysis when either the outcome or mediator are a count variable with many zeroes, called a zero-inflated count. Specifically, this package merges the concepts of the counterfactual approach to mediation and the marginalized zero-inflated Poisson (MZIP) model to obtain mediation effects when there are excess zeroes. Additional information on statistical methodology is provided elsewhere (Long et al. 2014, Vanderweele 2016)

# Brief introduction to counterfactual approach to mediation

Mediation analysis is used to explore how a variable called a mediator helps explain the relationship between some exposure and an outcome. Many popular mediation method such as the difference and product methods collapse when there are interaction effects or the outcome is not a continuous variable. The counterfactual approach to mediation involves finding mediation effects through closed-form expressions that is extendable to many different classes of models and can account for interaction effects while minimizing computation speed. If we let $$Y$$ be our outcome, $$M$$ be our mediator, $$X$$ be our exposure, and $$C$$ be a vector of covariates. The counterfactual approach involves fitting two regression models, one for the outcome and one for the mediator.

$g(E[Y|X=x,M=m,\textbf{C=c}])=g(E[Y|x,m,c])=\tau_0+\tau_1x+\tau_2m+\mathbf{ \tau_4'c}$ $h(E[M|X=x,\textbf{C=c}])=h(E[M|x,c])=\theta_0+\theta_1x++\mathbf{ \theta_4'c}$ Where $$g$$ and $$h$$ are link functions associated with the outcome and mediator model respectively and $$\tau$$ and $$\theta$$ are the parameter estimates associated with the outcome and mediator model respectively. Using parameter estimates from these equations mediation effects can be estimated including the total effect (TE) or the overall effect of the exposure on the outcome, natural direct effect (NDE) or the effect of the exposure on the outcome not operating through the mediator, and natural indirect effect (NIE) or the effect operating through the mediator. Basic formulas for these equations are as follows $NDE=\sum_m(E[Y|x,m,c]-E[Y|x^*,m,c])P(m|x^*,c)$ $NIE=\sum_m(E[Y|x,m,c])(P(m|x,c)-P(m|x^*,c))$ Where $$x^*$$ is a fixed value of the exposure to be compared to $$x$$. The total effect can then be computed as the sum of NDE and NIE on a difference scale and the product of the NDE and NIE if the outcome is on a ratio scale.

# Brief overview of MZIP

Standard count regression models, such as Poisson, give biased parameter estimates and popular count models like the zero-inflated Poisson model do not directly provide population average parameter estimates. In response the marginalized zero-inflated Poisson (MZIP) model was proposed. MZIP is a mixture distribution that models the probability of being an excess zero from a Bernoulli distribution with probability $$\psi_i$$ and the overall mean of the distribution from a Poisson distribution with mean $$\nu_i$$, where $$\nu_i$$ is the product of the mean of the non-excess zeroes, $$\mu_i$$, and $$1-\psi_i$$. MZIP is fit by $logit(\psi_i)=\mathbf{Z_i'\gamma}$ $log(\nu_i)=\mathbf{Z_i'\alpha}$ Where $$\alpha$$ and $$\gamma$$ are $$(p\times 1)$$ column vectors of parameters associated with the overall population mean and probability of being an excess zero models. $$\mathbf{Z_i}$$ is a $$(p\times 1)$$ vector of covariates for the i-th individual for the excess zero and overall population mean components of MZIP. Exponentiating $$\alpha$$ parameters yield incidence density or incidence rate ratio interpretations equivalent to Poisson model interpretations.

# mzipmed_data overview

Included in this package is a simulated dataset for examples called mzipmed_data. This dataset includes 500 observations of 10 variables. These variables are

• X is a binary exposure simulated from a Bernoulli distribution with p=0.5
• C1 is a covariate that follows a standard normal distribution
• C2 is a second covariate from a ~Beta(2,2)
• ziM is a zero-inflated count mediator simulated from the predictors X,C1, and C2
• lmM is a continuous mediator based on the predictors with error term ~N(0,4)
• binM is a binary mediator based on the predictors simulated with a Bernoulli distribution
• lmY is a continuous outcome simulated from ziM and the predictors with error term ~N(0,4)
• binY is a binary outcome simulated from ziM and the predictors with a Bernoulli distribution
• ziY1 is a zero-inflated count outcome simulated from lmM and the predictors
• ziY2 is a another zero-inflated count outcome, but instead uses binM and the predictors

# mzip() function for analysis with zero-inflated count variables

MZIP can be performed with the mzip() function. For example, assume we would like to observe how our X is associated with ziY1from mzipmed_data. First we need to load the mzipmed package

library(mzipmed)

Next we can use the mzip() function to conduct this analysis. First we need to specify the y argument which in our case y=mzipmed_data$ziY1. For the covariates we specify the pred argument. For a single predictor we could use pred=mzipmed_data$X. Suppose we adjust for C1 and C2 then we bind all the predictors into a vector using the cbind() function in base R, e.g. pred=cbind(mzipmed_data$X,mzipmed_data$C1,mzipmed_data$C2). If we want our covariates named in the outcome we need to specify within the cbind() function, e.g. pred=cbind(X=mzipmed_data$X,C1=mzipmed_data$C1,C2=mzipmed_data$C2) otherwise they will be named X1,X2,… in order they are included. The final argument is print=FALSE, if print=TRUE parameter estimates, standard errors, p-values, and risk ratios will be printed into the R console. The default is TRUE. In additional an offset variable can be specified using the offset argument, e.g. offset=variable. No offset is included in this dataset. Using this example

test=mzip(y=mzipmed_data$ziY1, pred=cbind(X=mzipmed_data$X,C1=mzipmed_data$C1,C2=mzipmed_data$C2), offset=NULL, print=TRUE)
#>  "Overall Mean Estimates from MZIP"
#>    Variable Alpha_Estimate    SE P_Value Robust_SE Robust_P_Value
#> 1 intercept         -0.520 0.200  0.0092     0.237         0.0282
#> 2         X          0.249 0.150  0.0974     0.209         0.2340
#> 3        C1          0.064 0.069  0.3605     0.073         0.3845
#> 4        C2          0.886 0.322  0.0059     0.366         0.0153
#>  "Excess Zero Estimates from MZIP"
#>    Variable Gamma_Estimate    SE P_Value Robust_SE Robust_P_Value
#> 1 intercept          0.515 0.259  0.0469     0.264         0.0508
#> 2         X          0.288 0.199  0.1495     0.205         0.1614
#> 3        C1          0.040 0.085  0.6436     0.081         0.6273
#> 4        C2         -0.095 0.410  0.8172     0.399         0.8124
#>  "Overall Mean Incidence Rate Ratio Estimates"
#>   Variable incrate_ratio Lower_CI Upper_CI RobLower_CI RobUpper_CI
#> 1        X         1.283    0.956    1.722       0.851       1.934
#> 2       C1         1.066    0.930    1.221       0.923       1.230
#> 3       C2         2.426    1.291    4.558       1.185       4.967

We find no evidence that X is significantly associated with the outcome and we obtain a relative risk estimate of 1.28. The robust standard errors are often used if over-dispersion is a concern as they typically provide wider confidence intervals. In the test output one can also find covariance matrices, parameter estimates, confidence intervals, etc.

# Mediation with Zero-Inflated Count Outcomes

## Continuous Mediators

Assume we want to examine how a continuous mediator explains the relationship between an exposure and a zero-inflated count mediator. We would fit the following models $logit(\psi_i|x,m,c)=\gamma_0+\gamma_1x+\gamma_2m+\mathbf{\gamma_4'c}$ $log(\nu_i|x,m,c)=\alpha_0+\alpha_1x+\alpha_2m+\mathbf{\alpha_4'c}$ $E(M|x,c)=\theta_0+\theta_1x+\mathbf{\theta_4'c}$ The overall mean outcome model is on a incidence rate ratio (IRR) scale so NDE and NIE estimates will be as well and can be computed by $IRR^{NDE}=e^{\alpha_1(x-x^*)}$ $IRR^{NIE}=e^{\alpha_2\theta_1(x-x^*)}$ To conduct this mediation analysis in the mzipmed package we would use the zioutlmmed() function. This functions only allows for a single outcome, mediator, and exposure at a time. Using the mzipmed data let $$X=X$$ and $$M=lmM$$ and our outcome is ziY1. We specify these variables with the following arguments outcome=mzipmed_data$ziY1,mediator=mzipmed_data$lmM, exposure=mzipmed_data$X, and for the covariates/confounders confounder=cbind(mzipmed_data$C1,mzipmed_data$C2). Standard errors can be computed using bootstrapping which is robust, but computationally intensive or via delta method for instantaneous variance estimation. For delta method use error="Delta" which is the default and for bootstrap use error="Boot". Robust standard errors can be requested within the delta method to be used for concerns of over-dispersion by specifying robust=TRUE; FALSE is the default. We also need to input values of the exposure to compare to using the X and Xstar arguments, for a dummy-coded binary exposure and by default we let X=1 and Xstar=0. These can be changed as needed. For bootstrap standard error you can specify the number of iterations, by default there are 1000 repetitions, n=1000. An offset variable for the ZI count outcome can be specified with the zioff argument using the same syntax as the outcome,exposure arguments. As an example ziout=zioutlmmed(outcome=mzipmed_data$ziY1,mediator=mzipmed_data$lmM,exposure=mzipmed_data$X, confounder=cbind(mzipmed_data$C1,mzipmed_data$C2), error="Delta", robust=FALSE,X=1,Xstar=0,n=1000, zioff=NULL)
#>                               Estimate log SE Lower CI Upper CI
#> Natural Direct Effect (IRR)      1.021  0.147    0.766    1.361
#> Natural Indirect Effect (IRR)    1.163  0.068    1.018    1.328
#> Total Effect (IRR)               1.187  0.162    0.865    1.630
#> Proportion Mediated              0.889

Printed into the R console are the incidence rate ratios of the NDE, NIE, and TE along with their standard errors and confidence intervals and the proportion mediated. From the ziout data we see no evidence of a significant direct effect of the exposure on the outcome as the incidence rate ratio includes 1 in the confidence interval, but do see that a significant effect operating through the mediator since 1 is not in the confidence interval. Additional information in the ziout in the R global environment include all information included in the outcome model using mzip() and the mediator model using lm() from the stats package in base R.

## Binary Mediators

The mzip package also includes a function for mediation with zero-inflated count outcomes and binary mediators. For this procedure a MZIP model is fit for the outcome and a logistic regression is fit for the mediator model using the glm() function in base R. The aforementioned function is called zioutbinmed(). The NDE is the same as with continuous mediators, but the NIE is now conditional on covariates

$IRR^{NIE}=\frac{([1+e^{\theta_0+\theta_1x^*+\mathbf{\theta_2'c}}][1+e^{\alpha_2 \theta_0+\theta_1x+\mathbf{\theta_2'c}}])}{([1+e^{\theta_0+\theta_1x+\mathbf{\theta_2'c}}][1+e^{\alpha_2\theta_0+\theta_1x^*+\mathbf{\theta_2'c}}])}$ Therefore fixed values of the covariates are required. The input for the zioutbinmed() function is the same as in the previous section, but also includes an argument to specify these covariates called C. By default, C=NULL each covariate will be fixed at its mean value to approximate marginal estimates, but can be changed using for example C=c(0,2). The remainder of the input and output for this function is the same as with the a continuous mediator and we now use outcome=mzipmed_data$ziY2 for the outcome variable. zioutb=zioutbinmed(outcome=mzipmed_data$ziY2,mediator=mzipmed_data$binM, exposure=mzipmed_data$X, confounder=cbind(mzipmed_data$C1,mzipmed_data$C2), error="Delta",  robust=FALSE,X=1, Xstar=0, n=1000, C=NULL, zioff=NULL)
#>                               Estimate log SE Lower CI Upper CI
#> Natural Direct Effect (IRR)      1.267  0.138    0.966    1.661
#> Natural Indirect Effect (IRR)    1.008  0.011    0.987    1.029
#> Total Effect (IRR)               1.277  0.138    0.974    1.674
#> Proportion Mediated              0.037

# Mediation with Zero-Inflated Count Mediators

## Continuous Outcomes with ZI Mediators

Assume we have a continuous outcome, $$Y$$, and a zero-inflated count mediator $$M$$. We fit a linear regression outcome model and a MZIP mediator model as follows: $E(Y|x,m,c)=\beta_0+\beta_1x+\beta_2m+\mathbf{\beta_4'c}$ $logit(\psi_i(M)|x,c)=\gamma_0+\gamma_1x+\mathbf{\gamma_4'c}$ $log(\nu_i(M)|x,c)=\alpha_0+\alpha_1x+\mathbf{\alpha_4'c}$ Because the outcome is on a difference scale, so will the NDE and NIE estimates which can be derived by $\beta_1(x-x^*)$ $\beta_2[e^{\alpha_0+\alpha_1x+\mathbf{\alpha_4'c}}-e^{\alpha_0+\alpha_1x^*+\mathbf{\alpha_4'c}}]$ For this mediation analysis the lmoutzimed() function could be used. For example with the mzipmed_data we assume the same set of predictors as in the section with ZI outcomes, but now let mediator=mzipmed_data$ziM and outcome=mzipmed_data=lmY. Input required is the same as in the section for ZI outcomes with the C argument shown in the ZI outcomes with binary mediators example and this function uses the lm() function for the outcome variable and the mzip() function for the mediator model with all effects being on a difference scale.Covariates and confounders can be specified in the confounder argument.Standard errors can be computed using delta method, error="Delta" or via bootstrap error="Boot". For delta method robust standard errors can be used with the robust=TRUE argument. For bootstrap the number of iterations can be specified with the n argument with 1000 being the default. An offset for the zero-inflated count mediator model can be specified with the zioff argument. For example, zimed=lmoutzimed(outcome=mzipmed_data$lmY,mediator=mzipmed_data$ziM, exposure=mzipmed_data$X, confounder=cbind(mzipmed_data$C1,mzipmed_data$C2),error="Delta", robust=FALSE,X=1,Xstar=0,C=NULL, zioff=NULL)
#>                         Estimate     SE Lower CI Upper CI
#> Natural Direct Effect      1.388  0.366    0.670    2.106
#> Natural Indirect Effect    0.072  0.049   -0.025    0.169
#> Total Effect               1.460  0.367    0.741    2.179
#> Proportion Mediated        0.049

After running the function we can see in the R console that the NDE is statistically significant as the mean difference does not include 0 in the confidence interval, but the NIE is not significant because it includes 0. Additional information on the individual outcome and mediator models can be found in the zimed data now in the global environment.

Note: when using this functions extension with an interaction an additional argument is required if an offset is specified. This argument called OFF requires a fixed value of the offset variable to derive effects. If none are specified then the mean of the offset is used. If no offset is specified then this argument is no needed.

## Binary outcomes with ZI Mediators

For binary outcomes we do not use a logistic regression outcome model because of the non-collapsability of logit-links makes deriving closed form expression of NDE and NIE impossible so in this package we use a Poisson outcome model with a log-link with robust covariance structure using the vcovHC() function in the sandwich package. Assuming the same input as in the section for continuous outcomes, but now outcome=mzipmed_data$binYand NDE and NIE estimates will now be on a ratio scale because of the Poisson outcome model. The outcome model will be $log[P(Y_i=1|x,m,c)]=\tau_0+\tau_1x+\tau_2m+\mathbf{\tau_4'c}$ The same MZIP mediator model can be used as shown in the previous section. Risk ratios of NDE and NIE can be derived as displayed below. $RR^{NDE}=e^{\tau_1(x-x^*)}$ $RR^{NIE}=\frac{(1+e^{\gamma_0+\gamma_1x^*+\mathbf{\gamma_4'c}})(e^{\gamma_0+\gamma_1x+\mathbf{\gamma_4'c}}+e^{(e^{\alpha_0+\alpha_1x+\mathbf{\alpha_4'c}+log(1+e^{\gamma_0+\gamma_1x+\mathbf{\gamma_4'c}})})(e^{\tau_2}-1)}}{(1+e^{\gamma_0+\gamma_1x+\mathbf{\gamma_4'c}})(e^{\gamma_0+\gamma_1x^*+\mathbf{\gamma_4'c}}+e^{(e^{\alpha_0+\alpha_1x^*+\mathbf{\alpha_4'c}+log(1+e^{\gamma_0+\gamma_1x^*+\mathbf{\gamma_4'c}})})(e^{\tau_2}-1)}}$ For this scenario we will use the binoutzimed() function. For example, zimedb=binoutzimed(outcome=mzipmed_data$binY,mediator=mzipmed_data$ziM, exposure=mzipmed_data$X, confounder=cbind(mzipmed_data$C1,mzipmed_data$C2),error="Delta", robust=FALSE,X=1,Xstar=0,C=NULL, zioff=NULL, OFF=NULL, rare=FALSE)
#>                              Estimate log SE Lower CI Upper CI
#> Natural Direct Effect (RR)      1.325  0.104    1.081    1.624
#> Natural Indirect Effect (RR)    1.011  0.010    0.991    1.032
#> Total Effect (RR)               1.340  0.103    1.094    1.641
#> Proportion Mediated             0.044

If the outcome is a rare binary variable then odds ratios will approximate risk ratios which may provide better fit for the data. To accommodate this the functions for binary outcomes also include an additional argument, rare. By default rare=FALSE the robust Poisson model is used, but if rare=TRUE a logistic regression is fit for the outcome model. In addition, if an offset is specified then a fixed value of the offset variable is required in deriving mediation effects. This can be done using the OFF argument and should be a numeric value. If this argument is left NULL then the mean of the offset variable is used. If no offset is used then this argument is not needed. The rest of the input and output are all the same as in the section with continuous outcomes, except effects being on a ratio scale. For the zimedb example we find a significant NDE as 1 is not included in the risk ratio confidence interval, but fail to observe a significant NIE.

# Extensions of functions to cases with exposure-mediator interactions

The mzipmed package extends all the previous functions to cases where you are considering if there is an exposure-mediator interaction. In addition to the NDE and NIE additional quantities can be computed including the controlled direct effect (CDE), pure indirect effect (PIE), interactive reference effect (INTref), and interaction mediation effect (INTmed) which are the exposure-outcome relationship due to neither mediation nor interaction, only mediation, only interaction, and the tandem effect of mediation and interaction respectively. The interaction effects are for additive interactions, multiplicative interactions can be assessed directly in the outcome model. Also computed is the proportion eliminated (PE) which answers the question ‘if we were to fix the mediator to some value how much of the exposure differences in the outcome could be eliminated?’.

The functions to do this are as follows: for a continuous outcome and zero-inflated mediator use lmoutzimedint, for a binary outcome and ZI mediator use binoutzimedint, for a ZI outcome and continuous mediator use zioutlmmedint, and for a ZI outcome with a binary mediator use zioutbinmedint. We will skip model expression and formulas for effects, but they are available upon request. Instead we will show an example using zioutlmmedint. These functions have the same input as their non-interaction counterparts, but also require an additional argument in M which is a fixed value of the mediator used to compute CDE, INTref, INTmed, and PE. By default, M=NULL the mean value of the mediator is used, but can be altered e.g. M=0. Running the function will print the NDE, NIE, and TE in R console like before as well as the other 4 effects, the overall additive interaction effect, and PM, PE, and the proportion attributable to interaction.

zimout=zioutlmmedint(outcome=mzipmed_data$ziY1,mediator=mzipmed_data$lmM, exposure=mzipmed_data$X, confounder=cbind(mzipmed_data$C1,mzipmed_data\$C2),error="Delta", robust=FALSE,X=1,Xstar=0,M=NULL,C=NULL)
#>                                Estimate log SE Lower CI Upper CI
#> Natural Direct Effect (IRR)       1.105  0.241    0.689    1.772
#> Natural Indirect Effect (IRR)     1.186  0.079    1.017    1.384
#> Total Effect (IRR)                1.311  0.266    0.778    2.208
#>
#> 4-way Decomposition
#> Controlled Direct Effect (IRR)    0.990  0.148    0.741    1.323
#> Pure Indirect Effect (IRR)        1.145  0.063    1.011    1.296
#> Interactive Reference Effect      0.132  0.053    0.028    0.235
#> Interactive Mediation Effect      0.071  0.073   -0.072    0.213
#> Total Additive Interaction        0.202  0.107   -0.007    0.411
#>
#> Proportions
#> Proportion Mediated               0.662
#> Proportion Eliminated             1.032
#> Proportion from Interaction       0.562

From this example we fail to see a significant NDE, but do have a significant mediation effect based on incidence rate ratio interpretations. The interaction effects are on a difference scale regardless of the scale of the outcome variable so because 0 is included in the total additive interaction confidence interval we can conclude that we do not have evidence of a significant additive interaction effect between the exposure and mediator on the outcome. Investigating further into the zimout data in the global environment proportions attributable to each of the decomposed effects are provided. Because the interaction was not significant, it would be recommended to use the function that does not include the interaction term. It is always useful to check for an interaction effect initially.

# Conclusion

This package provides functions for conducting mediation analysis for zero-inflated count variables using MZIP. This document serves as a brief overview of the package. For more information feel free to contact the package maintainer directly.