Weighted quantile sum (WQS) regression is a statistical technique to evaluate the effect of complex exposure mixtures on an outcome (Carrico 2015). It is a single-index method which estimates a combined mixture sum effect as well as weights determining each individual mixture component’s contributions to the sum effect. However, the model features a statistical power and Type I error (i.e., false positive) rate tradeoff, as there is a machine learning step to determine the weights that optimize the linear model fit. If the full data is used to estimate both the mixture component weights and the regression coefficients, there is high power but also a high false positive rate since coefficient p-values are calculated for a weighted mixture independent variable with weights that have already been optimized to find a large effect.

We recently proposed alternative methods based on a permutation test that should reliably allow for both high power and low false positive rate when utilizing WQS regression. The permutation test is a method of obtaining a p-value by simulating the null distribution through permutations of the data. The permutation test algorithm is described more in detail and validated in Day et al. 2022. The version of this permutation test used for a continuous outcome variable has been applied in Loftus et al. 2021, Day et al. 2021, Wallace et al. 2022, Barrett et al. 2022, and Freije et al. 2022. Another version of the permutation test adapted for WQS logistic regression with a binary outcome variable is applied in Loftus et al. 2022.

The goal of WQS regression is to determine whether an exposure mixture is associated with an outcome in a prespecified direction. It fits the following model:

\(Y = \beta_0 + \beta_1(\sum_{i=1}^{m} w_i {X_q}_i) + Z'\gamma\)

Where \(Y\) is the outcome variable, \(\beta_0\) is the intercept, \(\beta_1\) is the coefficient for the weighted quantile sum, \(\sum_{i=1}^{c} w_i {X_q}_i\) is the weighted index for the set of quantiled mixture exposures, \(Z\) is the set of covariates, and \(\gamma\) is the regression coefficients for the covariates.

A full description of the WQS methodology is described in Carrico 2015.

The WQS regression comprises two steps, for which we typically split the data into a training and validation set. Doing this reduces statistical power since we are training our model on only part of the data. On the other hand, if we skip this training/test split, we can get a skewed representation of uncertainty for the WQS coefficient. A permutation test method gives us a p-value for the uncertainty while also allowing us to use the full dataset for training and validation. This p-value is based on comparing a test value (e.g., coefficient or naive p-values) to iterated values, and so the minimum non-zero p-value that can be detected by the permutation test would be 1 divided by the number of permutation test iterations. For example, if we run 200 iterations, we’d be able to define a p-value of as low as 1/200 = 0.005, and any lower p-value would appear as zero and be interpreted as <0.005.

Run WQS regression without splitting the data, obtaining a WQS coefficient estimate.

Regress the outcome on all covariates but not the WQS variable. Then obtain the predicted outcome values and their residuals from this regression.

Randomly permute the residual values and add them to the predicted outcome values to get a new outcome variable \(y*\).

Run a WQS regression without splitting the data in which \(y*\) replaces the vector of observed outcome variables, obtaining an estimate for the WQS coefficient \(\beta_1^*\).

Repeat steps 3 and 4.

Calculate the p-value by taking the proportion of \(\beta_1^*\) values greater than the WQS coefficient estimate obtained in Step 1.

Regress each of the \(m\) mixture components on all covariates \(Z\) and obtain a \(n\) observations x \(m\) matrix with columns being the residuals from each of the \(m\) models (\(R_{m|Z}\)).

Obtain the initial fit (\(fit1\)) by running a “non-split” WQS logistic regression (or other WQS GLM) in which the binary (or count) outcome variable \(Y\) is regressed on the WQS vector and the covariates, and the mixture matrix used to calculate the WQS vector is the matrix of residuals from Step 1, \(R_{m|Z}\).

Obtain the reduced fit (\(fit2\)) by running a logistic regression (or other GLM) regressing \(Y\) on \(Z\).

Calculate the test p-value (\(p_t\)) as \(1-pchisq(d(fit1)-d(fit2),1)\) where d is the deviance for a given model and \(pchisq(x,1)\) is the probability density function of the chi-square distribution in which the input \(x\) is the difference between the deviances of \(fit1\) and \(fit2\) and there is 1 degree of freedom.

Permute the rows of the \(R_{m|Z}\) residual matrix from Step 1 and repeat Step 2 to get a series of null fit1 models (\(fit1^*\)) for K iterations. Obtain a distribution of permuted p-values (\(p^*\)) using the following formula: \(p^*=1-pchisq(fit1^*)-d(fit2),1\)).

Obtain the number of permuted \(p^*\) less than or equal to the test \(p_t\) from Step 4 and divide that by the number of iterations K to calculate the permutation test p-value.

Note that the above algorithm has been validated in WQS logistic regressions but not yet for other forms of WQS GLMs (e.g., WQS Poisson regression). However, since deviances can also be derived from those models, the algorithm should work for those other WQS GLMs as well.

`wqspt`

packageThe `wqspt`

package builds from the `gWQS`

package.

The two main functions of the `wqspt`

package are
`wqs_pt`

and `wqs_full_perm`

.

`wqs_pt`

`wqs_pt`

uses a `gwqs`

object (from the
`gWQS`

package) as an input.
To use `wqs_pt`

, we first need to run an initial
*permutation test reference WQS regression* run while setting
`validation = 0`

. Note that permutation test can currently
take in `gwqs`

inputs with the following families:
`family = gaussian(link = "identity")`

,
`family = binomial()`

with any accepted link function (e.g.,
“logit” or “probit”), `family = poisson(link = "log")`

,
`family = quasipoisson(link = "log")`

, and
`family = "negbin"`

for negative binomial. It is not
currently able to accommodate multinomial WQS regression, stratified
weights, or WQS interaction terms.

We will use this `gwqs`

object as the `model`

argument for the `wqs_pt`

function and set the following
additional parameters:

`boots`

: Number of bootstraps for the WQS regression run in each permutation test iteration. Note that we may elect a bootstrap count`boots`

lower than that specified in the`model`

object for the sake of efficiency. If we do,`wqs_pt`

will run the iterated WQS regressions for the permutation test with the number of bootstraps defined in`boots`

. If`boots`

is not specified, then the function will use the same bootstrap count in the permutation test iterated WQS regressions as that specified in the main WQS regression.`niter`

: Number of permutation test iterations.`b1_pos`

: A logical value that indicates whether beta values should be positive or negative.`rs`

: A logical value indicating whether the random subset implementation for WQS should be performed (Curtin 2019)`plan_strategy`

: Evaluation strategy for the plan function (“sequential”, “transparent”, “multisession”, “multicore”, “multiprocess”, “cluster”, or “remote”). See the documentation for the future::plan function for full details.

`seed`

: Random seed for the permutation test WQS reference run

The arguments `b1_pos`

and `rs`

should be
consistent with the inputs chosen in the `model`

object. The
`seed`

should ideally be consistent with the seed set in the
`model`

object, though this is not required.

The permutation test returns an object of class `wqs_pt`

,
which contains three sublists:

**perm_test****pval**: permutation test p-value*Linear WQS regression only***testbeta**: reference WQS coefficient \(\beta_1\) value**betas**: a vector of \(\beta_1\) values from each iteration of the permutation test*WQS GLM only***testpval**: test reference p-value**permpvals**: p-values from each iteration of the permutation test

**gwqs_main**: main gWQS object (same as`model`

input)**gwqs_perm**: permutation test reference gWQS object (NULL if model`family != "gaussian"`

or if same number of bootstraps are used in permutation test WQS regression runs as in the main run.)

The `wqs_pt`

class has a `wqspt_plot`

method to
help visualize and summarize WQS permutation test results. Plots include
(1) a forest plot of the beta WQS coefficient with the naive confidence
intervals as well as the permutation test p-value and (2) a heatmap of
the WQS weights for each mixture component.

`wqs_full_perm`

The second function `wqs_full_perm`

is a full wrapper
which implements the initial WQS regression run using gWQS::gwqs and the
permutation test in one function call.

To use `wqs_full_perm`

, you must specify the same required
arguments as needed in the `gwqs`

call. This function can run
WQS regressions and the permutation test for the following families:
`family = gaussian(link = "identity")`

,
`family = binomial()`

with any accepted link function (e.g.,
“logit” or “probit”), `family = poisson(link = "log")`

,
`family = quasipoisson(link = "log")`

, and
`family = "negbin"`

for negative binomial.
`wqs_full_perm`

is not currently able to accommodate
multinomial WQS regression, stratified weights, or WQS interaction
terms.

For the bootstrap count `b`

argument, you must specify
`b_main`

,the number of bootstraps for the *main WQS
regression* run, as well as `b_perm`

, the number of
bootstraps for the *permutation test reference WQS regression*
run (linear WQS regression only) and each WQS regression iteration of
the permutation test. As with before, you can choose to set
`b_main`

\(>\)
`b_perm`

for the sake of efficiency. Finally, you should
indicate the number of desired permutation test runs
`niter`

.

Since the WQS permutation test can be computationally intensive, you
can specify `stop_if_nonsig = TRUE`

if you do not wish for
the permutation test to proceed if the naive main WQS regression run
produces an nonsignificant result (if the p-value is below the
`stop_thresh`

argument, for which the default is 0.05). See
*Recommendations for Use* section below.

The `wqs_full_perm`

returns an object of class
`wqs_pt`

, with outputs described above.

Larger bootstrap counts and numbers of iterations lead to better estimates, though it is unclear how many iterations or bootstraps are needed for a stable estimate. We generally recommend using 1000 bootstraps on the main WQS regression and then performing 200 iterations of 200-boostrap WQS regressions for the permutation test. However, this takes a substantial amount of computational time, and one could also get relatively stable p-values for instance for 100 iterations of 100-boostrap WQS regressions for the permutation test.

We recommend that users only apply the permutation test in cases where the naive WQS test approaches significance or near-significance. If the naive test produces a non-significant result, then there likely is no reason to run the permutation test, as it will produce a result which is more conservative than the naive method (i.e., it will have a larger p-value). This is the strategy that we have applied in our published papers (Loftus et al. 2021 and Day et al. 2021).

`wqs_pt`

)This is an example where the WQS permutation test confirms a significant naive result.

We first load both the wqspt and gWQS packages.

```
library(gWQS)
library(wqspt)
```

Then we produce a simulated dataset with the following parameters:

- WQS coefficient \(\beta_1\): 0.2
- Mixture weights: 0.15 for first 5 components, 0.05 for remaining 5 components

```
# simulated dataset
<- wqs_sim(nmix = 10,
sim_res1 ncovrt = 10,
nobs = 1000,
ntruewts = 10,
ntruecovrt = 5,
truewqsbeta = 0.2,
truebeta0 = 2,
truewts = c(0.15, 0.15, 0.15, 0.15, 0.15,
0.05, 0.05, 0.05, 0.05, 0.05),
q = 10,
seed = 16)
<- sim_res1$Data
sim_data1
<- formula(paste0("y ~ wqs + ", paste(paste0("C",1:10), collapse="+"))) wqs_form
```

Now we run WQS regression on the simulated data.

```
# mixture names
<- colnames(sim_data1)[2:11]
mix_names1
# create reference wqs object
<- gwqs(wqs_form, mix_name = mix_names1, data = sim_data1, q = 10, validation = 0,
wqs_main1 b = 20, b1_pos = T, plan_strategy = "multicore", family = "gaussian",
seed = 16)
```

Finally, we can perform a permutation test on the WQS object.

```
# run permutation test
<- wqs_pt(wqs_main1, niter = 50, boots = 5, b1_pos = T, seed = 16) perm_test_res1
```

Note that the naive WQS regression produces a significant result for the WQS coefficient (p-value < 0.001).

`<- summary(perm_test_res1$gwqs_main) main_sum1 `

```
$coefficients
main_sum1#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.716588215 0.13195745 13.0086490 7.960621e-36
#> wqs 0.271247707 0.02846484 9.5292204 1.183377e-20
#> C1 0.915972918 0.03235401 28.3109569 1.372658e-129
#> C2 1.837398541 0.03166084 58.0337820 1.460458e-320
#> C3 -1.567906582 0.03096844 -50.6291836 9.835375e-277
#> C4 -0.261844308 0.03025602 -8.6542893 1.987022e-17
#> C5 -0.350600283 0.03111404 -11.2682323 8.594271e-28
#> C6 0.017181769 0.03214707 0.5344739 5.931340e-01
#> C7 0.028020482 0.03007333 0.9317386 3.516993e-01
#> C8 0.006594393 0.03040937 0.2168540 8.283669e-01
#> C9 -0.075174923 0.03029635 -2.4813194 1.325512e-02
#> C10 -0.003960226 0.03079737 -0.1285898 8.977084e-01
```

The permutation test confirms the significance of this result.

```
$perm_test$pval
perm_test_res1#> [1] 0
```

Here are the summary plots:

`wqspt_plot(perm_test_res1)$FullPlot`

`wqs_pt`

)This is an example where the WQS permutation test goes against a (false positive) significant naive result.

We produce a simulated dataset with the following parameters:

- WQS coefficient \(\beta_1\): 0
- Mixture weights: 0.15 for first 5 components, 0.05 for remaining 5 components

```
<- wqs_sim(nmix = 10,
sim_res2 ncovrt = 10,
nobs = 1000,
ntruewts = 10,
ntruecovrt = 5,
truewqsbeta = 0,
truebeta0 = 0.1,
truewts = c(0.15, 0.15, 0.15, 0.15, 0.15,
0.05, 0.05, 0.05, 0.05, 0.05),
q = 10,
seed = 16)
<- sim_res2$Data sim_data2
```

Now we run WQS regression as well as the permutation test on the simulated data.

```
# mixture names
<- colnames(sim_data2)[2:11]
mix_names2
# create reference wqs object
<- gwqs(wqs_form, mix_name = mix_names2, data = sim_data2, q = 10, validation = 0,
wqs_main2 b = 20, b1_pos = T, plan_strategy = "multicore", family = "gaussian",
seed = 16)
# run permutation test
<- wqs_pt(wqs_main2, niter = 50, boots = 5, b1_pos = T, seed = 16) perm_test_res2
```

Note that the naive WQS regression produces a significant result for the WQS coefficient (p-value = 0.002).

`<- summary(perm_test_res2$gwqs_main) main_sum2 `

```
$coefficients
main_sum2#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.242259146 0.12595439 -1.9233879 5.471846e-02
#> wqs 0.084337304 0.02709982 3.1120982 1.910997e-03
#> C1 0.916079837 0.03236572 28.3040164 1.530151e-129
#> C2 1.836539134 0.03166066 58.0069707 2.079028e-320
#> C3 -1.568725043 0.03099602 -50.6105377 1.279103e-276
#> C4 -0.260464045 0.03027002 -8.6046855 2.973949e-17
#> C5 -0.350431787 0.03114219 -11.2526386 1.005351e-27
#> C6 0.016352846 0.03216484 0.5084075 6.112811e-01
#> C7 0.027852144 0.03008416 0.9258077 3.547720e-01
#> C8 0.006968031 0.03042095 0.2290537 8.188746e-01
#> C9 -0.074755354 0.03031780 -2.4657251 1.384261e-02
#> C10 -0.003661685 0.03081250 -0.1188377 9.054281e-01
```

The permutation test, however, repudiates the significance of these plots (p = 0.12).

```
$perm_test$pval
perm_test_res2#> [1] 0.12
```

Here are the summary plots:

`wqspt_plot(perm_test_res2)$FullPlot`

`wqs_full_perm`

)Using the same data as in Example 1, we run the WQS regression with
permutation test using the full wrapper `wqs_full_perm`

call.

```
<- wqs_full_perm(wqs_form,
perm_test_res3 data = sim_data1,
mix_name = mix_names1,
q = 10,
b_main = 20,
b_perm = 5,
b1_pos = T,
niter = 50,
seed = 16,
plan_strategy = "multicore")
```

`wqspt_plot(perm_test_res3)$FullPlot`

`wqs_full_perm`

on binary outcome
example)This is an example in which we apply the logistic regression version of the WQS permutation test.

We produce a simulated dataset with the following parameters:

- WQS coefficient \(\beta_1\): 0.4
- Mixture weights: 0.15 for first 5 components, 0.05 for remaining 5 components

```
<- wqs_sim(nmix = 10,
sim_res3 ncovrt = 10,
nobs = 1000,
ntruewts = 10,
ntruecovrt = 5,
truewqsbeta = 0.4,
truebeta0 = -2.5,
truewts = c(0.15, 0.15, 0.15, 0.15, 0.15,
0.05, 0.05, 0.05, 0.05, 0.05),
q = 10,
family = "binomial",
seed = 16)
<- sim_res3$Data
sim_data3
<- wqs_full_perm(wqs_form,
perm_test_res4 data = sim_data3,
mix_name = mix_names1,
q = 10,
b_main = 20,
b_perm = 5,
b1_pos = T,
niter = 50,
seed = 16,
plan_strategy = "multicore",
family = "binomial")
```

`wqspt_plot(perm_test_res4)$FullPlot`

Barrett, E. S., Corsetti, M., Day, D., Thurston, S. W., Loftus, C. T., Karr, C. J., … & Sathyanarayana, S. (2022). Prenatal phthalate exposure in relation to placental corticotropin releasing hormone (pCRH) in the CANDLE cohort. Environment International, 160, 107078.

Carrico, C., Gennings, C., Wheeler, D. C., & Factor-Litvak, P. (2015). Characterization of weighted quantile sum regression for highly correlated data in a risk analysis setting. Journal of Agricultural, Biological, and Environmental Statistics, 20(1), 100-120.

Curtin, P., Kellogg, J., Cech, N., & Gennings, C. (2019). A random subset implementation of weighted quantile sum (WQSRS) regression for analysis of high-dimensional mixtures. Communications in Statistics-Simulation and Computation, 50(4), 1119-1134.

Day, D. B., Collett, B. R., Barrett, E. S., Bush, N. R., Swan, S. H., Nguyen, R. H., … & Sathyanarayana, S. (2021). Phthalate mixtures in pregnancy, autistic traits, and adverse childhood behavioral outcomes. Environment International, 147, 106330.

Day, D. B., Sathyanarayana, S., LeWinn, K. Z., Karr, C. J., Mason, W. A., & Szpiro, A. A. (2022). A permutation test-based approach to strengthening inference on the effects of environmental mixtures: comparison between single index analytic methods. Environmental Health Perspectives, 130(8).

Freije, S. L., Enquobahrie, D. A., Day, D. B., Loftus, C., Szpiro, A. A., Karr, C. J., … & Sathyanarayana, S. (2022). Prenatal exposure to polycyclic aromatic hydrocarbons and gestational age at birth. Environment International, 164, 107246.

Loftus, C. T., Bush, N. R., Day, D. B., Ni, Y., Tylavsky, F. A., Karr, C. J., … & LeWinn, K. Z. (2021). Exposure to prenatal phthalate mixtures and neurodevelopment in the Conditions Affecting Neurocognitive Development and Learning in Early childhood (CANDLE) study. Environment International, 150, 106409.

Loftus, C., Szpiro, A. A., Workman, T., Wallace, E. R., Hazlehurst, M. F., Day, D. B., … & Karr, C. J. (2022). Maternal exposure to urinary polycyclic aromatic hydrocarbons (PAH) in pregnancy and childhood asthma in a pooled multi-cohort study. Environment International, 170, p.107494.

Renzetti, S., Curtin, P., Just, A. C., Bello, G., & Gennings, C. (2021). ‘gWQS: Generalized Weighted Quantile Sum Regression’. https://CRAN.R-project.org/package=gWQS.

Wallace, E. R., Ni, Y., Loftus, C. T., Sullivan, A., Masterson, E., Szpiro, A., … & Sathyanarayana, S. (2022). Prenatal urinary metabolites of polycyclic aromatic hydrocarbons and toddler cognition, language, and behavior. Environment International, 159, 107039.