Note that you can cite this work as:
Jolicoeur-Martineau, A., Wazana, A., Szekely, E., Steiner, M., Fleming, A. S., Kennedy, J. L., Meaney M. J. & Greenwood, C.M. (2017). Alternating optimization for GxE modelling with weighted genetic and environmental scores: examples from the MAVAN study. arXiv preprint arXiv:1703.08111.
The Latent Environmental & Genetic InTeraction (LEGIT) model is an interaction model with two latent variables: \(\mathbf{g}\) a weighted sum of genetic variants (genetic score) and \(\mathbf{e}\) a weighted sum of environmental variables (environmental score).
Assuming \(\mathbf{g}_1\),…,\(\mathbf{g}_k\) are \(k\) genetic variants of interest, generally represented as the number of dominant alleles (0, 1, 2) or the absence or presence of a dominant allele (0, 1) and \(\mathbf{e}_1\),…,\(\mathbf{e}_s\) are \(s\) environments of interest, generally represented as ordinal scores (0, 1, 2 …) or continuous variables. We define the genetic score \(\mathbf{g}\) and environmental score \(\mathbf{e}\) as:
\[\mathbf{g}=\sum_{i=1}^{k} p_i \mathbf{g}_j \] \[\mathbf{e}=\sum_{l=1}^{s} q_l \mathbf{e}_l \] where \(\|p\|_1=\sum_{i=1}^{k}p_j=1\) and \(\|q\|_1= \sum_{l=1}^{s}q_l=1\). The weights can be interpreted as the relative contributions of each genetic variant or environment.
The weights of these scores are estimated within a generalized linear model of the form :
\[\mathbf{E}[\mathbf{y}] = g^{-1}(f(\mathbf{g},\mathbf{e},X|\beta)+X_{covs} \beta_{covs})\] where \(g\) is the link function, \(f\) is a linear function between \(\mathbf{g}\), \(\mathbf{e}\) and other variables from the matrix \(X\) and \(X_{covs}\) is a matrix of additionnal covariates.
Although this approach was originally made for GxE modelling, it is flexible and does not require the use of genetic and environmental variables. It can also handle more than 2 latent variables (rather than just G and E) and 3-way interactions or more.
For a two-way \(G\times E\) model, we would define \(f\) as:
\[f(\mathbf{g},\mathbf{e},X|\beta) = \beta_0+\beta_e \mathbf{e}+\beta_g \mathbf{g}+\beta_{eg} \mathbf{e}\mathbf{g}\]
For a three-way \(G\times E \times Z\) model, we would define \(f\) as:
\[f(\mathbf{g},\mathbf{e},X|\beta) = \beta_0+\beta_e \mathbf{e}+\beta_g \mathbf{g}+\beta_z \mathbf{z}+\beta_{eg} \mathbf{eg}+\beta_{ez} \mathbf{ez}+\beta_{zg} \mathbf{zg}+\beta_{egz} \mathbf{egz}\]
Here is an example with 2 latent variables and a 2-way interaction (see arxiv) :
Here is an example with 3 latent variables and a 3-way interaction (see arxiv) :
Full details of the algorithm are available on arXiv.
In the LEGIT package, we include the following functions:
To fit a LEGIT model with 2 latent variables (G and E)
To fit a IMLEGIT model (LEGIT model with more than 2 latent variables)
To simulate examples of GxE models
Others
Let’s look at a three-way interaction model with continuous outcome:
\[\mathbf{g}_j \sim Binomial(n=1,p=.30) \\ j = 1, 2, 3, 4\] \[\mathbf{e}_l \sim Normal(\mu=0,\sigma=1.5) \\ l = 1, 2, 3\] \[\mathbf{z} \sim Normal(\mu=3,\sigma=1)\] \[\mathbf{g} = .2\mathbf{g}_1 + .15\mathbf{g}_2 - .3\mathbf{g}_3 + .1\mathbf{g}_4 + .05\mathbf{g}_1\mathbf{g}_3 + .2\mathbf{g}_2\mathbf{g}_3 \] \[ \mathbf{e} = -.45\mathbf{e}_1 + .35\mathbf{e}_2 + .2\mathbf{e}_3\] \[\mathbf{\mu} = -2 + 2\mathbf{g} + 3\mathbf{e} + \mathbf{z} + 5\mathbf{ge} + 2\mathbf{gz} - 1.5\mathbf{ez} + 2\mathbf{gez} \] \[ y \sim Normal(\mu,\sigma=1)\]
Let’s load the package and look at the dataset.
## $data
## y y_true z
## 1 3.80723699 4.67808801 3.184193
## 2 7.96819988 7.24948932 3.752280
## 3 4.49051121 4.37985833 3.591745
## 4 -0.01796159 0.06050518 2.016947
## 5 0.71520307 1.13569353 2.723936
##
## $G
## g1 g2 g3 g4 g1_g3 g2_g3
## 1 1 1 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 1 1 0 0 1
## 4 0 0 0 1 0 0
## 5 0 0 0 0 0 0
##
## $E
## e1 e2 e3
## 1 0.5354793 0.701520767 1.259626
## 2 4.0751277 -1.340701085 1.058013
## 3 3.4221779 -0.460992449 1.958947
## 4 0.4860308 -0.007233633 -2.081994
## 5 2.8441006 1.482246224 1.909375
##
## $coef_G
## [1] 0.20 0.15 -0.30 0.10 0.05 0.20
##
## $coef_E
## [1] -0.45 0.35 0.20
##
## $coef_main
## [1] -2.0 2.0 3.0 1.0 5.0 -1.5 2.0 2.0
Currently “data” contains the outcome, the true outcome without the noise and the covariate \(z\). This dataset should always contain the outcome and all covariates (except for the genes and environments from \(\mathbf{g}\) and \(\mathbf{e}\)).”G” contains the genetic variants and “E” contains the environments. This is all you need to fit a LEGIT model.
We generate N=250 training observations and 100 test observations.
train = example_3way(N=250, sigma=1, logit=FALSE, seed=7)
test = example_3way(N=100, sigma=1, logit=FALSE, seed=6)
We fit the model with the default starting point.
## Converged in 11 iterations
## $fit_main
##
## Call:
## stats::glm(formula = formula, family = family, data = data, model = FALSE,
## y = FALSE)
##
## Coefficients: (-7 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.77075 0.22342 -7.926 9.09e-14 ***
## G 0.15934 1.12858 0.141 0.888
## E -2.86208 0.27217 -10.516 < 2e-16 ***
## z 0.94193 0.06704 14.050 < 2e-16 ***
## G:E -5.69624 1.38253 -4.120 5.25e-05 ***
## G:z 2.61973 0.34190 7.662 4.77e-13 ***
## E:z 1.43411 0.07991 17.947 < 2e-16 ***
## G:E:z -1.82548 0.41205 -4.430 1.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.065046)
##
## Null deviance: 2414.03 on 249 degrees of freedom
## Residual deviance: 250.29 on 235 degrees of freedom
## AIC: 741.75
##
## Number of Fisher Scoring iterations: 2
##
##
## $fit_genes
##
## Call:
## stats::glm(formula = formula_b, family = family, data = data,
## model = FALSE, y = FALSE)
##
## Coefficients: (-9 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## g1 0.18333 0.01051 17.447 < 2e-16 ***
## g2 0.15867 0.01422 11.156 < 2e-16 ***
## g3 -0.29336 0.01269 -23.115 < 2e-16 ***
## g4 0.09770 0.01090 8.966 < 2e-16 ***
## g1_g3 0.06273 0.02247 2.791 0.00568 **
## g2_g3 0.20421 0.02329 8.770 3.68e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.06503)
##
## Null deviance: 2414.03 on 250 degrees of freedom
## Residual deviance: 250.28 on 235 degrees of freedom
## AIC: 741.75
##
## Number of Fisher Scoring iterations: 2
##
##
## $fit_env
##
## Call:
## stats::glm(formula = formula_c, family = family, data = data,
## model = FALSE, y = FALSE)
##
## Coefficients: (-12 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## e1 0.43440 0.01486 29.24 <2e-16 ***
## e2 -0.35307 0.01756 -20.11 <2e-16 ***
## e3 -0.21253 0.01614 -13.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.065045)
##
## Null deviance: 2414.03 on 250 degrees of freedom
## Residual deviance: 250.29 on 235 degrees of freedom
## AIC: 741.75
##
## Number of Fisher Scoring iterations: 2
This is very close to the original model. We can now use the test dataset to find the validation \(R^2\).
ssres = sum((test$data$y - predict(fit_default, test$data, test$G, test$E))^2)
sstotal = sum((test$data$y - mean(test$data$y))^2)
R2 = 1 - ssres/sstotal
R2
## [1] 0.8809351
We can also plot the model at specific values of z.
cov_values = c(3)
names(cov_values) = "z"
plot(fit_default, cov_values = cov_values,cex.leg=1.4, cex.axis=1.5, cex.lab=1.5)
Now, let’s see what happens if we introduce variables that are not in the model. Let’s add these irrelevant genetic variants:
\[\mathbf{g'}_b \sim Binomial(n=1,p=.30) \\ b = 1, 2, 3, 4, 5\]
g1_bad = rbinom(250,1,.30)
g2_bad = rbinom(250,1,.30)
g3_bad = rbinom(250,1,.30)
g4_bad = rbinom(250,1,.30)
g5_bad = rbinom(250,1,.30)
train$G = cbind(train$G, g1_bad, g2_bad, g3_bad, g4_bad, g5_bad)
Let’s do a forward search of the genetic variants using the BIC and see if we can recover the right subset of variables.
forward_genes_BIC = stepwise_search(train$data, genes_extra=train$G, env_original=train$E, formula=y ~ E*G*z, search_type="forward", search="genes", search_criterion="BIC", interactive_mode=FALSE)
## Keeping only variables with p-values smaller than 0.2 and which inclusion decrease the AIC
## Forward search of the genes to find the model with the lowest BIC
##
## [Iteration: 1]
## Adding gene: g3 (Criterion before = Inf; after = 1125.42059)
## [Iteration: 2]
## Adding gene: g2 (Criterion before = 1125.42059; after = 1042.75045)
## [Iteration: 3]
## Adding gene: g1 (Criterion before = 1042.75045; after = 888.44286)
## [Iteration: 4]
## Adding gene: g4 (Criterion before = 888.44286; after = 846.44854)
## [Iteration: 5]
## Adding gene: g2_g3 (Criterion before = 846.44854; after = 799.55835)
## [Iteration: 6]
## Adding gene: g1_g3 (Criterion before = 799.55835; after = 798.62631)
## [Iteration: 7]
## No gene added
We recovered the right subset! Now what if we did a backward search using the AIC?
backward_genes_AIC = stepwise_search(train$data, genes_original=train$G, env_original=train$E, formula=y ~ E*G*z, search_type="backward", search="genes", search_criterion="AIC", interactive_mode=FALSE)
## Dropping only variables with p-values bigger than 0.01 and which removal decrease the AIC
## Backward search of the genes to find the model with the lowest AIC
##
## [Iteration: 1]
## Removing gene: g3_bad (Criterion before = 749.77592; after = 747.50098)
## [Iteration: 2]
## Removing gene: g5_bad (Criterion before = 747.50098; after = 745.40215)
## [Iteration: 3]
## Removing gene: g4_bad (Criterion before = 745.40215; after = 743.57487)
## [Iteration: 4]
## Removing gene: g1_bad (Criterion before = 743.57487; after = 741.8472)
## [Iteration: 5]
## Removing gene: g2_bad (Criterion before = 741.8472; after = 741.75015)
## [Iteration: 6]
## No gene removed
We deleted the irrevelant genes and obtained the right subset of variables! The stepwise_search function also has an interactive mode where the user decides which variable should be added/dropped at every step. We can only show the first iteration because the algorithm does’nt receive an input from the user in the vignette but normally you can control the variables added or removed from the stepwise search. This is what the interactive mode looks like:
forward_genes_BIC = stepwise_search(train$data, genes_extra=train$G, env_original=train$E, formula=y ~ E*G*z, search_type="bidirectional-forward", search="genes", search_criterion="BIC", interactive_mode=TRUE)
## <<~ Interative mode enabled ~>>
## Keeping only variables with p-values smaller than 0.2 and which inclusion decrease the AIC
## Dropping only variables with p-values bigger than 0.01 and which removal decrease the AIC
## Bidirectional search of the genes to find the model with the lowest BIC
##
## [Iteration: 1]
## variable N_old N_new p_value AIC_old AIC_new AICc_old AICc_new BIC_old
## 1 g3 NA 250 0.00000 Inf 1086.685 Inf 1087.794 Inf
## 2 g1 NA 250 0.00000 Inf 1093.552 Inf 1094.661 Inf
## 3 g2 NA 250 0.00000 Inf 1135.100 Inf 1136.209 Inf
## 4 g1_bad NA 250 0.00000 Inf 1155.719 Inf 1156.828 Inf
## 5 g4 NA 250 0.00013 Inf 1164.285 Inf 1165.394 Inf
## BIC_new
## 1 1125.421
## 2 1132.288
## 3 1173.836
## 4 1194.455
## 5 1203.021
## Enter the index of the gene to be added:
## No gene added
Manually forcing \(\mathbf{g}_3\) inclusion since the interactive mode cannot progress without a user, we get that the second iteration is:
forward_genes_BIC = stepwise_search(train$data, genes_original=train$G[,3,drop=FALSE], genes_extra=train$G[,-3], env_original=train$E, formula=y ~ E*G*z, search_type="bidirectional-forward", search="genes", search_criterion="BIC", interactive_mode=TRUE)
## <<~ Interative mode enabled ~>>
## Keeping only variables with p-values smaller than 0.2 and which inclusion decrease the AIC
## Dropping only variables with p-values bigger than 0.01 and which removal decrease the AIC
## Bidirectional search of the genes to find the model with the lowest BIC
##
## [Iteration: 1]
## variable N_old N_new p_value AIC_old AIC_new AICc_old AICc_new BIC_old
## 1 g2 250 250 0.000000 1086.685 1000.493 1087.794 1001.809 1125.421
## 2 g2_g3 250 250 0.000000 1086.685 1006.253 1087.794 1007.570 1125.421
## 3 g1 250 250 0.000000 1086.685 1013.285 1087.794 1014.601 1125.421
## 4 g1_g3 250 250 0.000002 1086.685 1068.051 1087.794 1069.367 1125.421
## 5 g4 250 250 0.012411 1086.685 1084.133 1087.794 1085.450 1125.421
## BIC_new
## 1 1042.750
## 2 1048.511
## 3 1055.542
## 4 1110.308
## 5 1126.391
## Enter the index of the gene to be added:
## No gene added
With the interactive mode, you can try alternative pathways, rather than simply adding/dropping the best/worst variable everytime.
Let’s take a quick look at a simple two-way example with binary outcome:
\[\mathbf{g}_j \sim Binomial(n=1,p=.30) \\ j = 1, 2, 3, 4\] \[\mathbf{e}_l \sim Normal(\mu=0,\sigma=1.5) \\ l = 1, 2, 3\] \[\mathbf{g} = .2\mathbf{g}_1 + .15\mathbf{g}_2 - .3\mathbf{g}_3 + .1\mathbf{g}_4 + .05\mathbf{g}_1\mathbf{g}_3 + .2\mathbf{g}_2\mathbf{g}_3 \] \[ \mathbf{e} = -.45\mathbf{e}_1 + .35\mathbf{e}_2 + .2\mathbf{e}_3\] \[\mathbf{\mu} = -1 + 2\mathbf{g} + 3\mathbf{e} + 4\mathbf{ge} \] \[ y \sim Binomial(n=1,p=logit(\mu))\]
We generate N=1000 training observations.
We fit the model with the default starting point.
## Converged in 6 iterations
## $fit_main
##
## Call:
## stats::glm(formula = formula, family = family, data = data, model = FALSE,
## y = FALSE)
##
## Coefficients: (-7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9892 0.1065 -9.293 < 2e-16 ***
## G 2.0863 0.6747 3.092 0.001989 **
## E 3.2313 0.2148 15.042 < 2e-16 ***
## G:E 4.3265 1.1960 3.617 0.000298 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1351.50 on 999 degrees of freedom
## Residual deviance: 692.69 on 989 degrees of freedom
## AIC: 714.69
##
## Number of Fisher Scoring iterations: 6
##
##
## $fit_genes
##
## Call:
## stats::glm(formula = formula_b, family = family, data = data,
## model = FALSE, y = FALSE)
##
## Coefficients: (-5 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## g1 0.12059 0.07512 1.605 0.10842
## g2 0.08314 0.06927 1.200 0.23005
## g3 -0.28041 0.05142 -5.454 4.94e-08 ***
## g4 0.02208 0.05359 0.412 0.68030
## g1_g3 0.06894 0.12845 0.537 0.59146
## g2_g3 0.42484 0.15154 2.804 0.00505 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1351.50 on 1000 degrees of freedom
## Residual deviance: 692.69 on 989 degrees of freedom
## AIC: 714.69
##
## Number of Fisher Scoring iterations: 6
##
##
## $fit_env
##
## Call:
## stats::glm(formula = formula_c, family = family, data = data,
## model = FALSE, y = FALSE)
##
## Coefficients: (-8 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## e1 -0.43085 0.02905 -14.832 <2e-16 ***
## e2 0.35751 0.02613 13.681 <2e-16 ***
## e3 0.21164 0.02156 9.814 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1351.50 on 1000 degrees of freedom
## Residual deviance: 692.69 on 989 degrees of freedom
## AIC: 714.69
##
## Number of Fisher Scoring iterations: 6
We are a little off, especially with regards to the weights of the genetic variants. This is because there is substantial loss of information with binary outcomes. To assess the quality of the fit, we are going to do a 5-Folds cross-validation.
cv_5folds_bin = LEGIT_cv(train$data, train$G, train$E, y ~ G*E, cv_iter=1, cv_folds=5, classification=TRUE, family=binomial, seed=777)
pROC::plot.roc(cv_5folds_bin$roc_curve[[1]])
Although the weights of the genetic variants are a bit off, the model predictive power is good.
We can also plot the model.
In example 1 we looked at a 3-way model but only two of the three variables were latent. Let’s construct the model another time but this time with three latent variables:
\[\mathbf{g}_j \sim Binomial(n=1,p=.30) \\ j = 1, 2, 3, 4\] \[\mathbf{e}_l \sim Normal(\mu=0,\sigma=1.5) \\ l = 1, 2, 3\] \[\mathbf{z}_t \sim Normal(\mu=3,\sigma=1)\] \[\mathbf{g} = .2\mathbf{g}_1 + .15\mathbf{g}_2 - .3\mathbf{g}_3 + .1\mathbf{g}_4 + .05\mathbf{g}_1\mathbf{g}_3 + .2\mathbf{g}_2\mathbf{g}_3 \] \[ \mathbf{e} = -.45\mathbf{e}_1 + .35\mathbf{e}_2 + .2\mathbf{e}_3\] \[ \mathbf{z} = .15\mathbf{z}_1 + .60\mathbf{z}_2 + .25\mathbf{z}_3\] \[\mathbf{\mu} = -2 + 2\mathbf{g} + 3\mathbf{e} + \mathbf{z} + 5\mathbf{ge} + 2\mathbf{gz} - 1.5\mathbf{ez} + 2\mathbf{gez} \] \[ y \sim Normal(\mu,\sigma=1)\]
Let’s load the package and look at the dataset.
## $data
## y y_true
## 1 4.6022635 3.383713
## 2 6.5474680 7.246785
## 3 2.5382622 2.823695
## 4 0.2960498 1.607602
## 5 0.6674754 1.058488
##
## $latent_var
## $latent_var$G
## g1 g2 g3 g4 g1_g3 g2_g3
## 1 1 1 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 1 1 0 0 1
## 4 0 0 0 1 0 0
## 5 0 0 0 0 0 0
##
## $latent_var$E
## e1 e2 e3
## 1 0.5354793 0.701520767 1.259626
## 2 4.0751277 -1.340701085 1.058013
## 3 3.4221779 -0.460992449 1.958947
## 4 0.4860308 -0.007233633 -2.081994
## 5 2.8441006 1.482246224 1.909375
##
## $latent_var$Z
## z1 z2 z3
## 1 3.184193 2.129149 2.437874
## 2 3.752280 3.718711 3.997513
## 3 3.591745 3.110653 1.894870
## 4 2.016947 2.921533 2.857712
## 5 2.723936 2.579510 3.314995
##
##
## $coef_G
## [1] 0.20 0.15 -0.30 0.10 0.05 0.20
##
## $coef_E
## [1] -0.45 0.35 0.20
##
## $coef_Z
## [1] 0.15 0.75 0.10
##
## $coef_main
## [1] -2.0 2.0 3.0 1.0 5.0 -1.5 2.0 2.0
Currently “data” contains the outcome and the true outcome without the noise. This dataset should always contain the outcome and all covariates not included in latent_var.The latent variables are called “G”, “E” and “Z” respectively.
We generate N=250 training observations.
We fit the model with the default starting point.
## Converged in 9 iterations
## $fit_main
##
## Call:
## stats::glm(formula = formula, family = family, data = data, model = FALSE,
## y = FALSE)
##
## Coefficients: (-9 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.73742 0.28854 -6.021 6.67e-09 ***
## G 3.69821 1.44085 2.567 0.010894 *
## E -2.88702 0.35440 -8.146 2.29e-14 ***
## Z 0.93184 0.09084 10.258 < 2e-16 ***
## G:E -6.43857 1.73999 -3.700 0.000269 ***
## G:Z 1.50043 0.45407 3.304 0.001102 **
## E:Z 1.41024 0.11155 12.643 < 2e-16 ***
## G:E:Z -1.87914 0.56505 -3.326 0.001025 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.072187)
##
## Null deviance: 2042.53 on 249 degrees of freedom
## Residual deviance: 249.82 on 233 degrees of freedom
## AIC: 745.29
##
## Number of Fisher Scoring iterations: 2
##
##
## $fit_G
##
## Call:
## stats::glm(formula = formula_step[[i]], family = family, data = data,
## model = FALSE, y = FALSE)
##
## Coefficients: (-11 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## g1 0.18509 0.01039 17.806 < 2e-16 ***
## g2 0.13924 0.01288 10.808 < 2e-16 ***
## g3 -0.30681 0.01308 -23.456 < 2e-16 ***
## g4 0.09764 0.01067 9.154 < 2e-16 ***
## g1_g3 0.06429 0.02208 2.912 0.00394 **
## g2_g3 0.20693 0.02253 9.184 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.072171)
##
## Null deviance: 2042.53 on 250 degrees of freedom
## Residual deviance: 249.82 on 233 degrees of freedom
## AIC: 745.29
##
## Number of Fisher Scoring iterations: 2
##
##
## $fit_E
##
## Call:
## stats::glm(formula = formula_step[[i]], family = family, data = data,
## model = FALSE, y = FALSE)
##
## Coefficients: (-14 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## e1 0.46366 0.01715 27.04 <2e-16 ***
## e2 -0.36535 0.01861 -19.63 <2e-16 ***
## e3 -0.17098 0.01612 -10.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.072175)
##
## Null deviance: 2042.53 on 250 degrees of freedom
## Residual deviance: 249.82 on 233 degrees of freedom
## AIC: 745.29
##
## Number of Fisher Scoring iterations: 2
##
##
## $fit_Z
##
## Call:
## stats::glm(formula = formula_step[[i]], family = family, data = data,
## model = FALSE, y = FALSE)
##
## Coefficients: (-14 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## z1 0.11464 0.03060 3.747 0.000226 ***
## z2 0.73843 0.03279 22.522 < 2e-16 ***
## z3 0.14692 0.03185 4.613 6.56e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.072175)
##
## Null deviance: 2042.53 on 250 degrees of freedom
## Residual deviance: 249.82 on 233 degrees of freedom
## AIC: 745.29
##
## Number of Fisher Scoring iterations: 2
Let’s add irrelevant genes and try a forward search as before. Note that search = 1 means that we will search for the first latent variable which is “G”. We could also set search = 0 to search through all latent variables.
\[\mathbf{g'}_b \sim Binomial(n=1,p=.30) \\ b = 1, 2, 3, 4, 5\]
g1_bad = rbinom(250,1,.30)
g2_bad = rbinom(250,1,.30)
g3_bad = rbinom(250,1,.30)
g4_bad = rbinom(250,1,.30)
g5_bad = rbinom(250,1,.30)
G_new = cbind(g1_bad, g2_bad, g3_bad, g4_bad, g5_bad)
forward_genes_BIC = stepwise_search_IM(train$data, latent_var_original=train$latent_var, latent_var_extra=list(G=G_new, E=NULL, Z=NULL), formula=y ~ E*G*Z, search_type="forward", search=1, search_criterion="BIC", interactive_mode=FALSE)
## Keeping only variables with p-values smaller than 0.2 and which inclusion decrease the AIC
## Forward search of the elements from G to find the model with the lowest BIC
##
## [Iteration: 1]
## No element from G was added
We didn’t add any irrelevant genes.