`library(SPCompute)`

When the trait of interest is binary, we assume the following logistic regression model: \[\log\bigg(\frac{\text{P}(Y_i=1|X)}{1-\text{P}(Y_i=1|X)}\bigg) = X\beta,\] where the design matrix contains a column of \(1's\) for the intercept, a column of genotypes \(G\) and a column for non-genetic covariate \(E\) (optional).

The regression parameter vector \(\beta\) contains \(\beta_0\), \(\beta_G\) and \(\beta_E\), respectively represent intercept parameter, genetic effect and covariate effect.

To compute power or sample size, the user will need to specify the following information:

`preva`

, the prevalence rate of the disease in the population, defined as \(\text{P}(Y=1)\).`betaG`

, the true effect size of genetic effect.`pG`

, the minor allele frequency of the SNP.

If there exists non-genetic covariate \(E\) in the model, the user will also need to specify the following parameters:

`betaE`

, the true effect size of non-genetic covariate effect.`gammaG`

, the parameter that specifies the dependency between \(E\) and \(G\).

If the non-genetic covariate \(E\) is binary, the following marginal information on \(E\) should be specified:

`pE`

, the population prevalence rate of \(E\), defined as \(\text{P}(E=1)\). Otherwise if it is continuous, the required marginal information should be:`muE`

, the population mean of \(E\).`sigmaE`

, the population SD of \(E\).

These parameters should be summarized into a list, with appropriate names, such as the following when covariate is binary:

`<- list(preva = 0.2, pG = 0.1, betaG = 0.1, betaE = 0.3, pE = 0.3, gammaG = 0) para `

To compute power given a sample size `n`

, the user can use
the function `Compute_Size`

, after specifying the argument
for:

`parameters`

, a list of true parameter values defined as above.`n`

, the given sample size`covariate`

, the type of covariate, should be “binary”, “continuous” or “none”.`mode`

, the genetic mode, should be “additive”, “dominant” or “recessive”.`alpha`

, the significance level.`method`

, the method used to do the computation. Should be “semi-sim” (faster for large sample) or “expand” (better for smaller sample).

For example:

```
Compute_Power(parameters = para, n = 2e4, covariate = "binary", mode = "additive", alpha = 0.05, method = "semi-sim")
#> [1] 0.6908458
```

or:

```
Compute_Power(parameters = para, n = 2e4, covariate = "binary", mode = "additive", alpha = 0.05, method = "expand")
#> [1] 0.6884145
```

Similarly, to compute the required sample size to achieve a certain
power, one just needs to change the argument `n`

to
`PowerAim`

, which defines the target power, and uses the
function `Compute_Size`

:

```
round(Compute_Size(parameters = para, PowerAim = 0.8, covariate = "binary", mode = "additive", alpha = 0.05, method = "semi-sim"))
#> [1] 25978
round(Compute_Size(parameters = para, PowerAim = 0.8, covariate = "binary", mode = "additive", alpha = 0.05, method = "expand"))
#> [1] 26128
```

or if the genetic mode is dominant:

```
round(Compute_Size(parameters = para, PowerAim = 0.8, covariate = "binary", mode = "dominant", alpha = 0.05, method = "semi-sim"))
#> [1] 30872
round(Compute_Size(parameters = para, PowerAim = 0.8, covariate = "binary", mode = "dominant", alpha = 0.05, method = "expand"))
#> [1] 30863
```

If one wants to have more accurate estimate of sample size or power
when using the method `semi-sim`

, the parameter
`B`

can be set to larger value, which will take longer
run-time:

```
round(Compute_Size(parameters = para, PowerAim = 0.8, covariate = "binary", mode = "dominant", alpha = 0.05, method = "semi-sim", B = 5e5))
#> [1] 30817
```

Unlike the case to compute power, when computing sample size, it is
always recommended to use the method `semi-sim`

, since the
method `expand`

will not work when the sample size is
extremely small, and will work at a much slower speed when the sample
size is extremely large.

When the trait of interest is continuous, we assume the following linear regression model: \[Y = X\beta + \epsilon,\] where the noise \(\epsilon\sim N(0,\sigma_\epsilon^2)\).

To compute power or sample size, the procedures will be always the
same as in the binary case, except that the parameter `preva`

will be replaced by the set of parameters:

`TraitMean`

, specifying the population mean of the continuous trait, i.e. \(\mathbb{E}(Y)\).`TraitSD`

, specifying the population standard deviation of the continuous trait.

The R function `Compute_Size`

or
`Compute_Power`

will then compute quantities such as \(\sigma_\epsilon\) using these marginal
information automatically.

Alternatively, the user may also inputs the values of \(\sigma_\epsilon\) directly instead of
inputting the value of `TraitSD`

, by replacing
`TraitSD`

with

`ResidualSD`

, the value of \(\sigma_\epsilon\) in the model.

Now the computation can be proceeded by specifying
`response = "continuous"`

:

```
<- list(TraitMean = 3, TraitSD = 1, pG = 0.1, betaG = 0.1, betaE = 0.3, pE = 0.3, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8580452
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4270
```

Or:

```
<- list(TraitMean = 3, ResidualSD = 1, pG = 0.1, betaG = 0.1, betaE = 0.3, pE = 0.3, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8508388
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4360
```

Note that for continuous trait, the value of `TraitMean`

is only used to do parameter conversion, which will not affect the
result of power or sample size computation. So if the purpose is not to
compute the converted parameter such as \(\beta_0\), the value of
`TraitMean`

can be set to arbitrary numeric value, as shown
in the following example:

```
<- list(TraitMean = 3, ResidualSD = 1, pG = 0.1, betaG = 0.1, betaE = 0.3, pE = 0.3, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8508388
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4360
<- list(TraitMean = 30000, ResidualSD = 1, pG = 0.1, betaG = 0.1, betaE = 0.3, pE = 0.3, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8508388
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4360
```

Similarly, when the covariate \(E\)
and the SNP \(G\) are independent
(i.e. `gammaG = 0`

), power or sample size computation will
not depend on the covariate information at all:

```
<- list(TraitMean = 0, ResidualSD = 1, pG = 0.1, betaG = 0.1, betaE = 1000, pE = 0.5, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8508388
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4360
<- list(TraitMean = 0, ResidualSD = 1, pG = 0.1, betaG = 0.1, betaE = 0.001, pE = 0.1, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8508388
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4360
```

When there are multiple covariates (\(E_1\) and \(E_2\)) simultaneously affecting the trait \(Y\), we assume the following (glm) model: \[g_1[\mathbb{E}(Y|E_1,E_2,G)] = \beta_0 + \beta_GG + \beta_{E_1}E_1 + \beta_{E_2}E_2.\]

The dependency between \(E_1, E_2\) and \(G\) are specified through the following (nested) second stage regressions: \[\begin{equation} \begin{aligned} g_2[\mathbb{E}(E_1|G)] &= \gamma_{01} + \gamma_{G_1}G\\ g_3[\mathbb{E}(E_2|G,E_1)] &= \gamma_{02} + \gamma_{G_2}G + \gamma_E E_1.\\ \end{aligned} \end{equation}\]

The covariates and the trait could be either continuous or binary, where the link functions \(g_1,g_2\) and \(g_3\) correspond to linear and logistic regression respectively as described in previous sections.

For example when \(Y\) is binary, \(E_1\) is binary (with \(\beta_{E_1} = 0.3,\ \text{P}(E_1=1)=0.3\)) and \(E_2\) is continuous (with \(\beta_{E_2} = 0.2,\ \mathbb{E}(E_2) = 0, \ \text{Var}(E_2) = 0.4\)), the power can be computed as:

```
<- list(preva = 0.2, pG = 0.1, betaG = 0.1, betaE = c(0.3, 0.2),
para pE = 0.3, gammaG = c(0.15,0.25), gammaE = 0.1,
muE = 0, sigmaE = sqrt(0.4))
Compute_Power_multi(parameters = para, n = 3000, mode = "additive",
covariate = c("binary", "continuous"), response = "binary")
#> [1] 0.159002
```

In other words, this correspond to the models: \[\begin{equation} \begin{aligned} \log\bigg(\frac{\text{P}(Y=1|E_1,E_2,G)}{1-\text{P}(Y=1|E_1,E_2,G)}\bigg) &= \beta_0 + 0.1 G + 0.3 E_1 + 0.2 E_2 \\ \log\bigg(\frac{\text{P}(E_1=1|G)}{1-\text{P}(E_1=1|G)}\bigg) &= \gamma_{01} + 0.15 G\\ \mathbb{E}(E_2|G,E_1) &= \gamma_{02} + 0.25 G + 0.1 E_1,\\ \end{aligned} \end{equation}\] where all the remaining parameters \(\bigg(\beta_0, \gamma_{01}, \gamma_{02}, \text{Var}[E_2|G,E_1] \bigg)\) are automatically solved using the marginal information on \(Y\), \(E_1\), \(E_2\) and \(G\).

Note that when `gammaE`

is unspecified, by the default it
is assumed the two covariates are conditionally independent given \(G\).