Sliced inverse regression

Denote the \(p\)-variate predictors \(x_i\), \(i=1,…,n\) with corresponding responses \(y_i\). The predictors are assumed to follow the model \[ x_i = A s_i + b, \] where \(A\) is a non-singular \(p \times p\) matrix, \(b\) a \(p\)-vector and the random vector \(s\) can be decomposed into \((s_1^T,s_2^T)^T\) with \(E(s)=0\) and \(Cov(s)=I_p\) where \(s_1\) has dimension \(k\) and \(s_2\) dimension \(p-k\). It is then assumed that \(s_1\) is the signal part, the interesting components, which are relevant to model \(y\), whereas \(s_2\) is the noise part.

The working model assumption is then that \[ (y,s_1^T)^T \ is \ independent \ of \ s_2. \] Defining \(S_1=E((x-b)(x-b)^T)\) and \(S_2=E(E(x-b|y)E(x-b|y)^T)\) the sliced inverse regression (SIR) can be interpreted as finding the transformation matrix \(W\) such that \[ W S_1 W^T = I_p \ and \ W S_2 W^T = D, \] where \(D\) is a diagonal matrix with diagonal elements \(d_1 \geq d_2 \geq … \geq d_k > d_{k+1} = … = d_p = 0\).

In practice \(S_2\) is estimated by approximating \(E(x-b|y)\) by dividing \(y\) into \(h\) slices where in this package \(y\) is divided into \(h\) intervals containing an equal number of observations.

The practical problem is to decide then the value \(k\).

Asymptotic test a specific value of \(k\)

For an asymptotic test using the test statistic \[ T= n\sum_{i=k+1}^p d_i, \] the limiting distribution under the null is then \(T \sim \chi^2_{(p-k)(h-k-1)}\). Therefore for the hypothetical value \(k\) and the number of slices \(h\) is required that \(k \geq h-1\).

Bootstrapping test a specific value of \(k\)

Bootstrap tests can be constructed as follows

  1. Compute \(\hat s_i = \hat W(x_i - \bar x_i)\), and decompose into \(\hat s_{i,1}\) and \(\hat s_{i,2}\).
  2. Sample a bootstrap sample \((y_i^*, \hat s_{i,1}^*)\) of size \(n\) from \((y_i, \hat s_{i,1})\).
  3. Sample a bootstrap sample \(\hat s_{i,2}^*\) of size \(n\) from \(\hat s_{i,2}\).
  4. Compute \(x_i^* = \hat W^{-1} ((\hat s_{i,1}^*)^T,(\hat s_{i,2}^*)^T)^T\).
  5. Recompute the test statistic \(T^*\) based on the bootstrap data \(x_i^*\), \(i=,1,…,n\).
  6. Repeat the steps above \(m\) times to obtain bootstrap test statistics \(T^*_1,…,T^*_m\).

The p-value is then \[ \frac{\#(T_i^* \geq T)+1}{m+1}. \]


Some simulated data with true \(k=2\):

n <- 200
p <- 10
X <- matrix(rnorm(p*n), ncol = p)
eps <- rnorm(n, sd=0.25)
y <- X[, 1]/ (0.5+(X[, 2]+1.5)^2)

plot of chunk unnamed-chunk-1

First performing the asymptotic test

SIRasympk2 <- SIRasymp(X,y,2)
##  SIR test for subspace dimension
## data:  X
## T = 65.121, df = 56, p-value = 0.1891
## alternative hypothesis: the last 8 eigenvalues are not zero

plot of chunk unnamed-chunk-2

Then the bootstrap test

SIRbootk2 <- SIRboot(X,y,2)
##  SIR bootstrapping test for subspace dimension
## data:  X
## T = 0.32561, replications = 200, p-value = 0.2388
## alternative hypothesis: the last 8 eigenvalues are not zero

Looking at the first two components and their relationship with the response

pairs(cbind(y, components(SIRbootk2, which="k")))

plot of chunk unnamed-chunk-4