This article covers core features of the `aorsf` package.

The oblique random survival forest (ORSF) is an extension of the axis-based random survival forest (RSF) algorithm.

The purpose of `aorsf` (‘a’ is short for accelerated) is
to provide routines to fit ORSFs that will scale adequately to large
data sets. The fastest algorithm available in the package is the
accelerated ORSF model, which is the default method used by `orsf()`:

```
library(aorsf)
set.seed(329)
orsf_fit <- orsf(data = pbc_orsf,
                 formula = Surv(time, status) ~ . - id)

orsf_fit
#> ---------- Oblique random survival forest
#>
#> Linear combinations: Accelerated
#> N observations: 276
#> N events: 111
#> N trees: 500
#> N predictors total: 17
#> N predictors per node: 5
#> Average leaves per tree: 25
#> Min observations in leaf: 5
#> Min events in leaf: 1
#> OOB stat value: 0.84
#> OOB stat type: Harrell's C-statistic
#> Variable importance: anova
#>
#> -----------------------------------------
```

You may notice that the first input of `orsf()` is `data`. This is a design
choice that makes it easier to use `orsf()` with pipes (i.e., `%>%` or `|>`).
For instance,

```
library(dplyr)
orsf_fit <- pbc_orsf |>
  select(-id) |>
  orsf(formula = Surv(time, status) ~ .)
```

`aorsf` includes several functions dedicated to
interpretation of ORSFs, both through estimation of partial dependence
and variable importance.

`aorsf` provides multiple ways to compute variable importance.

To compute negation importance for a variable, ORSF multiplies each coefficient of that variable by -1 and then re-computes the out-of-sample (sometimes referred to as out-of-bag) accuracy of the ORSF model. The larger the resulting drop in accuracy, the more important the variable.

    orsf_vi_negate(orsf_fit)
    #>         bili          age       copper      protime      albumin          ast
    #>  0.076370077  0.027401542  0.025057304  0.013023547  0.010002084  0.006355491
    #>          sex         chol      ascites      spiders     platelet        edema
    #>  0.006042926  0.005782455  0.004584288  0.004323817  0.002396333  0.001638486
    #>       hepato        stage          trt         trig
    #>  0.000573036 -0.001041884 -0.002031673 -0.004167535
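The idea behind negation can be illustrated outside of a forest. The sketch below is not how `aorsf` implements it: a single linear combination from a Cox model (fit with the `survival` package) stands in for the many linear combinations inside an ORSF, the three predictors are chosen arbitrarily, and accuracy is computed in-sample rather than out-of-bag. The helper `c_stat()` is a hypothetical name introduced for this example.

```r
library(survival)
library(aorsf) # used here only for the pbc_orsf data

# One linear combination of three predictors (an assumption for illustration)
vars <- c("bili", "age", "albumin")
cox_fit <- coxph(Surv(time, status) ~ bili + age + albumin, data = pbc_orsf)
x <- as.matrix(pbc_orsf[, vars])
beta <- coef(cox_fit)[vars]

# Harrell's C-statistic for a risk score (higher score = higher risk)
c_stat <- function(score) {
  concordance(Surv(time, status) ~ score,
              data = pbc_orsf,
              reverse = TRUE)$concordance
}

c_full <- c_stat(drop(x %*% beta))

# Flip the sign of one coefficient at a time and record the loss in accuracy
negation_vi <- sapply(vars, function(v) {
  beta_neg <- beta
  beta_neg[v] <- -beta_neg[v]
  c_full - c_stat(drop(x %*% beta_neg))
})

sort(negation_vi, decreasing = TRUE)
```

A variable whose coefficient contributes little to the linear combination changes the risk score only slightly when negated, so its importance stays near zero.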

You can also compute variable importance using permutation, a more classical approach.

    orsf_vi_permute(orsf_fit)
    #>          bili           age       protime       albumin          chol
    #>  0.0107834966  0.0090122942  0.0058345489  0.0033861221  0.0026568035
    #>        copper      platelet       ascites         edema           ast
    #>  0.0025005209  0.0019274849  0.0018753907  0.0013494875  0.0011981663
    #>       spiders         stage           sex        hepato           trt
    #>  0.0011460721  0.0007293186  0.0000000000 -0.0003646593 -0.0006251302
    #>          trig
    #> -0.0020316733

A faster alternative to permutation and negation importance is ANOVA importance, which computes the proportion of times each variable obtains a low p-value (p < 0.01) while the forest is grown.

    orsf_vi_anova(orsf_fit)
    #>    ascites       bili      edema     copper    albumin        age    protime
    #> 0.38326586 0.27203454 0.23833229 0.20161087 0.17501252 0.16944099 0.15411128
    #>       chol      stage        ast    spiders     hepato        sex   alk.phos
    #> 0.14196607 0.13971368 0.12955466 0.12152358 0.11651962 0.11271975 0.09598604
    #>       trig   platelet        trt
    #> 0.09560853 0.07760141 0.07129405
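To see where the three measures agree or differ, the sketch below places the top-ranked variables from each method side by side. It assumes that each function returns a named vector sorted from most to least important, as the printed results in this article suggest. The smaller forest (`n_tree = 100`) and the object names are choices made only to keep the example quick.

```r
library(aorsf)

set.seed(329)

# A smaller forest than the default, purely to keep the example fast
fit <- orsf(data = pbc_orsf,
            formula = Surv(time, status) ~ . - id,
            n_tree = 100)

vi_negate  <- orsf_vi_negate(fit)
vi_permute <- orsf_vi_permute(fit)
vi_anova   <- orsf_vi_anova(fit)

# Names of the five highest-ranked variables under each method
top5 <- data.frame(
  negate  = head(names(vi_negate), 5),
  permute = head(names(vi_permute), 5),
  anova   = head(names(vi_anova), 5)
)

top5
```

Rankings from a small forest can shift between seeds, so differences near the bottom of each column should not be over-interpreted.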

Partial dependence (PD) shows the expected prediction from a model as a function of a single predictor or multiple predictors. The expectation is marginalized over the values of all other predictors, giving something like a multivariable adjusted estimate of the model’s prediction.
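As a minimal sketch of this marginalization, the code below computes PD for `bili` by hand: each candidate value is assigned to every row of the data, and the predicted risks are averaged. The forest size (`n_tree = 50`), the grid `bili_values`, and the prediction horizon of 1000 days are assumptions chosen for illustration. Because the training data are re-used for prediction, this is an in-bag estimate, and it is not the package's own PD routine.

```r
library(aorsf)

set.seed(329)

fit <- orsf(data = pbc_orsf,
            formula = Surv(time, status) ~ . - id,
            n_tree = 50)

# Values of bili at which to evaluate partial dependence (illustrative grid)
bili_values <- c(1, 2, 3, 4, 5)

pd_manual <- sapply(bili_values, function(b) {
  data_b <- pbc_orsf
  data_b$bili <- b   # set bili to the same value for everyone
  # average predicted risk over the observed values of all other predictors
  mean(predict(fit, new_data = data_b, pred_horizon = 1000))
})

data.frame(bili = bili_values, mean_risk = pd_manual)
```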

For more on PD, see the vignette

Unlike partial dependence, which shows the expected prediction as a function of one or multiple predictors, individual conditional expectations (ICE) show the prediction for an individual observation as a function of a predictor.
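The contrast with PD can be sketched by hand: instead of averaging over observations, each observation keeps its own curve. The example below assumes a small forest (`n_tree = 50`), three arbitrarily chosen observations, an illustrative grid of `bili` values, and a prediction horizon of 1000 days. It relies only on `predict()` and is not the package's own ICE routine.

```r
library(aorsf)

set.seed(329)

fit <- orsf(data = pbc_orsf,
            formula = Surv(time, status) ~ . - id,
            n_tree = 50)

bili_values <- c(1, 2, 3, 4, 5)

# Three observations whose individual curves we want to inspect
obs <- pbc_orsf[1:3, ]

# One column per bili value, one row per observation
ice <- sapply(bili_values, function(b) {
  data_b <- obs
  data_b$bili <- b
  as.numeric(predict(fit, new_data = data_b, pred_horizon = 1000))
})

dimnames(ice) <- list(paste0("obs_", 1:3), paste0("bili_", bili_values))

ice

# Averaging the rows gives a PD-style curve for these three observations
colMeans(ice)
```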

For more on ICE, see the vignette

The original ORSF (i.e., `obliqueRSF`) used `glmnet` to find linear
combinations of inputs. `aorsf` allows users to implement this approach
using the `orsf_control_net()` function:

```
orsf_net <- orsf(data = pbc_orsf,
                 formula = Surv(time, status) ~ . - id,
                 control = orsf_control_net(),
                 n_tree = 50)
```

`net` forests fit a lot faster than the original ORSF
function in `obliqueRSF`. However, `net` forests
are still much slower than `cph` ones:

```
# tracking how long it takes to fit 50 glmnet trees
print(
  t1 <- system.time(
    orsf(data = pbc_orsf,
         formula = Surv(time, status) ~ . - id,
         control = orsf_control_net(),
         n_tree = 50)
  )
)
#>    user  system elapsed
#>    1.30    0.02    1.31

# and how long it takes to fit 50 cph trees
print(
  t2 <- system.time(
    orsf(data = pbc_orsf,
         formula = Surv(time, status) ~ . - id,
         control = orsf_control_cph(),
         n_tree = 50)
  )
)
#>    user  system elapsed
#>    0.05    0.00    0.04

# ratio of elapsed times (glmnet / cph)
t1['elapsed'] / t2['elapsed']
#> elapsed
#>   32.75
```

The unique feature of `aorsf` is its fast algorithms to
fit ORSF ensembles. `RLT` and `obliqueRSF` both
fit oblique random survival forests, but `aorsf` does so
faster. `ranger` and `randomForestSRC` fit
survival forests, but neither package supports oblique splitting.
`obliqueRF` fits oblique random forests for classification
and regression, but not survival. `PPforest` fits oblique
random forests for classification but not survival.

Note: The default prediction behavior for `aorsf` models
is to produce predicted risk at a specific prediction horizon, which is
not the default for `ranger` or `randomForestSRC`.
I think this will change in the future, as computing time-independent
predictions with `aorsf` could be helpful.
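A short sketch of horizon-specific prediction follows. The forest size, the five rows used as new data, and the horizons of 500 and 1000 days are assumptions for illustration. `pred_type = "surv"` requests survival probability instead of the default risk.

```r
library(aorsf)

set.seed(329)

fit <- orsf(data = pbc_orsf,
            formula = Surv(time, status) ~ . - id,
            n_tree = 50)

# Treat the first five rows as if they were new observations
new_obs <- pbc_orsf[1:5, ]

# Predicted risk (the default) at two horizons: one column per horizon
risk <- predict(fit,
                new_data = new_obs,
                pred_horizon = c(500, 1000))

# Predicted survival probability at the same horizons
surv <- predict(fit,
                new_data = new_obs,
                pred_horizon = c(500, 1000),
                pred_type = "surv")

risk
surv
```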