`DynForest`

with
survival outcome?`pbc2`

datasetWe use `DynForest`

on the `pbc2`

dataset (Murtaugh et al. 1994) to illustrate our
methodology. Data come from the clinical trial conducted by the Mayo
Clinic between 1974 and 1984. For the illustration, we consider a
subsample of the original dataset resulting to 312 patients and 7
predictors. Among these predictors, the level of serum bilirubin
(serBilir), aspartate aminotransferase (SGOT), albumin and alkaline were
measured at inclusion and during the follow-up leading to a total of
1945 observations. Sex, age and the drug treatment were collected at the
enrollment. During the follow-up, 140 patients died before
transplantation, 29 patients were transplanted and 143 patients were
alive. The time of first event (alive or any event) was considered as
the event time. We aim to predict in this illustration the death without
transplantation on patients suffering from primary billiary cholangitis
(PBC) using clinical and socio-demographic predictors, considering the
transplantation as a competing event.

To begin, we load `DynForest`

package and
`pbc2`

data and we split the subjects into two datasets: (i)
one dataset to train the random forest using \(2/3\) of patients; (ii) one dataset to
predict on the other \(1/3\) of
patients.

```
library("DynForest")
set.seed(1234)
<- unique(pbc2$id)
id <- sample(id, length(id)*2/3)
id_sample <- which(pbc2$id%in%id_sample)
id_row <- pbc2[id_row,]
pbc2_train <- pbc2[-id_row,] pbc2_pred
```

Then, we build the dataframe in the longitudinal format (i.e. one
observation per row) for the longitudinal predictors including:
`id`

the unique patient identifier; `time`

the
observed time measurements; `serBilir`

, `SGOT`

,
`albumin`

and `alkaline`

the longitudinal
predictors. We also build the dataframe with the time-fixed predictors
including: `id`

the unique patient identifier;
`age`

, `drug`

and `sex`

predictors
measured at enrollment. The nature of each predictor needs to be
properly defined with `as.factor()`

function for categorical
predictors (e.g. `drug`

and `sex`

).

```
<- pbc2_train[,c("id","time",
timeData_train "serBilir","SGOT",
"albumin","alkaline")]
<- unique(pbc2_train[,c("id","age","drug","sex")]) fixedData_train
```

The first step aims to build the random forest using the
`DynForest()`

function. We need to specify the mixed model of
each longitudinal predictor through a list containing the fixed and
random formula for the fixed effect and random effects of the mixed
models, respectively. To allow for a flexible trajectory over time,
splines can be used in formula using `splines`

package.

```
<- list(serBilir = list(fixed = serBilir ~ time,
timeVarModel random = ~ time),
SGOT = list(fixed = SGOT ~ time + I(time^2),
random = ~ time + I(time^2)),
albumin = list(fixed = albumin ~ time,
random = ~ time),
alkaline = list(fixed = alkaline ~ time,
random = ~ time))
```

Here, we assume a linear trajectory for `serBilir`

,
`albumin`

and `alkaline`

, and quadratic trajectory
for `SGOT.`

For this illustration, we build the outcome
object containing a list with `type`

set to `surv`

(for survival data) and `Y`

contains a dataframe in wide
format (one subject per row) with: `id`

the unique patient
identifier; `years`

the time-to-event data;
`event`

the event indicator.

```
<- list(type = "surv",
Y Y = unique(pbc2[,c("id","years","event")]))
```

We build the random forest using `DynForest()`

function
with the following code:

```
<- DynForest(timeData = timeData_train,
res_dyn fixedData = fixedData_train,
timeVar = "time", idVar = "id",
timeVarModel = timeVarModel, Y = Y,
ntree = 200, mtry = 3, nodesize = 2, minsplit = 3,
cause = 2, seed = 1234)
```

In a survival context with multiple events, it is also necessary to
specify the event of interest with the argument `cause`

. We
thus fixed `cause`

= 2 to specify the event of interest
(i.e. the death event). For the hyperparameters, we arbitrarily chose
`mtry`

= 3, `nodesize`

= 2 and
`minsplit`

= 3.

Overall information about the random forest can be output with
`summary()`

function. The summary for the random forest in
our illustration is displayed below:

```
summary(res_dyn)
survival (competing risk) mode
DynForest executed with : Fine & Gray statistic test
Splitting rule-of-bag error type: Integrated Brier Score
Out: Cumulative incidence function
Leaf statistic----------------
Input : 208
Number of subjects: 4 predictor(s)
Longitudinal: 1 predictor(s)
Numeric: 2 predictor(s)
Factor----------------
Tuning parameters : 3
mtry: 2
nodesize: 3
minsplit: 200
ntree----------------
----------------
DynForest summary : 6.61
Average depth by tree: 28.01
Average number of leaves by tree: 4.71
Average number of subjects by leaf: 1.91
Average number of events of interest by leaf----------------
-of-bag error based on Integrated Brier Score
Out-of-bag error: Not computed!
Out----------------
Time to build the random forest 3.000176 mins
Time difference of ----------------
```

We executed `DynForest()`

function in survival mode with
competing events. In this mode, we use the Fine & Gray statistic
test for the splitting rule and the cumulative incidence function (CIF)
for the leaf statistic. To build the random forest, we included 208
subjects with 4 longitudinal (`Longitudinal`

), 1 continuous
(`Numeric`

) and 2 categorical (`Factor`

)
predictors. The `summary()`

function also returns some
statistics about the trees. For instance, we have on average 4.7
subjects and 1.9 death events by leaf. The number of subjects by leaf
should always be higher than `nodesize`

hyperparameter. OOB
error should be first computed using `compute_OOBerror()`

function (see section below) to be displayed on summary output.

To further investigate the process in each tree, the split details for the tree 1 can be output with the following code:

```
head(res_dyn$rf[,1]$V_split)
type id_node var_split feature threshold N Nevent depth1 Longitudinal 1 3 1 -0.21993804 129 49 1
2 Longitudinal 2 2 1 5.57866304 26 21 2
3 Numeric 3 1 NA 61.83057715 103 28 2
4 Longitudinal 4 2 3 1.42021938 18 13 3
5 Factor 5 1 NA NA 8 8 3
6 Longitudinal 6 3 2 -0.01010312 92 22 3
```

```
tail(res_dyn$rf[,1]$V_split)
type id_node var_split feature threshold N Nevent depth48 Leaf 192 NA NA NA 4 2 8
49 Leaf 193 NA NA NA 2 2 8
50 Leaf 194 NA NA NA 2 1 8
51 Longitudinal 195 4 1 -27.58024 4 3 8
52 Leaf 390 NA NA NA 2 1 9
53 Leaf 391 NA NA NA 2 2 9
```

For instance for the interpretation of the node split, the subjects
were split at node 1 (`id_node`

) using the first
random-effect (`feature`

= 1) of the third
`Longitudinal`

predictor (`var_split`

= 3) with
`threshold`

= -0.2199. The predictor name can be found using
the predictor number in `timeData`

and `fixedData`

datasets. Therefore, the subjects at node 1 with `albumin`

values below to -0.2199 drop in node 2, otherwise in node 3. Another
example with the leaves, 4 subjects are included in the 192, and among
them 2 subjects have the event of interest.

Estimated cumulative incidence function (CIF) for a single tree can
be displayed using `Y_pred`

element of `rf`

. For
instance, the CIF of the cause of interest for leaf 192 can be displayed
using the following code:

```
plot(res_dyn$rf[,1]$Y_pred[[192]]$`2`, type = "l", col = "red",
xlab = "Years", ylab = "CIF", ylim = c(0,1))
```

However, CIF computed on a single tree is not relevant. It should also be computed over all trees of the random forest. For a subject, estimated CIF over the random forest is obtained by averaging the tree-specific CIF of the leaf where the subject belongs. For instance, for subject 104, we display in figure 1 the tree-specific CIF for the 9 first trees where this subject is used to grow the trees. This figure shows how the estimated CIF can be differ across the trees and requires to be averaged as they are obtained on a few subjects from the leaves in which subject 104 is assigned.

The Out-Of-Bag error (OOB) aims at assessing the prediction abilities
of the random forest. It is computed using
`compute_OOBerror()`

function with a `DynForest`

class object as main argument, such as:

`<- compute_OOBerror(DynForest_obj = res_dyn) res_dyn_OOB `

`compute_OOBerror()`

returns the OOB error by individual
(`oob.err`

) and the overall OOB error for the random forest
can be obtained by averaging the OOB error.

```
mean(res_dyn_OOB$oob.err)
1] 0.1238627 [
```

In a survival context, the OOB error is evaluated using the
Integrated Brier Score (IBS) (Gerds and
Schumacher 2006). We obtain an IBS of 0.124 computed from time 0
to the maximum event time. The time range can be modified using
`IBS.min`

and `IBS.max`

arguments to define the
minimum and maximum, respectively. To maximize the prediction ability of
the random forest, we want to minimize the OOB error. This can be done
by tuning the hyperparameters. OOB error can also be displayed using
`summary()`

function applied to the returning object.

The software allows to predict the outcome for a new subject using the trained random forest. Dynamic predictions can be computed by fixing a prediction time (called landmark time) from which prediction is made using all the history of the individual up to this landmark time (including the longitudinal and time-fixed predictors).

For the illustration, we only select the subjects still at risk at
the landmark time of 4 years. We build the dataframe for those subjects
and we predict the individual specific CIF using `predict()`

function as follows:

```
<- unique(pbc2_pred$id[which(pbc2_pred$years>4)])
id_pred <- pbc2_pred[which(pbc2_pred$id%in%id_pred),]
pbc2_pred_tLM <- pbc2_pred_tLM[,c("id","time",
timeData_pred "serBilir","SGOT",
"albumin","alkaline")]
<- unique(pbc2_pred_tLM[,c("id","age","drug","sex")])
fixedData_pred <- predict(object = res_dyn,
pred_dyn timeData = timeData_pred,
fixedData = fixedData_pred,
idVar = "id", timeVar = "time",
t0 = 4)
```

`predict()`

function provides several elements. To get
more graphical results, the `plot_CIF()`

function can be used
to display the CIF of death before transplantation for given subjects.
For instance, we computed the CIF for subjects 102 and 260 with the
following code and displayed them on the figure 2.

```
plot_CIF(DynForestPred_obj = pred_dyn,
id = c(102, 260))
```

The first year after the landmark time (at 4 years), we observe a rapid increase of the risk of death for subject 260 compared to subject 102. We also notice that after 10 years from landmark time, subject 260 has a probability of death almost three times higher that the one of subject 102.

The main objective of the random forest is to predict an outcome. But
usually, we are interested to identify which predictors are the most
predictive. The VIMP statistic can be computed using
`compute_VIMP()`

function. This function returns the VIMP
statistic for each predictor with `Importance`

element. These
results can also be displayed using `plot()`

function in
percentage with `PCT`

argument set to `TRUE`

.

```
<- compute_VIMP(DynForest_obj = res_dyn, seed = 123)
res_dyn_VIMP plot(x = res_dyn_VIMP, PCT = TRUE)
```

The VIMP results are displayed in figure 3A. The most predictive
variables are `serBilir`

, `albumin`

and
`age`

with the largest VIMP percentage. Without
`serBilir`

, the OOB error of prediction was reduced by
28%.

In the case of correlated predictors, the predictors can be regrouped
into dimensions and the VIMP can be computed at the dimension group
level with the gVIMP statistic. Permutation is done for each variable of
the group simultaneously. The gVIMP is computed with the
`compute_gVIMP()`

function. This function has the
`group`

argument to define the group of predictors as a list.
For instance, with two groups of predictors (named group1 and group2),
the gVIMP statistic is computed using the following code:

```
<- list(group1 = c("serBilir","SGOT"),
group group2 = c("albumin","alkaline"))
<- compute_gVIMP(DynForest_obj = res_dyn,
res_dyn_gVIMP group = group, seed = 123)
plot(x = res_dyn_gVIMP, PCT = TRUE)
```

Similar to VIMP statistic, the gVIMP results can be displayed using
`plot()`

function. The figure 3B shows that group1 has the
highest gVIMP percentage with 32%.

To compute the gVIMP statistic, the groups can be defined regardless of the number of predictors. However, the comparison between the groups may be harder when group sizes are very different.

To go further into the understanding of the tree building process,
the `var_depth()`

function extracts useful information about
the average minimal depth by feature (`min_depth`

), the
minimal depth for each feature and each tree
(`var_node_depth`

), the number of times that the feature is
used for splitting for each feature and each tree
(`var_count`

).

Using an object from `var_depth()`

function,
`plot()`

function allows to plot the distribution of the
average minimal depth across the trees. `plot_level`

argument
defines how the average minimal depth is plotted, by predictor or
feature.

```
<- DynForest(timeData = timeData_train,
res_dyn_max fixedData = fixedData_train,
timeVar = "time", idVar = "id",
timeVarModel = timeVarModel, Y = Y,
ntree = 200, mtry = 7, nodesize = 2, minsplit = 3,
cause = 2, seed = 1234)
<- var_depth(DynForest_obj = res_dyn_max)
depth_dyn plot(x = depth_dyn, plot_level = "predictor")
plot(x = depth_dyn, plot_level = "feature")
```

The distribution of the minimal depth level is displayed in figure 4
by predictor and feature. Note that the minimal depth level should
always be interpreted with the number of trees where the
predictor/feature is found. Indeed, to accurately appreciate the
importance of a variable minimal depth, it has to be part of the
candidates at each node. This is why we strongly advice to compute the
minimal depth on random forest with `mtry`

hyperparameter
chosen at its maximum.

In our example, we ran a random forest with `mtry`

hyperparameter set to its maximum (i.e. `mtry`

= 7) and we
computed the minimal depth on this random forest. We observe that
`serBilir`

, `albumin`

and `age`

have
the lowest minimal depth, indicating these predictors are used to split
the subjects at early stage in 200 out of 200 trees, i.e 100% (figure
4A). The minimal depth level by feature (figure 4B) provides more
advanced details about the tree building process. For instance, we can
see that the random-effects for `serBilir`

(indicating by bi0
and bi1 on the graph) are the earliest features used on 198 and 194 out
of 200 trees, respectively.

The predictive performance of the random forest strongly depends on
the hyperparameters `mtry`

, `nodesize`

and
`minsplit`

, and should therefore be chosen thoroughly.
`nodesize`

and `minsplit`

hyperparameters control
the tree depth. The trees need to be deep enough to ensure that the
predictions are accurate. By default in `DynForest()`

function, we fixed `nodesize`

= 1 and `minsplit`

=
2, being the minimum. However, with a large number of individuals, the
tree depth could be slightly decreased by increasing these
hyperparameters to reduce the computation time.

`mtry`

hyperparameter defines the number of predictors
randomly drawn at each node. By default, we chose `mtry`

equal to the square root of the number of predictors as usually
recommended (Bernard, Heutte, and Adam
2009). However, this hyperparameter should be carefully tuned
with possible values between 1 and the number of predictors. Indeed, the
predictive performance of the random forest is highly related to this
hyperparameter.

In the illustration, we tuned `mtry`

for every possible
values (1 to 7). The figure 5 displays the evolution of the OOB error
according to `mtry`

hyperparameter.

We can see on this figure large OOB error difference according to
`mtry`

hyperparameter. In particular, we observe the worst
predictive performance for lower values, then similar results with
values from 5 to 7. The optimal value (i.e. with the lowest OOB error)
was found with `mtry`

= 7. This graph reflects how it is
crucial to carefully tune this hyperparameter.

Bernard, Simon, Laurent Heutte, and Sébastien Adam. 2009.
“Influence of Hyperparameters on Random Forest Accuracy.”
In *International Workshop on Multiple Classifier Systems*,
171–80. Springer.

Gerds, Thomas A., and Martin Schumacher. 2006. “Consistent
Estimation of the Expected Brier
Score in General Survival
Models with Right-Censored
Event Times.” *Biometrical
Journal* 48 (6): 1029–40. https://doi.org/10.1002/bimj.200610301.

Murtaugh, Paul A., E. Rolland Dickson, Gooitzen M. Van Dam, Michael
Malinchoc, Patricia M. Grambsch, Alice L. Langworthy, and Chris H. Gips.
1994. “Primary Biliary Cirrhosis: Prediction of
Short-Term Survival Based on Repeated Patient Visits.”
*Hepatology* 20 (1): 126–34. https://doi.org/10.1002/hep.1840200120.