`DynForest`

with
categorical outcome?`pbc2`

datasetWe use `DynForest`

on the `pbc2`

dataset (Murtaugh et al. 1994) to illustrate our
methodology. Data come from the clinical trial conducted by the Mayo
Clinic between 1974 and 1984. For the illustration, we consider a
subsample of the original dataset resulting to 312 patients and 7
predictors. Among these predictors, the level of serum bilirubin
(serBilir), aspartate aminotransferase (SGOT), albumin and alkaline were
measured at inclusion and during the follow-up leading to a total of
1945 observations. Sex, age and the drug treatment were collected at the
enrollment. During the follow-up, 140 patients died before
transplantation, 29 patients were transplanted and 143 patients were
alive. The time of first event (alive or any event) was considered as
the event time. We aim to predict in this illustration the death without
transplantation on patients suffering from primary billiary cholangitis
(PBC) using clinical and socio-demographic predictors, considering the
transplantation as a competing event.

For the illustration, we select patients still at risk at 4 years and
we recode the `event`

variable with `event`

= 1
for subjects died during between 4 years and 10 years,
`event`

= 0 otherwise. We split the subjects into two
datasets: (i) one dataset to train the random forest using \(2/3\) of patients; (ii) one dataset to
predict on the other \(1/3\) of
patients.

```
library("DynForest")
<- pbc2[which(pbc2$years>4&pbc2$time<=4),]
pbc2 $event <- ifelse(pbc2$event==2, 1, 0)
pbc2$event[which(pbc2$years>10)] <- 0
pbc2set.seed(1234)
<- unique(pbc2$id)
id <- sample(id, length(id)*2/3)
id_sample <- which(pbc2$id%in%id_sample)
id_row <- pbc2[id_row,]
pbc2_train <- pbc2[-id_row,] pbc2_pred
```

Then, we build the dataframe in the longitudinal format (i.e. one
observation per row) for the longitudinal predictors including:
`id`

the unique patient identifier; `time`

the
observed time measurements; `serBilir`

, `SGOT`

,
`albumin`

and `alkaline`

the longitudinal
predictors. We also build the dataframe with the time-fixed predictors
including: `id`

the unique patient identifier;
`age`

, `drug`

and `sex`

predictors
measured at enrollment. The nature of each predictor needs to be
properly defined with `as.factor()`

function for categorical
predictors (e.g. `drug`

and `sex`

).

```
<- pbc2_train[,c("id","time",
timeData_train "serBilir","SGOT",
"albumin","alkaline")]
<- unique(pbc2_train[,c("id","age","drug","sex")]) fixedData_train
```

The first step aims to build the random forest using the
`DynForest()`

function. We need to specify the mixed model of
each longitudinal predictor through a list containing the fixed and
random formula for the fixed effect and random effects of the mixed
models, respectively. To allow for a flexible trajectory over time,
splines can be used in formula using `splines`

package.

```
<- list(serBilir = list(fixed = serBilir ~ time,
timeVarModel random = ~ time),
SGOT = list(fixed = SGOT ~ time + I(time^2),
random = ~ time + I(time^2)),
albumin = list(fixed = albumin ~ time,
random = ~ time),
alkaline = list(fixed = alkaline ~ time,
random = ~ time))
```

Here, we assume a linear trajectory for `serBilir`

,
`albumin`

and `alkaline`

, and quadratic trajectory
for `SGOT.`

Using categorical outcome, we should specify
`type`

=“factor” and the dataframe in `Y`

should
contain only 2 columns, the variable identifier `id`

and the
outcome `death`

.

```
<- list(type = "factor",
Y Y = unique(pbc2_train[,c("id","event")]))
```

We executed `DynForest()`

function to build the random
forest with hyperparameters `mtry`

= 7 and
`nodesize`

= 2 as follows:

```
<- DynForest(timeData = timeData_train,
res_dyn fixedData = fixedData_train,
timeVar = "time", idVar = "id",
timeVarModel = timeVarModel,
mtry = 7, nodesize = 2,
Y = Y, seed = 1234)
```

With categorical outcome, the OOB prediction error is evaluated using
a missclassification criterion. This criterion can be computed with
`compute_OOBerror()`

function and we displayed the results of
the random forest using `summary()`

as below:

```
<- compute_OOBerror(DynForest_obj = res_dyn)
res_dyn_OOB summary(res_dyn_OOB)
DynForest executed with classification mode : Minimize weighted within-group Shannon entropy
Splitting rule-of-bag error type: Missclassification
Out: Majority vote
Leaf statistic----------------
Input : 150
Number of subjects: 4 predictor(s)
Longitudinal: 1 predictor(s)
Numeric: 2 predictor(s)
Factor----------------
Tuning parameters : 7
mtry: 2
nodesize: 200
ntree----------------
----------------
DynForest summary : 5.85
Average depth by tree: 16.7
Average number of leaves by tree: 9.31
Average number of subjects by leaf----------------
-of-bag error based on Missclassification
Out-of-bag error: 0.2333
Out----------------
Time to build the random forest 1.808804 mins
Time difference of ----------------
```

In this illustration, we built the random forest using 150 subjects
because we only kept the subjects still at risk at landmark time at 4
years. We have on average 9.3 subjects by leaf, and the average depth
level by tree is 5.8. We also predicted the wrong outcome for 23% of the
subjects. This criterion should be minimized as possible, by tuning
`mtry`

and `nodesize`

hyperparameters.

We then predict the probability of death on subjects still at risk at
landmark time at 4 years. In classification mode, the predictions are
performed using majority vote. The prediction over the trees is thus a
modality of the categorical outcome along with the proportion of the
trees which lead to this modality. Prediction are computed using
`predict()`

function, then a dataframe can be easily built
from returning object to get the prediction and probability for each
subject as followed:

```
<- pbc2_pred[,c("id","time",
timeData_pred "serBilir","SGOT",
"albumin","alkaline")]
<- unique(pbc2_pred[,c("id","age","drug","sex")])
fixedData_pred <- predict(object = res_dyn,
pred_dyn timeData = timeData_pred,
fixedData = fixedData_pred,
idVar = "id", timeVar = "time",
t0 = 4)
head(data.frame(pred = pred_dyn$pred_indiv,
proba = pred_dyn$pred_indiv_proba))
pred proba101 0 0.960
104 0 0.780
106 1 0.540
108 0 0.945
112 1 0.515
114 0 0.645
```

As shown in this example, some predictions are made with varying confidence from 51.5% for subject 112 to 96.0% for subject 101. We predict no event for subject 101 with a probability of 96.0% and an event for subject 106 with a probability of 54.0%.

The most predictive variables can be computed using
`compute_VIMP()`

and displayed using `plot()`

function as followed:

```
<- compute_VIMP(DynForest_obj = res_dyn_OOB, seed = 123)
res_dyn_VIMP plot(x = res_dyn_VIMP, PCT = TRUE)
```

Again, we found that the most predictive variable is
`serBilir`

for which the OOB prediction error was reduced by
23%.

The minimal depth is computed using `var_depth()`

function
and is displayed at predictor and feature level using
`plot()`

function. The results are displayed in figure 1
using the random forest with maximal `mtry`

hyperparameter
value (i.e. `mtry`

= 7) for better understanding.

```
<- var_depth(DynForest_obj = res_dyn_OOB)
depth_dyn plot(x = depth_dyn, plot_level = "predictor")
plot(x = depth_dyn, plot_level = "feature")
```

We observe that `serBilir`

and `albumin`

have
the lowest minimal depth: these predictors are used to split the
subjects in 199 and 194 out of 200 trees, respectively (figure 1A). The
figure 1B provides further results. In particular, this graph shows the
baseline random-effect (indicated by bi0) of `serBilir`

and
`albumin`

are the earliest predictors used to split the
subjects with 196 and 189 out of 200 trees, respectively.

Murtaugh, Paul A., E. Rolland Dickson, Gooitzen M. Van Dam, Michael
Malinchoc, Patricia M. Grambsch, Alice L. Langworthy, and Chris H. Gips.
1994. “Primary Biliary Cirrhosis: Prediction of
Short-Term Survival Based on Repeated Patient Visits.”
*Hepatology* 20 (1): 126–34. https://doi.org/10.1002/hep.1840200120.