Example use of the OSCAR package

Teemu Daniel Laajala

2023-10-02

library(oscar)
## If you use oscar-methodology, please consider citing the following paper:
## Halkola AS, Joki K, Mirtti T, Mäkelä MM, Aittokallio T, Laajala TD (2023).
## OSCAR: Optimal subset cardinality regression using the L0-pseudonorm with applications to prognostic modelling of prostate cancer.
## PLoS Comput Biol 19(3): e1010333.
## https://doi.org/10.1371/journal.pcbi.1010333

OSCAR modelling

OSCAR (Optimal Subset CArdinality Regression) is an R-package delivering L0-quasinorm penalized regression models for Cox, logistic and gaussian model families. It comes with two different optimizers for the computationally demanding optimization task (DBDC and LMBM), and these have been implemented in Fortran. The R package handles all interaction with Fortran and provides many convenience and visualization functions to assist with L0-penalized regression modelling. This includes for example cross-validation, bootstrapping, and other diagnostics used to inspect an OSCAR model and assessing importance of features.

Please see Halkola et al. [1] for citing OSCAR. The primary solver for the DC-decomposition transforming the L0-pseudonorm optimization problem from discrete to continuous non-convex space is done via DBDC (‘Double Bundle method for nonsmooth DC optimization’), which is described in detail on Joki et al. [2], although an alternate optimization method for larger problems is offered via LMBM (‘Limited Memory Bundle Method for large-scale nonsmooth optimization’) as described in Haarala et al. [3].

Cox regression (simulated TYKS real-world PCa data)

A simulated real-world hospital cohort data with patient survival with prostate cancer (PCa) is provided with:

data(ex)
ex_X[1:7,1:7]
##             BMI HEIGHTBL WEIGHTBL SYSTOLICBP DIASTOLICBP PULSE   HB
## MEDISIMU1 29.39      175       90        142          77    68 10.9
## MEDISIMU2 19.37      176       60        107          77    58 13.3
## MEDISIMU3 23.88      165       65        142          77    71 11.8
## MEDISIMU4 34.54      176      107        126          90    71 13.1
## MEDISIMU5 20.65      188       73        142          77    71 15.3
## MEDISIMU6 28.41      174       86        188          77    88 12.8
## MEDISIMU7 29.63      180       96        142          77    81 14.1
head(ex_Y)
##           Time Event
## MEDISIMU1   89     0
## MEDISIMU2  754     1
## MEDISIMU3  783     1
## MEDISIMU4  159     0
## MEDISIMU5 1322     0
## MEDISIMU6  200     1

For further details regarding this dataset and its generation, see reference Laajala et at. [4]

One key feature in OSCAR in addition to its L0 norm penalization is its capability to handle variables as groups, or ‘kits’. This grouping of variables causes them to be selected (or omitted) together. In many clinical applications this is often reality; for example, curating initial data for variables together can occur in cases such as:

In the example data from Turku University Hospital (TYKS), a series of clinical variables are run together. For example a standard blood panel includes red blood cell count, hematocrite, white blood cell count, etc. Additionally, single variables can be run separately; a very common single marker for PCa is Prostate-Specific Antigen (PSA).

The kit structure for such variables that can be obtained together are modelled using a kit indicator matrix:

ex_K[1:7,1:7]
##            BMI HEIGHTBL WEIGHTBL SYSTOLICBP DIASTOLICBP PULSE HB
## BaseReport   1        1        1          1           1     1  0
## B-PVKT       0        0        0          0           0     0  1
## TKD          0        0        0          0           0     0  0
## cB-Het-Ion   0        0        0          0           0     0  0
## Pt-GFReEPI   0        0        0          0           0     0  0
## P-LD         0        0        0          0           0     0  0
## P-ASAT       0        0        0          0           0     0  0
apply(ex_K, MARGIN=1, FUN=sum) # Row indicator sums
## BaseReport     B-PVKT        TKD cB-Het-Ion Pt-GFReEPI       P-LD     P-ASAT 
##          6          5          8          1          1          1          1 
##      P-Alb     P-AFOS      P-PSA     P-Krea Pt-Krea-Cl     B-Lymf       P-Pi 
##          1          1          1          1          1          1          1 
##       P-Ca     P-ALAT    S -Prot      B-Eos    S-Testo     P-Gluk       P-Mg 
##          1          1          1          1          1          1          1 
##      P-Bil 
##          1
apply(ex_K, MARGIN=2, FUN=sum) # Column indicator sums
##         BMI    HEIGHTBL    WEIGHTBL  SYSTOLICBP DIASTOLICBP       PULSE 
##           1           1           1           1           1           1 
##          HB       HEMAT         RBC         WBC         PLT   LYMperLEU 
##           1           1           1           1           1           1 
##  MONOperLEU         NEU         POT        MONO  BASOperLEU   EOSperLEU 
##           1           1           1           1           1           1 
##   NEUperLEU         NA.        CCRC         LDH         AST         ALB 
##           1           1           1           1           1           1 
##         ALP         PSA       CREAT      CREACL         LYM        PHOS 
##           1           1           1           1           1           1 
##          CA         ALT        TPRO         EOS       TESTO         GLU 
##           1           1           1           1           1           1 
##          MG       TBILI 
##           1           1

If the kit matrix is a diagonal indicator matrix with dimensions equal to the number of variables, each variable is L0-penalized without any grouping structure.

Further, these kits can be assigned a ‘price’. This may be a price literally, or a descriptive value indicating how hard the variable is to obtain. With this information, a clinically feasible pareto-front can be constructed, that simultaneously captures modelling generalization as well as clinical applicability.

A cost vector can be provided for the fitting procedure, and a standardized arbitrary unit cost vector is provided in:

head(ex_c)
## BaseReport     B-PVKT        TKD cB-Het-Ion Pt-GFReEPI       P-LD 
##        0.0        2.4        4.8        5.7        1.0        1.1

OSCAR comes with two different optimizers:

These are provided to the fitting procedure with solver=1 and solver=2 or with their abbreviations. They have been implemented in Fortran for computational efficiency.

A quick example run of OSCAR with LMBM together with the default kit structure for Cox regression:

set.seed(1)
fit <- oscar(x=ex_X, y=ex_Y, k=ex_K, w=ex_c, family="cox", solver="LMBM")
fit

A pareto-optimal front provides information on potential optimal saddle points, where model generalization and clinical applicability pair nicely:

oscar.pareto.visu(fit=fit)
Pareto front for the clinical measurements, their total cost and C-index.

Pareto front for the clinical measurements, their total cost and C-index.

Goodness of fit vs. model coefficient cost

Model convergence in target function is important, while an another axis can be plotted simultaneously to provide supporting information:

oscar.visu(fit, y=c("target", "cost"))
Target function value as a function of k.

Target function value as a function of k.

Cross-validation and bootstrapping

Fit model object can be inspected based on visual diagnostics such as cross-validation and bootstrapping:

# Perform 5-fold cross-validation to find out optimal k
cv <- oscar.cv(fit, fold=5, seed=123)
# Visualize model generalization performance as a function of k
oscar.cv.visu(cv)
5-fold cross-validation for optimally generalizable cardinality 'k'

5-fold cross-validation for optimally generalizable cardinality ‘k’

It appears that k-values around 5 or above present a shoulder-point saturate the model, with no further coefficients contributing to the model’s generalization capability.

Further, there is some uncertainty as to what order the coefficients should be non-zero, based on bootstrapping:

# Bootstrap original data 20 times (sampling with replacement and refitting)
bs <- oscar.bs(fit, bootstrap=20, seed=234)
# Visualize bootstrapped models
oscar.bs.plot(fit=fit, bs=bs, nbins=20)
Selected variables over a set of bootstrap runs

Selected variables over a set of bootstrap runs

Based on the bootstrapping, PSA is a clear winner. Close second candidate for survival prediction is Alkaline Phosphatase (ALP). After these single markers the bootstrapping was quite prone to choosing separate kits between immunomarkers and standard bloodwork, with no clear consistent winner.

Further useful commands

Useful commands for understanding the model and the oscar S4-object:

coef(fit, k=3) # All potential coefficients at cardinality k=3
##         BMI    HEIGHTBL    WEIGHTBL  SYSTOLICBP DIASTOLICBP       PULSE 
##  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 
##          HB       HEMAT         RBC         WBC         PLT   LYMperLEU 
## -0.17302102 -5.04043911 -0.07065200  0.53173174 -0.00119069  0.00000000 
##  MONOperLEU         NEU         POT        MONO  BASOperLEU   EOSperLEU 
##  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 
##   NEUperLEU         NA.        CCRC         LDH         AST         ALB 
##  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 
##         ALP         PSA       CREAT      CREACL         LYM        PHOS 
##  0.66640203  0.40782914  0.00000000  0.00000000  0.00000000  0.00000000 
##          CA         ALT        TPRO         EOS       TESTO         GLU 
##  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 
##          MG       TBILI 
##  0.00000000  0.00000000
feat(fit, k=3) # All features chosen at cardinality k=3
##          HB       HEMAT         RBC         WBC         PLT         ALP 
## -0.17302102 -5.04043911 -0.07065200  0.53173174 -0.00119069  0.66640203 
##         PSA 
##  0.40782914
cost(fit, k=3) # Kit sum costs at various k cardinalities, here with cardinality k=3
## [1] 9.1

Other model families

While oscar-package was originally designed with biomedical survival data in mind, it naturally supports modelling of non-survival data. Currently, the family parameter supports logistic regression for binary outcomes, and mse/gaussian/normal for modelling responses that are normally distributed.

Gaussian regression (swiss-data)

# Use example swiss-data for quickness
data(swiss)
set.seed(2)
fit_swiss <- oscar(x=swiss[,-1], y=swiss[,1], family="mse", print=0, solver=2)

# Plot model coefficients as a function of cardinality k
plot(fit_swiss) 
Model coefficients as a function of cardinality 'k'.

Model coefficients as a function of cardinality ‘k’.

Similarly bootstrapping and cross-validation:

# Bootstrap original data 50 times (sampling with replacement and refitting)
bs_swiss <- oscar.bs(fit_swiss, bootstrap=50, seed=234)
# Visualize trajectories of bootstrapped coefficients
oscar.bs.plot(fit=fit_swiss, bs=bs_swiss, nbins=50)
Bootstrapping of Swiss fertility data (Gaussian/MSE)

Bootstrapping of Swiss fertility data (Gaussian/MSE)

# Perform 5-fold cross-validation to find out optimal k
cv_swiss <- oscar.cv(fit_swiss, fold=10, seed=345)
# Visualize model generalization performance as a function of k
oscar.cv.visu(cv_swiss)
10-fold cross-validation for optimally generalizable cardinality 'k'

10-fold cross-validation for optimally generalizable cardinality ‘k’

References

  1. Halkola AS, Joki K, Mirtti T, M{\"a}kel{\"a} MM, Aittokallio T, Laajala TD (2023) OSCAR: Optimal subset cardinality regression using the L0-pseudonorm with applications to prognostic modelling of prostate cancer. PLoS Comput Biol 19(3): e1010333. https://doi.org/10.1371/journal.pcbi.1010333

  2. Joki K, Bagirov AM, Karmitsa N, Makela MM, Taheri S. Double Bundle Method for finding Clarke Stationary Points in Nonsmooth DC Programming. SIAM Journal on Optimization. 2018;28(2):1892-1919.

  3. Haarala M, Miettinen K, Makela MM. New limited memory bundle method for large-scale nonsmooth optimization. Optimization Methods and Software. 2004;19(6):673-692.

  4. Laajala TD, Murtoj{\"a}rvi M, Virkki A, Aittokallio T. ePCR: an R-package for survival and time-to-event prediction in advanced prostate cancer, applied to real-world patient cohorts. Bioinformatics. 2018;34(22):3957-3959.

Session info

sessionInfo()
## R Under development (unstable) (2023-09-30 r85239 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=C                     LC_CTYPE=English_Finland.utf8   
## [3] LC_MONETARY=English_Finland.utf8 LC_NUMERIC=C                    
## [5] LC_TIME=English_Finland.utf8    
## 
## time zone: Europe/Helsinki
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] oscar_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] pROC_1.18.4     digest_0.6.33   R6_2.5.1        fastmap_1.1.1  
##  [5] Matrix_1.6-1.1  xfun_0.39       lattice_0.21-8  splines_4.4.0  
##  [9] cachem_1.0.8    knitr_1.44      htmltools_0.5.5 rmarkdown_2.23 
## [13] cli_3.6.1       hamlet_0.9.6    grid_4.4.0      sass_0.4.7     
## [17] jquerylib_0.1.4 compiler_4.4.0  plyr_1.8.8      tools_4.4.0    
## [21] evaluate_0.21   bslib_0.5.0     survival_3.5-7  Rcpp_1.0.11    
## [25] yaml_2.3.7      rlang_1.1.1     jsonlite_1.8.7