Introduction

The distcomp package allows one to perform computations on data that is row-partitioned among several sites. Implemented methods include stratified Cox regression and singular value decomposition (SVD). The distributed computations rely on a REST API for R, such as that provided by OpenCPU.

In order to prototype new methods for distcomp, it is convenient to write code that does all computations locally, at least during the development phase. This allows for easy debugging and code changes; moving to a REST call is then merely one step away. For example, in the context of distributed multiple imputation, the distributed SVD is a building block, so one has to be able to run the SVD operation not only in a modular fashion, but also quickly enough to test code.

Previous versions of distcomp assumed all computations were done using opencpu remote calls. This meant that the results were time-consuming to check; for reasonably sized problems one would have to wait for hours.

Version 1.1 of distcomp addresses this by allowing local computations, in essence faking the opencpu calls. All calculations are done locally at normal speed (rather than via HTTPS POST requests) while ensuring that the sites are only privy to their own data.

An Example

library(distcomp)
available <- availableComputations()
( computationNames <- names(available) )
## [1] "QueryCount"         "StratifiedCoxModel" "RankKSVD"

We will perform an SVD of a 1500 by 20 matrix that is distributed across three sites, each holding a 500 by 20 block of rows. We will extract the first 20 components (here, all of them) using this distributed algorithm.

We first define the computation.

svdDef <- data.frame(compType = "RankKSVD",
                     rank = 20L,
                     ncol = 20L,
                     id = "SVD",
                     stringsAsFactors = FALSE)

We generate some synthetic data.

set.seed(12345)
## Three sites, each holding a 500 by 20 block of the full matrix
nSites <- 3
siteData <- lapply(seq.int(nSites), function(i) matrix(rnorm(10000), ncol = 20))

We now create local objects to represent the sites: a list of three items, each with a name and a worker constructed from that site's data.

sites <- lapply(seq.int(nSites),
                function(x) list(name = paste0("site", x),
                                 worker = makeWorker(defn = svdDef, data = siteData[[x]])
                                 ))

We are now ready to create a master object and add sites to it.

master <- makeMaster(svdDef)
for (site in sites) {
  master$addSite(name = site$name, worker = site$worker)
}

When a site specifies a worker argument in the call to addSite, it indicates that the computation is to be performed locally. Otherwise, a URL has to be specified using the url argument to addSite. (Exactly one of worker or url can be specified.)
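For illustration, registering a remote site would look something like the following; the site name and OpenCPU URL below are hypothetical.

master$addSite(name = "site4",
               url = "https://site4.example.org/ocpu")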

Finally, we are ready to run the decomposition, specifying a maximum number of iterations so that things don’t go haywire. This takes about 10 seconds on my laptop.

system.time(result <- master$run(max.iter = 10000))
##    user  system elapsed 
##   9.681  11.376  10.395

The result is a named list with two components: the \(d\) diagonal values and the \(v\) matrix.
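A quick sanity check of the two components, which matches the definition above where rank and ncol are both 20:

length(result$d)  ## 20 singular values
dim(result$v)     ## 20 by 20 matrix of right singular vectors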

We can compare the results obtained here with the actual values from an SVD of the full data.

full_data <- do.call(rbind, siteData)
full_svd <- svd(full_data)

Let’s print the singular values side by side.

d_table <- data.frame(truth = full_svd$d, distcomp = result$d)
knitr::kable(d_table)
|    truth| distcomp|
|--------:|--------:|
| 41.82263| 41.82263|
| 41.65242| 41.65242|
| 41.32027| 41.32027|
| 40.88629| 40.88629|
| 40.41849| 40.41849|
| 40.27631| 40.27631|
| 40.12698| 40.12698|
| 39.44775| 39.44775|
| 39.24815| 39.24815|
| 39.07074| 39.07074|
| 38.37076| 38.37076|
| 38.21350| 38.21350|
| 37.61047| 37.61047|
| 37.14286| 37.14286|
| 36.98763| 36.98763|
| 36.24778| 36.24778|
| 36.04962| 36.04962|
| 35.66050| 35.66050|
| 35.21343| 35.21343|
| 34.56329| 34.56329|

We can also compare the \(v\) matrix, taking into account that the signs of the singular vectors may differ.

norm(abs(result$v) - abs(full_svd$v), "F")
## [1] 1.124008e-05
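
Rather than comparing absolute values, one can also align the sign of each estimated column with the reference before differencing; a minimal sketch:

## Flip each column of result$v to match the sign of the corresponding
## column of full_svd$v, then compute the Frobenius norm of the difference
signs <- sign(colSums(result$v * full_svd$v))
norm(sweep(result$v, 2, signs, `*`) - full_svd$v, "F")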

Stratified Cox Model

The approach for the stratified Cox model is similar. We start with a definition.

coxDef <- data.frame(compType = "StratifiedCoxModel",
                     formula = "Surv(time, censor) ~ age + becktota + ndrugfp1 + ndrugfp2 + ivhx3 + race + treat",
                     projectName = "STCoxTest",
                     projectDesc = "STCox Project Desc",
                     stringsAsFactors = FALSE)

We now create two sites with appropriate data.

## Two sites
siteDataFiles <- file.path(system.file("ex", package="distcomp"), c("uis-site1.csv", "uis-site2.csv"))
siteData <- lapply(siteDataFiles, read.csv)

We now create local objects to represent the two sites, each with a name and a worker constructed from that site's data.

sites <- lapply(seq_along(siteData),
                function(x) list(name = paste0("site", x),
                                 worker = makeWorker(defn = coxDef, data = siteData[[x]])
                                 ))

We are now ready to create a master object and add sites to it.

master <- makeMaster(coxDef)
for (site in sites) {
  master$addSite(name = site$name, worker = site$worker)
}

Now we can run the model.

result <- master$run()
print(master$summary(), digits = 4)
##        coef exp(coef) se(coef)      z         p
## 1 -0.028076    0.9723 0.008131 -3.453 5.542e-04
## 2  0.009146    1.0092 0.004991  1.832 6.691e-02
## 3 -0.521973    0.5933 0.124424 -4.195 2.727e-05
## 4 -0.194178    0.8235 0.048252 -4.024 5.717e-05
## 5  0.263634    1.3017 0.108243  2.436 1.487e-02
## 6 -0.240021    0.7866 0.115632 -2.076 3.792e-02
## 7 -0.212616    0.8085 0.093747 -2.268 2.333e-02

The above result is the same as what we obtain from a stratified Cox regression fit on the pooled data.

library(survival)  ## for Surv() and strata() in the formula
coxFit <- survival::coxph(formula = Surv(time, censor) ~ age + becktota + ndrugfp1 + ndrugfp2 +
                              ivhx3 + race + treat + strata(site),
                          data = rbind(siteData[[1]], siteData[[2]]))
summary(coxFit)$coefficients
##                   coef exp(coef)    se(coef)         z     Pr(>|z|)
## age       -0.028075893 0.9723146 0.008130685 -3.453078 5.542280e-04
## becktota   0.009145528 1.0091875 0.004991421  1.832250 6.691425e-02
## ndrugfp1  -0.521973047 0.5933487 0.124423881 -4.195119 2.727278e-05
## ndrugfp2  -0.194177573 0.8235117 0.048252289 -4.024215 5.716573e-05
## ivhx3TRUE  0.263634281 1.3016521 0.108243388  2.435569 1.486837e-02
## race      -0.240020862 0.7866115 0.115632433 -2.075723 3.791961e-02
## treat     -0.212616368 0.8084662 0.093747124 -2.267978 2.333058e-02
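
We can confirm the agreement numerically; a sketch, assuming master$summary() returns a data frame with a coef column as printed above:

max(abs(master$summary()$coef - summary(coxFit)$coefficients[, "coef"]))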

Session Info

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin21.5.0 (64-bit)
## Running under: macOS Monterey 12.5.1
## 
## Matrix products: default
## BLAS:   /usr/local/Cellar/openblas/0.3.20/lib/libopenblasp-r0.3.20.dylib
## LAPACK: /usr/local/Cellar/r/4.2.1/lib/R/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] distcomp_1.3-3 survival_3.3-1
## 
## loaded via a namespace (and not attached):
##  [1] gmp_0.6-5         Rcpp_1.0.9        highr_0.9         pillar_1.8.0     
##  [5] bslib_0.4.0       compiler_4.2.1    later_1.3.0       jquerylib_0.1.4  
##  [9] tools_4.2.1       digest_0.6.29     tibble_3.1.8      jsonlite_1.8.0   
## [13] evaluate_0.15     lifecycle_1.0.1   lattice_0.20-45   pkgconfig_2.0.3  
## [17] rlang_1.0.4       Matrix_1.4-1      DBI_1.1.3         shiny_1.7.2      
## [21] cli_3.3.0         yaml_2.3.5        xfun_0.31         fastmap_1.1.0    
## [25] httr_1.4.3        stringr_1.4.0     dplyr_1.0.9       knitr_1.39       
## [29] generics_0.1.3    sass_0.4.2        vctrs_0.4.1       tidyselect_1.1.2 
## [33] grid_4.2.1        glue_1.6.2        R6_2.5.1          fansi_1.0.3      
## [37] rmarkdown_2.14    homomorpheR_0.2-2 purrr_0.3.4       magrittr_2.0.3   
## [41] promises_1.2.0.1  htmltools_0.5.3   ellipsis_0.3.2    splines_4.2.1    
## [45] assertthat_0.2.1  mime_0.12         xtable_1.8-4      httpuv_1.6.5     
## [49] utf8_1.2.2        stringi_1.7.8     sodium_1.2.1      cachem_1.0.6