Benchmarking the IncidencePrevalence R package

To check the performance of the IncidencePrevalence package we can use the benchmarkIncidencePrevalence() function. It generates some hypothetical study cohorts, estimates incidence and prevalence under various settings, and times how long each of these analyses takes.
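To make the idea of timing an analysis concrete, here is a minimal base R sketch. The helper `time_task()` is hypothetical and is not part of IncidencePrevalence; it only illustrates recording the elapsed time of a step in minutes, which is the unit the benchmark reports. How the package times its tasks internally is not shown in this article.

```r
# Hypothetical helper (an assumption for illustration, not a package function):
# evaluate an expression and return the elapsed wall-clock time in minutes.
time_task <- function(task, expr) {
  start <- Sys.time()
  force(expr) # the lazily evaluated expression runs here, after the clock starts
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "mins"))
  data.frame(
    task = task,
    time_taken_minutes = round(elapsed, 2),
    stringsAsFactors = FALSE
  )
}

# Time a small stand-in computation
example_timing <- time_task("sum of a numeric vector", sum(as.numeric(1:1e6)))
example_timing
```

Because R evaluates function arguments lazily, the expression passed as `expr` is only run when `force(expr)` is reached, so the measured interval covers the computation itself.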

For example, we can start by benchmarking the package's mock data, which uses duckdb.

library(IncidencePrevalence)
library(visOmopResults)
library(dplyr)
library(ggplot2)

cdm <- mockIncidencePrevalence(
  sampleSize = 100,
  earliestObservationStartDate = as.Date("2010-01-01"),
  latestObservationStartDate = as.Date("2010-01-01"),
  minDaysToObservationEnd = 364,
  maxDaysToObservationEnd = 364,
  outPre = 0.1
)

timings <- benchmarkIncidencePrevalence(cdm)
timings |>
  glimpse()
#> Rows: 4
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1
#> $ cdm_name         <chr> "mock", "mock", "mock", "mock"
#> $ group_name       <chr> "task", "task", "task", "task"
#> $ group_level      <chr> "generating denominator (8 cohorts)", "yearly point p…
#> $ strata_name      <chr> "overall", "overall", "overall", "overall"
#> $ strata_level     <chr> "overall", "overall", "overall", "overall"
#> $ variable_name    <chr> "overall", "overall", "overall", "overall"
#> $ variable_level   <chr> "overall", "overall", "overall", "overall"
#> $ estimate_name    <chr> "time_taken_minutes", "time_taken_minutes", "time_tak…
#> $ estimate_type    <chr> "numeric", "numeric", "numeric", "numeric"
#> $ estimate_value   <chr> "0.13", "0.06", "0.06", "0.17"
#> $ additional_name  <chr> "dbms &&& person_n &&& min_observation_start &&& max_…
#> $ additional_level <chr> "duckdb &&& 100 &&& 2010-01-01 &&& 2010-12-31", "duck…
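Note that estimate_value is stored as character, so it needs converting before any arithmetic. The following is a minimal base R sketch that uses a hand-copied data frame of the four timings above. The `mock_timings` name and the `minutes` and `seconds` columns are illustrative and not part of the package; with a real result we would start from `timings` instead.

```r
# Timings copied by hand from the mock benchmark output above (illustrative
# stand-in for the `timings` object returned by the benchmark).
mock_timings <- data.frame(
  group_level = c(
    "generating denominator (8 cohorts)",
    "yearly point prevalence for two outcomes with eight denominator cohorts",
    "yearly period prevalence for two outcomes with eight denominator cohorts",
    "yearly incidence for two outcomes with eight denominator cohorts"
  ),
  estimate_value = c("0.13", "0.06", "0.06", "0.17"),
  stringsAsFactors = FALSE
)

# estimate_value is character, so convert it before doing arithmetic
mock_timings$minutes <- as.numeric(mock_timings$estimate_value)
mock_timings$seconds <- mock_timings$minutes * 60

# Total time across the four benchmark tasks, in minutes
total_minutes <- sum(mock_timings$minutes)
total_minutes
```

For this run the four tasks add up to about 0.42 minutes, roughly 25 seconds.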

We can see our results like so:

visOmopTable(timings,
  hide = c(
    "variable_name", "variable_level",
    "strata_name", "strata_level"
  ),
  groupColumn = "task"
)
CDM name  Dbms    Person n  Min observation start  Max observation end  Estimate name       Estimate value
generating denominator (8 cohorts)
mock      duckdb  100       2010-01-01             2010-12-31           time_taken_minutes  0.13
yearly point prevalence for two outcomes with eight denominator cohorts
mock      duckdb  100       2010-01-01             2010-12-31           time_taken_minutes  0.06
yearly period prevalence for two outcomes with eight denominator cohorts
mock      duckdb  100       2010-01-01             2010-12-31           time_taken_minutes  0.06
yearly incidence for two outcomes with eight denominator cohorts
mock      duckdb  100       2010-01-01             2010-12-31           time_taken_minutes  0.17

Results from test databases

Here we can see the results of running the benchmark on test datasets hosted on different database management systems. These benchmarks have already been run, so we'll start by loading the results.

test_db <- IncidencePrevalenceBenchmarkResults
test_db |>
  glimpse()
#> Rows: 16
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
#> $ cdm_name         <chr> "ohdsi_postgres", "ohdsi_postgres", "ohdsi_postgres",…
#> $ group_name       <chr> "task", "task", "task", "task", "task", "task", "task…
#> $ group_level      <chr> "generating denominator (8 cohorts)", "yearly point p…
#> $ strata_name      <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level     <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name    <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_level   <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ estimate_name    <chr> "time_taken_minutes", "time_taken_minutes", "time_tak…
#> $ estimate_type    <chr> "numeric", "numeric", "numeric", "numeric", "numeric"…
#> $ estimate_value   <chr> "0.81", "0.23", "0.23", "1.02", "1.2", "0.25", "0.24"…
#> $ additional_name  <chr> "dbms &&& person_n &&& min_observation_start &&& max_…
#> $ additional_level <chr> "postgresql &&& 1000 &&& 2008-01-01 &&& 2010-12-31", …

We can then combine these with the timings from our mock data and show everything in one table.

visOmopTable(bind(timings, test_db),
  hide = c(
    "variable_name", "variable_level",
    "strata_name", "strata_level"
  ),
  groupColumn = "task"
)
CDM name          Dbms        Person n  Min observation start  Max observation end  Estimate name       Estimate value
generating denominator (8 cohorts)
mock              duckdb      100       2010-01-01             2010-12-31           time_taken_minutes  0.13
ohdsi_postgres    postgresql  1000      2008-01-01             2010-12-31           time_taken_minutes  0.81
ohdsi_redshift    redshift    1000      2007-12-15             2010-12-31           time_taken_minutes  1.20
ohdsi_sql_Server  sql server  1000      2008-01-01             2010-12-31           time_taken_minutes  0.55
ohdsi_snowflake   snowflake   116352    2007-11-27             2010-12-31           time_taken_minutes  2.03
yearly point prevalence for two outcomes with eight denominator cohorts
mock              duckdb      100       2010-01-01             2010-12-31           time_taken_minutes  0.06
ohdsi_postgres    postgresql  1000      2008-01-01             2010-12-31           time_taken_minutes  0.23
ohdsi_redshift    redshift    1000      2007-12-15             2010-12-31           time_taken_minutes  0.25
ohdsi_sql_Server  sql server  1000      2008-01-01             2010-12-31           time_taken_minutes  0.18
ohdsi_snowflake   snowflake   116352    2007-11-27             2010-12-31           time_taken_minutes  0.50
yearly period prevalence for two outcomes with eight denominator cohorts
mock              duckdb      100       2010-01-01             2010-12-31           time_taken_minutes  0.06
ohdsi_postgres    postgresql  1000      2008-01-01             2010-12-31           time_taken_minutes  0.23
ohdsi_redshift    redshift    1000      2007-12-15             2010-12-31           time_taken_minutes  0.24
ohdsi_sql_Server  sql server  1000      2008-01-01             2010-12-31           time_taken_minutes  0.18
ohdsi_snowflake   snowflake   116352    2007-11-27             2010-12-31           time_taken_minutes  0.37
yearly incidence for two outcomes with eight denominator cohorts
mock              duckdb      100       2010-01-01             2010-12-31           time_taken_minutes  0.17
ohdsi_postgres    postgresql  1000      2008-01-01             2010-12-31           time_taken_minutes  1.02
ohdsi_redshift    redshift    1000      2007-12-15             2010-12-31           time_taken_minutes  1.38
ohdsi_sql_Server  sql server  1000      2008-01-01             2010-12-31           time_taken_minutes  0.70
ohdsi_snowflake   snowflake   116352    2007-11-27             2010-12-31           time_taken_minutes  2.49
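
To compare databases side by side, the timings can also be arranged with one row per database and one column per task. The following is a minimal base R sketch using values copied by hand from the table above. The `bench` data frame and the shortened task labels are illustrative assumptions, not package output.

```r
# Timings (in minutes) copied by hand from the table above. The short task
# labels are abbreviations introduced here for readability.
bench <- data.frame(
  cdm_name = rep(
    c("mock", "ohdsi_postgres", "ohdsi_redshift",
      "ohdsi_sql_Server", "ohdsi_snowflake"),
    times = 4
  ),
  task = rep(
    c("denominator", "point prevalence", "period prevalence", "incidence"),
    each = 5
  ),
  minutes = c(
    0.13, 0.81, 1.20, 0.55, 2.03, # generating denominator (8 cohorts)
    0.06, 0.23, 0.25, 0.18, 0.50, # yearly point prevalence
    0.06, 0.23, 0.24, 0.18, 0.37, # yearly period prevalence
    0.17, 1.02, 1.38, 0.70, 2.49  # yearly incidence
  ),
  stringsAsFactors = FALSE
)

# One row per database, one column per task
wide <- tapply(bench$minutes, list(bench$cdm_name, bench$task), sum)

# Total benchmark time per database, in minutes
total_by_db <- rowSums(wide)
total_by_db
```

When reading these totals, keep in mind that the datasets differ in size: the snowflake dataset has 116352 people, the other test databases have 1000, and the mock data has 100, so the times are not a like-for-like comparison of the database systems.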