
Benchmarking the IncidencePrevalence R package
a07_benchmark.Rmd
To check the performance of the IncidencePrevalence package we can use the benchmarkIncidencePrevalence(). This function generates some hypothetical study cohorts and the estimates incidence and prevalence using various settings and times how long these analyses take.
We can start for example by benchmarking our example mock data which uses duckdb.
library(IncidencePrevalence)
library(visOmopResults)
library(dplyr)
library(ggplot2)
library(stringr)
cdm <- mockIncidencePrevalence(
sampleSize = 100,
earliestObservationStartDate = as.Date("2010-01-01"),
latestObservationStartDate = as.Date("2010-01-01"),
minDaysToObservationEnd = 364,
maxDaysToObservationEnd = 364,
outPre = 0.1
)
timings <- benchmarkIncidencePrevalence(cdm)
timings |>
glimpse()
#> Rows: 4
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1
#> $ cdm_name <chr> "mock", "mock", "mock", "mock"
#> $ group_name <chr> "task", "task", "task", "task"
#> $ group_level <chr> "generating denominator (8 cohorts)", "yearly point p…
#> $ strata_name <chr> "overall", "overall", "overall", "overall"
#> $ strata_level <chr> "overall", "overall", "overall", "overall"
#> $ variable_name <chr> "overall", "overall", "overall", "overall"
#> $ variable_level <chr> "overall", "overall", "overall", "overall"
#> $ estimate_name <chr> "time_taken_minutes", "time_taken_minutes", "time_tak…
#> $ estimate_type <chr> "numeric", "numeric", "numeric", "numeric"
#> $ estimate_value <chr> "0.1", "0.05", "0.05", "0.15"
#> $ additional_name <chr> "dbms &&& person_n &&& min_observation_start &&& max_…
#> $ additional_level <chr> "duckdb &&& 100 &&& 2010-01-01 &&& 2010-12-31", "duck…
We can see our results like so:
visOmopTable(timings,
hide = c(
"variable_name", "variable_level",
"strata_name", "strata_level"
),
groupColumn = "task"
)
CDM name | Dbms | Person n | Min observation start | Max observation end | Estimate name | Estimate value |
---|---|---|---|---|---|---|
generating denominator (8 cohorts) | ||||||
mock | duckdb | 100 | 2010-01-01 | 2010-12-31 | time_taken_minutes | 0.10 |
yearly point prevalence for two outcomes with eight denominator cohorts | ||||||
mock | duckdb | 100 | 2010-01-01 | 2010-12-31 | time_taken_minutes | 0.05 |
yearly period prevalence for two outcomes with eight denominator cohorts | ||||||
mock | duckdb | 100 | 2010-01-01 | 2010-12-31 | time_taken_minutes | 0.05 |
yearly incidence for two outcomes with eight denominator cohorts | ||||||
mock | duckdb | 100 | 2010-01-01 | 2010-12-31 | time_taken_minutes | 0.15 |
Results from test databases
Here we can see the results from the running the benchmark on test datasets on different databases management systems. These benchmarks have already been run so we’ll start by loading the results.
test_db <- IncidencePrevalenceBenchmarkResults |>
filter(str_detect(cdm_name, "CPRD", negate = TRUE))
test_db |>
glimpse()
#> Rows: 20
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "ohdsi_postgres", "ohdsi_postgres", "ohdsi_postgres",…
#> $ group_name <chr> "task", "task", "task", "task", "task", "task", "task…
#> $ group_level <chr> "generating denominator (8 cohorts)", "yearly point p…
#> $ strata_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ estimate_name <chr> "time_taken_minutes", "time_taken_minutes", "time_tak…
#> $ estimate_type <chr> "numeric", "numeric", "numeric", "numeric", "numeric"…
#> $ estimate_value <chr> "1.55", "0.33", "0.36", "2.35", "1.39", "0.34", "0.36…
#> $ additional_name <chr> "dbms &&& person_n &&& min_observation_start &&& max_…
#> $ additional_level <chr> "postgresql &&& 1000 &&& 2008-01-01 &&& 2010-12-31", …
visOmopTable(bind(timings, test_db),
settingsColumn = "package_version",
hide = c(
"variable_name", "variable_level",
"strata_name", "strata_level"
),
groupColumn = "task"
)
CDM name | Dbms | Person n | Min observation start | Max observation end | Package version | Estimate name | Estimate value |
---|---|---|---|---|---|---|---|
generating denominator (8 cohorts) | |||||||
mock | duckdb | 100 | 2010-01-01 | 2010-12-31 | 1.2.0 | time_taken_minutes | 0.10 |
ohdsi_postgres | postgresql | 1000 | 2008-01-01 | 2010-12-31 | 1.1.0 | time_taken_minutes | 1.55 |
ohdsi_redshift | redshift | 1000 | 2007-12-15 | 2010-12-31 | 1.1.0 | time_taken_minutes | 1.39 |
ohdsi_sql_Server | sql server | 1000 | 2008-01-01 | 2010-12-31 | 1.1.0 | time_taken_minutes | 0.75 |
ohdsi_snowflake | snowflake | 116352 | 2007-11-27 | 2010-12-31 | 1.1.0 | time_taken_minutes | 2.17 |
darwin_databricks_spark | spark | 2694 | 1908-09-22 | 2019-07-03 | 1.1.0 | time_taken_minutes | 4.71 |
yearly point prevalence for two outcomes with eight denominator cohorts | |||||||
mock | duckdb | 100 | 2010-01-01 | 2010-12-31 | 1.2.0 | time_taken_minutes | 0.05 |
ohdsi_postgres | postgresql | 1000 | 2008-01-01 | 2010-12-31 | 1.1.0 | time_taken_minutes | 0.33 |
ohdsi_redshift | redshift | 1000 | 2007-12-15 | 2010-12-31 | 1.1.0 | time_taken_minutes | 0.34 |
ohdsi_sql_Server | sql server | 1000 | 2008-01-01 | 2010-12-31 | 1.1.0 | time_taken_minutes | 0.20 |
ohdsi_snowflake | snowflake | 116352 | 2007-11-27 | 2010-12-31 | 1.1.0 | time_taken_minutes | 0.82 |
darwin_databricks_spark | spark | 2694 | 1908-09-22 | 2019-07-03 | 1.1.0 | time_taken_minutes | 0.93 |
yearly period prevalence for two outcomes with eight denominator cohorts | |||||||
mock | duckdb | 100 | 2010-01-01 | 2010-12-31 | 1.2.0 | time_taken_minutes | 0.05 |
ohdsi_postgres | postgresql | 1000 | 2008-01-01 | 2010-12-31 | 1.1.0 | time_taken_minutes | 0.36 |
ohdsi_redshift | redshift | 1000 | 2007-12-15 | 2010-12-31 | 1.1.0 | time_taken_minutes | 0.36 |
ohdsi_sql_Server | sql server | 1000 | 2008-01-01 | 2010-12-31 | 1.1.0 | time_taken_minutes | 0.21 |
ohdsi_snowflake | snowflake | 116352 | 2007-11-27 | 2010-12-31 | 1.1.0 | time_taken_minutes | 0.59 |
darwin_databricks_spark | spark | 2694 | 1908-09-22 | 2019-07-03 | 1.1.0 | time_taken_minutes | 1.03 |
yearly incidence for two outcomes with eight denominator cohorts | |||||||
mock | duckdb | 100 | 2010-01-01 | 2010-12-31 | 1.2.0 | time_taken_minutes | 0.15 |
ohdsi_postgres | postgresql | 1000 | 2008-01-01 | 2010-12-31 | 1.1.0 | time_taken_minutes | 2.35 |
ohdsi_redshift | redshift | 1000 | 2007-12-15 | 2010-12-31 | 1.1.0 | time_taken_minutes | 2.58 |
ohdsi_sql_Server | sql server | 1000 | 2008-01-01 | 2010-12-31 | 1.1.0 | time_taken_minutes | 1.08 |
ohdsi_snowflake | snowflake | 116352 | 2007-11-27 | 2010-12-31 | 1.1.0 | time_taken_minutes | 5.36 |
darwin_databricks_spark | spark | 2694 | 1908-09-22 | 2019-07-03 | 1.1.0 | time_taken_minutes | 5.61 |
Results from real databases
Above we’ve seen performance on small test databases. However, more interesting is to know how the package performs on our actual patient-level data, which is often much larger. Below our results from running our benchmarking tasks against real patient datasets.
real_db <- IncidencePrevalenceBenchmarkResults |>
filter(str_detect(cdm_name, "CPRD"))
visOmopTable(real_db,
settingsColumn = "package_version",
hide = c(
"variable_name", "variable_level",
"strata_name", "strata_level"
),
groupColumn = "task"
)
CDM name | Dbms | Person n | Min observation start | Max observation end | Package version | Estimate name | Estimate value |
---|---|---|---|---|---|---|---|
generating denominator (8 cohorts) | |||||||
CPRD GOLD | postgresql | 17521504 | 1987-09-09 | 2024-06-15 | 1.1.0 | time_taken_minutes | 31.06 |
yearly point prevalence for two outcomes with eight denominator cohorts | |||||||
CPRD GOLD | postgresql | 17521504 | 1987-09-09 | 2024-06-15 | 1.1.0 | time_taken_minutes | 14.08 |
yearly period prevalence for two outcomes with eight denominator cohorts | |||||||
CPRD GOLD | postgresql | 17521504 | 1987-09-09 | 2024-06-15 | 1.1.0 | time_taken_minutes | 14.78 |
yearly incidence for two outcomes with eight denominator cohorts | |||||||
CPRD GOLD | postgresql | 17521504 | 1987-09-09 | 2024-06-15 | 1.1.0 | time_taken_minutes | 64.51 |
Sharing your benchmarking results
Sharing your benchmark results will help us improve the package. To run the benchmark, connect to your database and create your cdm reference. Then run the benchmark like below and export the results as a csv.
library(CDMConnector)
library(IncidencePrevalence)
cdm <- cdmFromCon("....")
timings <- benchmarkIncidencePrevalence(cdm)
exportSummarisedResult(
timings,
minCellCount = 5,
fileName = "results_{cdm_name}_{date}.csv",
path = getwd()
)