The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is a commonly used format for storing and analyzing observational health data derived from electronic health records, insurance claims, registries, and other sources. Source data is “mapped” into the OMOP CDM format providing researchers with a standardized interface for querying and analyzing observational health data. The CDMConnector package provides tools for working with OMOP Common Data Model (CDM) tables using familiar dplyr syntax and using the tidyverse design principles popular in the R ecosystem.
This vignette is for new users of CDMConnector who have access to data already mapped into the OMOP CDM format. However, CDMConnector does provide several example synthetic datasets in the OMOP CDM format. To learn more about the OMOP CDM or the mapping process check out these resources.
Creating a reference to the OMOP CDM
Typically OMOP CDM datasets are stored in a database and can range in size from hundreds of patients with thousands of records to hundreds of millions of patients with billions of records. The Observational Health Data Science and Infromatics (OHDSI) community supports a selection of popular database platforms including Postgres, Microsoft SQL Server, Oracle, as well as cloud data platforms suchs as Amazon Redshift, Google Big Query, Databricks, and Snowflake. The first step in using CDMConnector is to create a connection to your database from R. This can take some effort the first time you set up drivers. See the “Database Connection Examples” vignette or check out the Posit’s database documentation.
In our example’s we will use some synthetic data from the Synthea project
that has been mapped to the OMOP CDM format. We’ll use the duckdb database which is a file based
database similar to SQLite but with better date type support. To see all
the example datasets available run example_datasets()
.
library(CDMConnector)
example_datasets()
#> [1] "GiBleed" "synthea-allergies-10k"
#> [3] "synthea-anemia-10k" "synthea-breast_cancer-10k"
#> [5] "synthea-contraceptives-10k" "synthea-covid19-10k"
#> [7] "synthea-covid19-200k" "synthea-dermatitis-10k"
#> [9] "synthea-heart-10k" "synthea-hiv-10k"
#> [11] "synthea-lung_cancer-10k" "synthea-medications-10k"
#> [13] "synthea-metabolic_syndrome-10k" "synthea-opioid_addiction-10k"
#> [15] "synthea-rheumatoid_arthritis-10k" "synthea-snf-10k"
#> [17] "synthea-surgery-10k" "synthea-total_joint_replacement-10k"
#> [19] "synthea-veteran_prostate_cancer-10k" "synthea-veterans-10k"
#> [21] "synthea-weight_loss-10k" "empty_cdm"
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir("GiBleed"))
DBI::dbListTables(con)
#> [1] "care_site" "cdm_source" "concept"
#> [4] "concept_ancestor" "concept_class" "concept_relationship"
#> [7] "concept_synonym" "condition_era" "condition_occurrence"
#> [10] "cost" "death" "device_exposure"
#> [13] "domain" "dose_era" "drug_era"
#> [16] "drug_exposure" "drug_strength" "fact_relationship"
#> [19] "location" "measurement" "metadata"
#> [22] "note" "note_nlp" "observation"
#> [25] "observation_period" "payer_plan_period" "person"
#> [28] "procedure_occurrence" "provider" "relationship"
#> [31] "source_to_concept_map" "specimen" "visit_detail"
#> [34] "visit_occurrence" "vocabulary"
If you’re using CDMConnector for the first time you may get a message
about adding an enviroment vairable EUNOMIA_DATA_FOLDER
.
To do this simply create a new text file in your home directory called
.Renviron and add the line
EUNOMIA_DATA_FOLDER="path/to/folder/where/we/can/store/example/data"
.
If you run usethis::edit_r_environ()
this file will be
created and opened for you and opened in RStudio.
After connecting to a database containing data mapped to the OMOP
CDM, use cdm_from_con
to create a CDM reference. This CDM
reference is a single object that contains dplyr table references to
each CDM table along with metadata about the CDM instance.
The cdm_schema is the schema in the database that contains the OMOP CDM tables and is required. All other arguments are optional.
cdm <- cdm_from_con(con, cdm_schema = "main")
cdm
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: person, observation_period, visit_occurrence, visit_detail, condition_occurrence, drug_exposure, procedure_occurrence, device_exposure, measurement, observation, death, note, note_nlp, specimen, fact_relationship, location, care_site, provider, payer_plan_period, cost, drug_era, dose_era, condition_era, metadata, cdm_source, concept, vocabulary, domain, concept_class, concept_relationship, relationship, concept_synonym, concept_ancestor, source_to_concept_map, drug_strength
cdm$observation_period
#> # Source: table<observation_period> [?? x 5]
#> # Database: DuckDB v0.9.2 [root@Darwin 23.0.0:R 4.3.1//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpjoA8Eb/file9aaa2292c32a.duckdb]
#> observation_period_id person_id observation_period_s…¹ observation_period_e…²
#> <int> <int> <date> <date>
#> 1 6 6 1963-12-31 2007-02-06
#> 2 13 13 2009-04-26 2019-04-14
#> 3 27 27 2002-01-30 2018-11-21
#> 4 16 16 1971-10-14 2017-11-02
#> 5 55 55 2009-05-30 2019-03-23
#> 6 60 60 1990-11-21 2019-01-23
#> 7 42 42 1909-11-03 2019-03-13
#> 8 33 33 1986-05-12 2018-09-10
#> 9 18 18 1965-11-17 2018-11-07
#> 10 25 25 2007-03-18 2019-04-07
#> # ℹ more rows
#> # ℹ abbreviated names: ¹observation_period_start_date,
#> # ²observation_period_end_date
#> # ℹ 1 more variable: period_type_concept_id <int>
Individual CDM table references can be accessed using `$`.
cdm$person %>%
dplyr::glimpse()
#> Rows: ??
#> Columns: 18
#> Database: DuckDB v0.9.2 [root@Darwin 23.0.0:R 4.3.1//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpjoA8Eb/file9aaa2292c32a.duckdb]
#> $ person_id <int> 6, 123, 129, 16, 65, 74, 42, 187, 18, 111,…
#> $ gender_concept_id <int> 8532, 8507, 8507, 8532, 8532, 8532, 8532, …
#> $ year_of_birth <int> 1963, 1950, 1974, 1971, 1967, 1972, 1909, …
#> $ month_of_birth <int> 12, 4, 10, 10, 3, 1, 11, 7, 11, 5, 8, 3, 3…
#> $ day_of_birth <int> 31, 12, 7, 13, 31, 5, 2, 23, 17, 2, 19, 13…
#> $ birth_datetime <dttm> 1963-12-31, 1950-04-12, 1974-10-07, 1971-…
#> $ race_concept_id <int> 8516, 8527, 8527, 8527, 8516, 8527, 8527, …
#> $ ethnicity_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ location_id <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ provider_id <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ care_site_id <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ person_source_value <chr> "001f4a87-70d0-435c-a4b9-1425f6928d33", "0…
#> $ gender_source_value <chr> "F", "M", "M", "F", "F", "F", "F", "M", "F…
#> $ gender_source_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ race_source_value <chr> "black", "white", "white", "white", "black…
#> $ race_source_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ ethnicity_source_value <chr> "west_indian", "italian", "polish", "ameri…
#> $ ethnicity_source_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
You can then use dplyr to query the cdm tables just as you would an R
dataframe. The difference is that the data stays in the database and SQL
code is dynamically generated and set to the database backend. The goal
is to allow users to not think too much about the database or SQL and
instead use familiar R syntax to work with these large tables.
collect
will bring the data from the database into R. Be
careful not to request a gigantic result set! In general it is better to
aggregate data in the database, if possible, before bringing data into
R.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
cdm$person %>%
group_by(year_of_birth, gender_concept_id) %>%
summarize(n = n(), .groups = "drop") %>%
collect() %>%
mutate(sex = case_when(
gender_concept_id == 8532 ~ "Female",
gender_concept_id == 8507 ~ "Male"
)) %>%
ggplot(aes(y = n, x = year_of_birth, fill = sex)) +
geom_histogram(stat = "identity", position = "dodge") +
labs(x = "Year of birth",
y = "Person count",
title = "Age Distribution",
subtitle = cdm_name(cdm),
fill = NULL) +
theme_bw()
Joining tables
Since the OMOP CDM is a relational data model joins are very common in analytic code. All of the events in the OMOP CDM are recorded using integers representing standard “concepts”. To see the text description of a concept researchers need to join clinical tables to the concept vocabulary table. Every OMOP CDM should have a copy of the vocabulary used to map the data to the OMOP CDM format.
Here is an example query looking at the most common conditions in the CDM.
cdm$condition_occurrence %>%
count(condition_concept_id, sort = T) %>%
left_join(cdm$concept, by = c("condition_concept_id" = "concept_id")) %>%
collect() %>%
select("condition_concept_id", "concept_name", "n")
#> # A tibble: 80 × 3
#> condition_concept_id concept_name n
#> <int> <chr> <dbl>
#> 1 372328 Otitis media 3605
#> 2 140673 Hypothyroidism 31
#> 3 4278672 Fracture of forearm 569
#> 4 40479422 Injury of medial collateral ligament of knee 113
#> 5 4048695 Fracture of vertebral column without spinal cord … 23
#> 6 4027663 Peptic ulcer 802
#> 7 4310024 Angiodysplasia of stomach 388
#> 8 4280726 Seasonal allergic rhinitis 25
#> 9 192671 Gastrointestinal hemorrhage 479
#> 10 257012 Chronic sinusitis 825
#> # ℹ 70 more rows
Let’s look at the most common drugs used by patients with “Acute viral pharyngitis”.
cdm$condition_occurrence %>%
filter(condition_concept_id == 4112343) %>%
distinct(person_id) %>%
inner_join(cdm$drug_exposure, by = "person_id") %>%
count(drug_concept_id, sort = TRUE) %>%
left_join(cdm$concept, by = c("drug_concept_id" = "concept_id")) %>%
collect() %>%
select("concept_name", "n")
#> # A tibble: 113 × 2
#> concept_name n
#> <chr> <dbl>
#> 1 {7 (Inert Ingredients 1 MG Oral Tablet) / 21 (Mestranol 0.05 MG / Nore… 997
#> 2 Penicillin V Potassium 250 MG Oral Tablet 1666
#> 3 Methylphenidate Hydrochloride 20 MG Oral Tablet 63
#> 4 Haemophilus influenzae type b vaccine, PRP-OMP conjugate 1295
#> 5 Ibuprofen 100 MG Oral Tablet 360
#> 6 Amoxicillin 500 MG Oral Tablet 246
#> 7 Naproxen 500 MG Oral Tablet 18
#> 8 {24 (drospirenone 3 MG / Ethinyl Estradiol 0.02 MG Oral Tablet) / 4 (I… 6
#> 9 remifentanil 16
#> 10 Piperacillin 4000 MG / tazobactam 500 MG Injection 35
#> # ℹ 103 more rows
To inspect the generated SQL use show_query
from
dplyr.
cdm$condition_occurrence %>%
filter(condition_concept_id == 4112343) %>%
distinct(person_id) %>%
inner_join(cdm$drug_exposure, by = "person_id") %>%
count(drug_concept_id, sort = TRUE) %>%
left_join(cdm$concept, by = c("drug_concept_id" = "concept_id")) %>%
show_query()
#> <SQL>
#> SELECT
#> LHS.*,
#> concept_name,
#> domain_id,
#> vocabulary_id,
#> concept_class_id,
#> standard_concept,
#> concept_code,
#> valid_start_date,
#> valid_end_date,
#> invalid_reason
#> FROM (
#> SELECT drug_concept_id, COUNT(*) AS n
#> FROM (
#> SELECT
#> LHS.person_id AS person_id,
#> drug_exposure_id,
#> drug_concept_id,
#> drug_exposure_start_date,
#> drug_exposure_start_datetime,
#> drug_exposure_end_date,
#> drug_exposure_end_datetime,
#> verbatim_end_date,
#> drug_type_concept_id,
#> stop_reason,
#> refills,
#> quantity,
#> days_supply,
#> sig,
#> route_concept_id,
#> lot_number,
#> provider_id,
#> visit_occurrence_id,
#> visit_detail_id,
#> drug_source_value,
#> drug_source_concept_id,
#> route_source_value,
#> dose_unit_source_value
#> FROM (
#> SELECT DISTINCT person_id
#> FROM main.condition_occurrence
#> WHERE (condition_concept_id = 4112343.0)
#> ) LHS
#> INNER JOIN main.drug_exposure
#> ON (LHS.person_id = drug_exposure.person_id)
#> ) q01
#> GROUP BY drug_concept_id
#> ) LHS
#> LEFT JOIN main.concept
#> ON (LHS.drug_concept_id = concept.concept_id)
These are a few simple queries. More complex queries can be built by combining simple queries like the ones above and other analytic packages provide functions that implement common analytic use cases.
For example a “cohort definition” is a set of criteria that persons must satisfy that can be quite complex. The “Working with Cohorts” vignette describes creating and using cohorts with CDMConnector.
Saving query results to the database
Sometime it is helpful to save query results to the database instead
of reading the result into R. dplyr provides the compute
function but due to differences between database systems CDMConnector
has needed to export a wrapper that handles the slight differences.
CDMConnector’s compute_query
function works similarly but
is tested across the OHDSI supported database platforms. It also has a
few more options than dplyr::compute.
If we are writing data to the CDM database we need to add one more argument when creating our cdm reference object, the “write_schema”. This is a schema in the database where you have write permissions. Typically this should be a separate schema from the “cdm_schema”.
DBI::dbExecute(con, "create schema scratch;")
#> [1] 0
cdm <- cdm_from_con(con, cdm_schema = "main", write_schema = "scratch")
drugs <- cdm$condition_occurrence %>%
filter(condition_concept_id == 4112343) %>%
distinct(person_id) %>%
inner_join(cdm$drug_exposure, by = "person_id") %>%
count(drug_concept_id, sort = TRUE) %>%
left_join(cdm$concept, by = c("drug_concept_id" = "concept_id")) %>%
compute_query(name = "test",
temporary = FALSE,
schema = "scratch",
overwrite = TRUE)
drugs %>% show_query()
#> <SQL>
#> SELECT *
#> FROM scratch.test
drugs
#> # Source: table<test> [?? x 11]
#> # Database: DuckDB v0.9.2 [root@Darwin 23.0.0:R 4.3.1//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpjoA8Eb/file9aaa2292c32a.duckdb]
#> drug_concept_id n concept_name domain_id vocabulary_id concept_class_id
#> <int> <dbl> <chr> <chr> <chr> <chr>
#> 1 19133905 1060 Penicillin V … Drug RxNorm Clinical Drug
#> 2 40173590 129 Alendronic ac… Drug RxNorm Clinical Drug
#> 3 40220960 210 alteplase 100… Drug RxNorm Clinical Drug
#> 4 19073183 188 Amoxicillin 2… Drug RxNorm Clinical Drug
#> 5 920300 779 Nitrofurantoi… Drug RxNorm Clinical Drug
#> 6 19126352 207 Nitroglycerin… Drug RxNorm Clinical Drug
#> 7 1332419 179 Amlodipine 5 … Drug RxNorm Clinical Drug
#> 8 1545959 57 atorvastatin … Drug RxNorm Clinical Drug
#> 9 45892894 26 24 HR Donepez… Drug RxNorm Quant Clinical …
#> 10 1519937 6 Etonogestrel … Drug RxNorm Clinical Drug
#> # ℹ more rows
#> # ℹ 5 more variables: standard_concept <chr>, concept_code <chr>,
#> # valid_start_date <date>, valid_end_date <date>, invalid_reason <chr>
We can see that the query has been saved to a new table in the
scratch schema. compute_query
returns a dplyr reference to
this table.
Selecting a subset of CDM tables
If you do not need references to all tables you can easily select
only a subset of tables to include in the CDM reference. The
cdm_select_tbl
function supports the tidyselect
selection language and provides a new selection helper:
tbl_group
.
cdm_from_con(con, cdm_schema = "main") %>% cdm_select_tbl("person", "observation_period") # quoted names
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: person, observation_period
cdm_from_con(con, cdm_schema = "main") %>% cdm_select_tbl(person, observation_period) # unquoted names
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: person, observation_period
cdm_from_con(con, cdm_schema = "main") %>% cdm_select_tbl(starts_with("concept")) # tables that start with 'concept'
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: concept, concept_class, concept_relationship, concept_synonym, concept_ancestor
cdm_from_con(con, cdm_schema = "main") %>% cdm_select_tbl(contains("era")) # tables that contain the substring 'era'
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: drug_era, dose_era, condition_era
cdm_from_con(con, cdm_schema = "main") %>% cdm_select_tbl(matches("person|period")) # regular expression
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: person, observation_period, payer_plan_period
Predefined sets of tables can also be selected using
tbl_group
which supports several subsets of the CDM: “all”,
“clinical”, “vocab”, “derived”, and “default”.
# pre-defined groups
cdm_from_con(con, "main") %>% cdm_select_tbl(tbl_group("clinical"))
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: person, observation_period, visit_occurrence, visit_detail, condition_occurrence, drug_exposure, procedure_occurrence, device_exposure, measurement, observation, death, note, note_nlp, specimen, fact_relationship
cdm_from_con(con, "main") %>% cdm_select_tbl(tbl_group("vocab"))
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: concept, vocabulary, domain, concept_class, concept_relationship, relationship, concept_synonym, concept_ancestor, source_to_concept_map, drug_strength
The default set of CDM tables included in a CDM object is:
tbl_group("default")
#> [1] "person" "observation_period" "visit_occurrence"
#> [4] "condition_occurrence" "drug_exposure" "procedure_occurrence"
#> [7] "measurement" "observation" "death"
#> [10] "location" "care_site" "provider"
#> [13] "drug_era" "dose_era" "condition_era"
#> [16] "cdm_source" "concept" "vocabulary"
#> [19] "concept_relationship" "concept_synonym" "concept_ancestor"
#> [22] "drug_strength"
Subsetting a CDM
Sometimes it is helpful to subset a CDM to a specific set of persons or simply down sample the data to a more reasonable size. Let’s subset our cdm to just persons with a Pneumonia (concept_id 255848). This works best then the number of persons in the subset is quite small and the database has indexes on the “person_id” columns of each table.
person_ids <- cdm$condition_occurrence %>%
filter(condition_concept_id == 255848) %>%
distinct(person_id) %>%
pull(person_id)
length(person_ids)
#> [1] 52
cdm_pneumonia <- cdm %>%
cdm_subset(person_id = person_ids)
tally(cdm_pneumonia$person) %>%
pull(n)
#> [1] 52
cdm_pneumonia$condition_occurrence %>%
distinct(person_id) %>%
tally() %>%
pull(n)
#> [1] 52
Alternatively if we simply want a random sample of the entire CDM we
can use cdm_sample
.
cdm_100person <- cdm_sample(cdm, n = 100)
tally(cdm_100person$person) %>% pull("n")
#> [1] 100
Flatten a CDM
An OMOP CDM is a relational data model. Sometimes it is helpful to flatten this relational structure into a “tidy” dataframe with one row per observation. This transformation should only be done with a small number of persons and events.
cdm_flatten(cdm_pneumonia,
domain = c("condition", "drug", "measurement")) %>%
collect()
#> # A tibble: 3,892 × 8
#> person_id observation_concept_id start_date end_date type_concept_id domain
#> <int> <int> <date> <date> <int> <chr>
#> 1 2 3011505 1990-10-02 1990-10-02 5001 measu…
#> 2 3614 3011505 2009-12-06 2009-12-06 5001 measu…
#> 3 334 3006322 1974-09-05 1974-09-05 5001 measu…
#> 4 757 4024958 1925-04-27 1925-04-27 5001 measu…
#> 5 2333 3026361 1978-05-19 1978-05-19 5001 measu…
#> 6 1162 3006322 1959-11-08 1959-11-08 5001 measu…
#> 7 1162 3051031 1963-02-18 1963-02-18 5001 measu…
#> 8 757 3006322 1921-08-05 1921-08-05 5001 measu…
#> 9 2 4133840 1998-07-14 1998-07-14 5001 measu…
#> 10 419 3051031 1913-07-24 1913-07-24 5001 measu…
#> # ℹ 3,882 more rows
#> # ℹ 2 more variables: observation_concept_name <chr>, type_concept_name <chr>
Saving a local copy of a CDM
We can use collect
to bring the whole cdm object into R
as dataframes. If you would like to save a subset of the CDM and then
restore it in R as a local CDM object, CDMConnector provides the
stow
and cdm_from_files
functions to do
this.
local_cdm <- cdm_100person %>%
collect()
# The cdm tables are now dataframes
local_cdm$person[1:4, 1:4]
#> # A tibble: 4 × 4
#> person_id gender_concept_id year_of_birth month_of_birth
#> <int> <int> <int> <int>
#> 1 72 8532 1947 7
#> 2 154 8532 1974 11
#> 3 64 8532 1974 10
#> 4 286 8532 1928 5
save_path <- file.path(tempdir(), "tmp")
dir.create(save_path)
cdm %>%
stow(path = save_path, format = "parquet")
list.files(save_path)
Restore a saved cdm object from files with
cdm_from_files
.
cdm <- cdm_from_files(save_path, cdm_name = "GI Bleed example data")
Closing connections
Close the database connection with dbDisconnect
. After a
connection is closed any cdm objects created with that connection can no
longer be used.
DBI::dbDisconnect(con, shutdown = TRUE)