Introduction
In previous vignettes we have seen how to add patient level demographics (age, sex, prior observation, …) or intersections with cohorts , concepts and tables.
Once we have added several columns to our table of interest we may
want to summarise all this data into a summarised_result
object using several different estimates.
Variables type
We support different types of variables, variable type is assigned
using dplyr::type_sum
:
Date:
date
ordttm
.Numeric:
dbl
ordrtn
.Integer:
int
orint64
.Categorical:
chr
,fct
orord
.Logical:
lgl
.
Estimates names
We can summarise this data using different estimates:
Estimate name | Description | Estimate type |
---|---|---|
date | ||
mean | mean of the variable of interest. | date |
sd | standard deviation of the variable of interest. | date |
median | median of the variable of interest. | date |
qXX | qualtile of XX% the variable of interest. | date |
min | minimum of the variable of interest. | date |
max | maximum of the variable of interest. | date |
count_missing | number of missing values. | integer |
percentage_missing | percentage of missing values | percentage |
density | density distribution | multiple |
numeric | ||
sum | sum of all the values for the variable of interest. | numeric |
mean | mean of the variable of interest. | numeric |
sd | standard deviation of the variable of interest. | numeric |
median | median of the variable of interest. | numeric |
qXX | qualtile of XX% the variable of interest. | numeric |
min | minimum of the variable of interest. | numeric |
max | maximum of the variable of interest. | numeric |
count_missing | number of missing values. | integer |
percentage_missing | percentage of missing values | percentage |
count | count number of `1`. | integer |
percentage | percentage of occurrences of `1` (NA are excluded). | percentage |
density | density distribution | multiple |
integer | ||
sum | sum of all the values for the variable of interest. | integer |
mean | mean of the variable of interest. | numeric |
sd | standard deviation of the variable of interest. | numeric |
median | median of the variable of interest. | integer |
qXX | qualtile of XX% the variable of interest. | integer |
min | minimum of the variable of interest. | integer |
max | maximum of the variable of interest. | integer |
count_missing | number of missing values. | integer |
percentage_missing | percentage of missing values | percentage |
count | count number of `1`. | integer |
percentage | percentage of occurrences of `1` (NA are excluded). | percentage |
density | density distribution | multiple |
categorical | ||
count | number of times that each category is observed. | integer |
percentage | percentage of individuals with that category. | percentage |
logical | ||
count | count number of `TRUE`. | integer |
percentage | percentage of occurrences of `TRUE` (NA are excluded). | percentage |
Summarise our first table
Lets get started creating our data that we are going to summarise:
#>
#> Download completed!
library(duckdb)
#> Loading required package: DBI
library(CDMConnector)
library(PatientProfiles)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(CodelistGenerator)
cdm <- cdmFromCon(
con = dbConnect(duckdb(), eunomia_dir()),
cdmSchema = "main",
writeSchema = "main"
)
#> Creating CDM database /tmp/RtmpKm8BZp/GiBleed_5.3.zip
#> Note: method with signature 'DBIConnection#Id' chosen for function 'dbExistsTable',
#> target signature 'duckdb_connection#Id'.
#> "duckdb_connection#ANY" would also be valid
cdm <- generateConceptCohortSet(
cdm = cdm,
conceptSet = list("sinusitis" = c(4294548, 4283893, 40481087, 257012)),
limit = "first",
name = "my_cohort"
)
#> Warning: ! 3 casted column in my_cohort (cohort_attrition) as do not match expected
#> column type:
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer
#> Warning: ! 1 casted column in my_cohort (cohort_codelist) as do not match expected
#> column type:
#> • `concept_id` from numeric to integer
cdm <- generateConceptCohortSet(
cdm = cdm,
conceptSet = getDrugIngredientCodes(cdm = cdm, name = c("morphine", "aspirin", "oxycodone")),
name = "drugs"
)
#> Warning: ! 3 casted column in drugs (cohort_attrition) as do not match expected column
#> type:
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer
x <- cdm$my_cohort |>
# add demographics variables
addDemographics() |>
# add number of counts per ingredient before and after index date
addCohortIntersectCount(
targetCohortTable = "drugs",
window = list("prior" = c(-Inf, -1), "future" = c(1, Inf)),
nameStyle = "{window_name}_{cohort_name}"
) |>
# add a flag regarding if they had a prior occurrence of pharyngitis
addConceptIntersectFlag(
conceptSet = list(pharyngitis = 4112343),
window = c(-Inf, -1),
nameStyle = "pharyngitis_before"
) |>
# date fo the first visit for that individual
addTableIntersectDate(
tableName = "visit_occurrence",
window = c(-Inf, Inf),
nameStyle = "first_visit"
) |>
# time till the next visit after sinusitis
addTableIntersectDays(
tableName = "visit_occurrence",
window = c(1, Inf),
nameStyle = "days_to_next_visit"
)
#> Warning: ! `codelist` contains numeric values, they are casted
#> to integers.
x |>
glimpse()
#> Rows: ??
#> Columns: 17
#> Database: DuckDB v1.1.1 [unknown@Linux 6.5.0-1025-azure:R 4.4.1//tmp/RtmpKm8BZp/file1c45e3961ea.duckdb]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ subject_id <int> 5199, 2392, 3199, 5234, 890, 1174, 3540, 2526, 5…
#> $ cohort_start_date <date> 1978-04-28, 1922-11-06, 1979-03-07, 1954-08-05,…
#> $ cohort_end_date <date> 2018-08-14, 2019-03-13, 2017-10-18, 2006-07-26,…
#> $ age <int> 16, 6, 6, 17, 8, 12, 5, 8, 8, 4, 7, 10, 23, 18, …
#> $ sex <chr> "Female", "Female", "Male", "Female", "Male", "F…
#> $ prior_observation <int> 5946, 2467, 2226, 6546, 3128, 4536, 2165, 3008, …
#> $ future_observation <int> 14718, 35191, 14105, 18983, 18565, 24486, 14407,…
#> $ prior_7052_morphine <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ future_7804_oxycodone <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ prior_1191_aspirin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ future_1191_aspirin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ prior_7804_oxycodone <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ future_7052_morphine <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ pharyngitis_before <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ first_visit <date> 2001-04-19, 1976-04-02, 2006-01-25, 1977-08-24,…
#> $ days_to_next_visit <dbl> 8392, 19506, 9821, 8420, 10918, 3913, 12259, 485…
In this table (x
) we have a cohort of first occurrences
of sinusitis, and then we added: demographics; the counts of 3
ingredients, any time prior and any time after the index date; a flag
indicating if they had pharyngitis before; date of the first visit; and,
finally, time to next visit.
If we want to summarise the age stratified by sex we could use tidyverse functions like:
x |>
group_by(sex) |>
summarise(mean_age = mean(age), sd_age = sd(age))
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # Source: SQL [2 x 3]
#> # Database: DuckDB v1.1.1 [unknown@Linux 6.5.0-1025-azure:R 4.4.1//tmp/RtmpKm8BZp/file1c45e3961ea.duckdb]
#> sex mean_age sd_age
#> <chr> <dbl> <dbl>
#> 1 Female 7.51 7.46
#> 2 Male 7.72 8.14
This would give us a first insight of the differences of age. But the output is not going to be in an standardised format.
In PatientProfiles we have built a function that:
Allow you to get the standardised output.
You have a wide range of estimates that you can get.
You don’t have to worry which of the functions are supported in the database side (e.g. not all dbms support quantile function).
For example we could get the same information like before using:
x |>
summariseResult(
strata = "sex",
variables = "age",
estimates = c("mean", "sd"),
counts = FALSE
) |>
select(strata_name, strata_level, variable_name, estimate_value)
#> ℹ The following estimates will be computed:
#> • age: mean, sd
#> → Start summary of data, at 2024-10-29 16:39:36.072814
#>
#> ✔ Summary finished, at 2024-10-29 16:39:36.406342
#> # A tibble: 6 × 4
#> strata_name strata_level variable_name estimate_value
#> <chr> <chr> <chr> <chr>
#> 1 overall overall age 7.61212346597248
#> 2 overall overall age 7.79743654397838
#> 3 sex Female age 7.5127644055434
#> 4 sex Male age 7.7154779969651
#> 5 sex Female age 7.45793686970357
#> 6 sex Male age 8.13712697581575
You can stratify the results also by “pharyngitis_before”:
x |>
summariseResult(
strata = list("sex", "pharyngitis_before"),
variables = "age",
estimates = c("mean", "sd"),
counts = FALSE
) |>
select(strata_name, strata_level, variable_name, estimate_value)
#> ℹ The following estimates will be computed:
#> • age: mean, sd
#> → Start summary of data, at 2024-10-29 16:39:36.981242
#>
#> ✔ Summary finished, at 2024-10-29 16:39:37.53058
#> # A tibble: 10 × 4
#> strata_name strata_level variable_name estimate_value
#> <chr> <chr> <chr> <chr>
#> 1 overall overall age 7.61212346597248
#> 2 overall overall age 7.79743654397838
#> 3 sex Female age 7.5127644055434
#> 4 sex Male age 7.7154779969651
#> 5 sex Female age 7.45793686970357
#> 6 sex Male age 8.13712697581575
#> 7 pharyngitis_before 0 age 4.95620875824835
#> 8 pharyngitis_before 1 age 11.9442270058708
#> 9 pharyngitis_before 0 age 5.61220818358422
#> 10 pharyngitis_before 1 age 8.85279666294611
Note that the interaction term was not included, if we want to include it we have to specify it as follows:
x |>
summariseResult(
strata = list("sex", "pharyngitis_before", c("sex", "pharyngitis_before")),
variables = "age",
estimates = c("mean", "sd"),
counts = FALSE
) |>
select(strata_name, strata_level, variable_name, estimate_value) |>
print(n = Inf)
#> ℹ The following estimates will be computed:
#> • age: mean, sd
#> → Start summary of data, at 2024-10-29 16:39:38.101275
#>
#> ✔ Summary finished, at 2024-10-29 16:39:38.872215
#> # A tibble: 18 × 4
#> strata_name strata_level variable_name estimate_value
#> <chr> <chr> <chr> <chr>
#> 1 overall overall age 7.61212346597248
#> 2 overall overall age 7.79743654397838
#> 3 sex Female age 7.5127644055434
#> 4 sex Male age 7.7154779969651
#> 5 sex Female age 7.45793686970357
#> 6 sex Male age 8.13712697581575
#> 7 pharyngitis_before 0 age 4.95620875824835
#> 8 pharyngitis_before 1 age 11.9442270058708
#> 9 pharyngitis_before 0 age 5.61220818358422
#> 10 pharyngitis_before 1 age 8.85279666294611
#> 11 sex &&& pharyngitis_before Female &&& 0 age 4.97596153846154
#> 12 sex &&& pharyngitis_before Female &&& 1 age 11.4285714285714
#> 13 sex &&& pharyngitis_before Male &&& 0 age 4.93652694610778
#> 14 sex &&& pharyngitis_before Male &&& 1 age 12.51966873706
#> 15 sex &&& pharyngitis_before Female &&& 0 age 5.61023639284453
#> 16 sex &&& pharyngitis_before Female &&& 1 age 8.22838499965833
#> 17 sex &&& pharyngitis_before Male &&& 0 age 5.61746547644658
#> 18 sex &&& pharyngitis_before Male &&& 1 age 9.47682947960302
You can remove overall strata with the includeOverallStrata option:
x |>
summariseResult(
includeOverallStrata = FALSE,
strata = list("sex", "pharyngitis_before"),
variables = "age",
estimates = c("mean", "sd"),
counts = FALSE
) |>
select(strata_name, strata_level, variable_name, estimate_value) |>
print(n = Inf)
#> ℹ The following estimates will be computed:
#> • age: mean, sd
#> → Start summary of data, at 2024-10-29 16:39:39.441672
#>
#> ✔ Summary finished, at 2024-10-29 16:39:39.88409
#> # A tibble: 8 × 4
#> strata_name strata_level variable_name estimate_value
#> <chr> <chr> <chr> <chr>
#> 1 sex Female age 7.5127644055434
#> 2 sex Male age 7.7154779969651
#> 3 sex Female age 7.45793686970357
#> 4 sex Male age 8.13712697581575
#> 5 pharyngitis_before 0 age 4.95620875824835
#> 6 pharyngitis_before 1 age 11.9442270058708
#> 7 pharyngitis_before 0 age 5.61220818358422
#> 8 pharyngitis_before 1 age 8.85279666294611
The results model has two levels of grouping (group and strata), you can specify them independently:
x |>
addCohortName() |>
summariseResult(
group = "cohort_name",
includeOverallGroup = FALSE,
strata = list("sex", "pharyngitis_before"),
includeOverallStrata = TRUE,
variables = "age",
estimates = c("mean", "sd"),
counts = FALSE
) |>
select(group_name, group_level, strata_name, strata_level, variable_name, estimate_value) |>
print(n = Inf)
#> ℹ The following estimates will be computed:
#> • age: mean, sd
#> → Start summary of data, at 2024-10-29 16:39:40.644488
#>
#> ✔ Summary finished, at 2024-10-29 16:39:41.380252
#> # A tibble: 10 × 6
#> group_name group_level strata_name strata_level variable_name estimate_value
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 cohort_name sinusitis overall overall age 7.61212346597…
#> 2 cohort_name sinusitis overall overall age 7.79743654397…
#> 3 cohort_name sinusitis sex Female age 7.51276440554…
#> 4 cohort_name sinusitis sex Male age 7.71547799696…
#> 5 cohort_name sinusitis sex Female age 7.45793686970…
#> 6 cohort_name sinusitis sex Male age 8.13712697581…
#> 7 cohort_name sinusitis pharyngiti… 0 age 4.95620875824…
#> 8 cohort_name sinusitis pharyngiti… 1 age 11.9442270058…
#> 9 cohort_name sinusitis pharyngiti… 0 age 5.61220818358…
#> 10 cohort_name sinusitis pharyngiti… 1 age 8.85279666294…
We can add or remove number subjects and records (if a person identifier is found) counts with the counts parameter:
x |>
summariseResult(
variables = "age",
estimates = c("mean", "sd"),
counts = TRUE
) |>
select(strata_name, strata_level, variable_name, estimate_value) |>
print(n = Inf)
#> ℹ The following estimates will be computed:
#> • age: mean, sd
#> → Start summary of data, at 2024-10-29 16:39:41.948579
#>
#> ✔ Summary finished, at 2024-10-29 16:39:42.155825
#> # A tibble: 4 × 4
#> strata_name strata_level variable_name estimate_value
#> <chr> <chr> <chr> <chr>
#> 1 overall overall number records 2689
#> 2 overall overall number subjects 2689
#> 3 overall overall age 7.61212346597248
#> 4 overall overall age 7.79743654397838
If you want to specify different groups of estimates per different groups of variables you can use lists:
x |>
summariseResult(
strata = "pharyngitis_before",
includeOverallStrata = FALSE,
variables = list(c("age", "prior_observation"), "sex"),
estimates = list(c("mean", "sd"), c("count", "percentage")),
counts = FALSE
) |>
select(strata_name, strata_level, variable_name, estimate_value) |>
print(n = Inf)
#> ℹ The following estimates will be computed:
#> • age: mean, sd
#> • prior_observation: mean, sd
#> • sex: count, percentage
#> → Start summary of data, at 2024-10-29 16:39:42.731705
#>
#> ✔ Summary finished, at 2024-10-29 16:39:43.102818
#> # A tibble: 16 × 4
#> strata_name strata_level variable_name estimate_value
#> <chr> <chr> <chr> <chr>
#> 1 pharyngitis_before 0 age 4.95620875824835
#> 2 pharyngitis_before 1 age 11.9442270058708
#> 3 pharyngitis_before 0 age 5.61220818358422
#> 4 pharyngitis_before 1 age 8.85279666294611
#> 5 pharyngitis_before 0 sex 832
#> 6 pharyngitis_before 1 sex 539
#> 7 pharyngitis_before 0 sex 835
#> 8 pharyngitis_before 1 sex 483
#> 9 pharyngitis_before 0 sex 49.9100179964007
#> 10 pharyngitis_before 1 sex 52.7397260273973
#> 11 pharyngitis_before 0 sex 50.0899820035993
#> 12 pharyngitis_before 1 sex 47.2602739726027
#> 13 pharyngitis_before 0 prior_observation 1986.83443311338
#> 14 pharyngitis_before 1 prior_observation 4542.85812133072
#> 15 pharyngitis_before 0 prior_observation 2053.24325390978
#> 16 pharyngitis_before 1 prior_observation 3228.06460521219
An example of a complete analysis would be:
drugs <- settings(cdm$drugs)$cohort_name
x |>
addCohortName() |>
summariseResult(
group = "cohort_name",
includeOverallGroup = FALSE,
strata = list("pharyngitis_before"),
includeOverallStrata = TRUE,
variables = list(
c(
"age", "prior_observation", "future_observation", paste0("prior_", drugs),
paste0("future_", drugs), "days_to_next_visit"
),
c("sex", "pharyngitis_before"),
c("first_visit", "cohort_start_date", "cohort_end_date")
),
estimates = list(
c("median", "q25", "q75"),
c("count", "percentage"),
c("median", "q25", "q75", "min", "max")
),
counts = TRUE
) |>
select(group_name, group_level, strata_name, strata_level, variable_name, estimate_value)
#> ℹ The following estimates will be computed:
#> • age: median, q25, q75
#> • prior_observation: median, q25, q75
#> • future_observation: median, q25, q75
#> • prior_1191_aspirin: median, q25, q75
#> • prior_7052_morphine: median, q25, q75
#> • prior_7804_oxycodone: median, q25, q75
#> • future_1191_aspirin: median, q25, q75
#> • future_7052_morphine: median, q25, q75
#> • future_7804_oxycodone: median, q25, q75
#> • days_to_next_visit: median, q25, q75
#> • sex: count, percentage
#> • pharyngitis_before: count, percentage
#> • first_visit: median, q25, q75, min, max
#> • cohort_start_date: median, q25, q75, min, max
#> • cohort_end_date: median, q25, q75, min, max
#> ! Table is collected to memory as not all requested estimates are supported on
#> the database side
#> → Start summary of data, at 2024-10-29 16:39:43.955866
#>
#> ✔ Summary finished, at 2024-10-29 16:39:44.182054
#> # A tibble: 159 × 6
#> group_name group_level strata_name strata_level variable_name estimate_value
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 cohort_name sinusitis overall overall number recor… 2689
#> 2 cohort_name sinusitis overall overall number subje… 2689
#> 3 cohort_name sinusitis overall overall cohort_start… 1968-05-06
#> 4 cohort_name sinusitis overall overall cohort_start… 1956-07-05
#> 5 cohort_name sinusitis overall overall cohort_start… 1978-09-04
#> 6 cohort_name sinusitis overall overall cohort_start… 1908-10-30
#> 7 cohort_name sinusitis overall overall cohort_start… 2018-02-13
#> 8 cohort_name sinusitis overall overall cohort_end_d… 2018-12-14
#> 9 cohort_name sinusitis overall overall cohort_end_d… 2018-08-02
#> 10 cohort_name sinusitis overall overall cohort_end_d… 2019-04-06
#> # ℹ 149 more rows