Calculating prevalence
a04_Calculating_prevalence.Rmd
Introduction
Prevalence is the total number of people with an ongoing health-related event, such as a medical condition or medication use, at a particular time or during a given period divided by the population at risk. In the previous vignettes we have seen how we can identify a denominator population and define and instantiate an outcome cohort. Prevalence then can be calculated to describe the proportion of people in the denominator population who are in the outcome cohort at a specified time point (point prevalence) or over a given time interval (period prevalence).
In the first plot below, we can We can see at time t+2 that 2 out of 5 people were in an outcome cohort, giving a point prevalence of 40%. In the second figure, period prevalence between t+2 and t+3 was also 40%. However for period prevalence between t and t+1, what do we do with those people who only contributed some time during the period? If we included them we´ll have a period prevalence of 20%, whereas if we require that everyone is observed for the full period to contribute then we´ll have a period prevalence of 33%.
Outcome definition
Outcome cohorts are defined externally. When creating outcome cohorts for estimating prevalence, important considerations relate to whether to restrict events to the first occurrence in an individuals history or not and how cohort exit is defined. These decisions will necessarily be based on the nature of the proposed outcome (e.g., whether it is an acute or chronic condition) and the research question being investigated.
In addition, it is typically not recommended to include exclusion
requirements when creating outcome cohorts. Restrictions on patient
characteristics can be specified when identifying the denominator cohort
using generateDenominatorCohortSet()
or
generateTargetDenominatorCohortSet()
Using estimatePointPrevalence() and estimatePeriodPrevalence()
estimatePointPrevalence()
and
estimatePeriodPrevalence()
are the functions we use to
estimate prevalence. To demonstrate its use, let´s load the
IncidencePrevalence package (along with a couple of packages to help for
subsequent plots) and generate 50,000 example patients using the
mockIncidencePrevalence()
function from whom we´ll create a
denominator population without adding any restrictions other than a
study period.
library(IncidencePrevalence)
library(dplyr)
library(tidyr)
cdm <- mockIncidencePrevalence(
sampleSize = 20000,
earliestObservationStartDate = as.Date("1960-01-01"),
minOutcomeDays = 365,
outPre = 0.3
)
cdm <- generateDenominatorCohortSet(
cdm = cdm, name = "denominator",
cohortDateRange = c(as.Date("1990-01-01"), as.Date("2009-12-31")),
ageGroup = list(c(0, 150)),
sex = "Both",
daysPriorObservation = 0
)
cdm$denominator %>%
glimpse()
#> Rows: ??
#> Columns: 4
#> Database: DuckDB v1.1.3-dev165 [unknown@Linux 6.5.0-1025-azure:R 4.4.2/:memory:]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id <int> 1, 3, 10, 12, 14, 16, 18, 19, 20, 22, 23, 25, 27,…
#> $ cohort_start_date <date> 1991-08-05, 2007-02-16, 2009-04-21, 1992-01-28, …
#> $ cohort_end_date <date> 1996-01-10, 2007-12-09, 2009-12-31, 1999-08-18, …
Using estimatePointPrevalence()
Let´s first calculate point prevalence on a yearly basis.
prev <- estimatePointPrevalence(
cdm = cdm,
denominatorTable = "denominator",
outcomeTable = "outcome",
interval = "Years"
)
prev %>%
glimpse()
#> Rows: 144
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "denominator_cohort_name &&& outcome_cohort_name", "d…
#> $ group_level <chr> "denominator_cohort_1 &&& cohort_1", "denominator_coh…
#> $ strata_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name <chr> "Denominator", "Outcome", "Outcome", "Outcome", "Outc…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "denominator_count", "outcome_count", "prevalence", "…
#> $ estimate_type <chr> "integer", "integer", "numeric", "numeric", "numeric"…
#> $ estimate_value <chr> "2147", "82", "0.0381928272007452", "0.03087585700949…
#> $ additional_name <chr> "prevalence_start_date &&& prevalence_end_date", "pre…
#> $ additional_level <chr> "1990-01-01 &&& 1990-01-01", "1990-01-01 &&& 1990-01-…
plotPrevalence(prev, ylim = c(0, NA))
We can also calculate point prevalence by calendar month.
prev <- estimatePointPrevalence(
cdm = cdm,
denominatorTable = "denominator",
outcomeTable = "outcome",
interval = "Months"
)
prev %>%
glimpse()
#> Rows: 1,244
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "denominator_cohort_name &&& outcome_cohort_name", "d…
#> $ group_level <chr> "denominator_cohort_1 &&& cohort_1", "denominator_coh…
#> $ strata_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name <chr> "Denominator", "Outcome", "Outcome", "Outcome", "Outc…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "denominator_count", "outcome_count", "prevalence", "…
#> $ estimate_type <chr> "integer", "integer", "numeric", "numeric", "numeric"…
#> $ estimate_value <chr> "2147", "82", "0.0381928272007452", "0.03087585700949…
#> $ additional_name <chr> "prevalence_start_date &&& prevalence_end_date", "pre…
#> $ additional_level <chr> "1990-01-01 &&& 1990-01-01", "1990-01-01 &&& 1990-01-…
plotPrevalence(prev, ylim = c(0, NA))
By using the estimatePointPrevalence() function, we can further specify where to compute point prevalence in each time interval (start, middle, end). By default, this parameter is set to start. But we can use middle instead like so:
prev <- estimatePointPrevalence(
cdm = cdm,
denominatorTable = "denominator",
outcomeTable = "outcome",
interval = "Years",
timePoint = "middle"
)
prev %>%
glimpse()
#> Rows: 144
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "denominator_cohort_name &&& outcome_cohort_name", "d…
#> $ group_level <chr> "denominator_cohort_1 &&& cohort_1", "denominator_coh…
#> $ strata_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name <chr> "Denominator", "Outcome", "Outcome", "Outcome", "Outc…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "denominator_count", "outcome_count", "prevalence", "…
#> $ estimate_type <chr> "integer", "integer", "numeric", "numeric", "numeric"…
#> $ estimate_value <chr> "2160", "76", "0.0351851851851852", "0.02820349824624…
#> $ additional_name <chr> "prevalence_start_date &&& prevalence_end_date", "pre…
#> $ additional_level <chr> "1990-07-01 &&& 1990-07-01", "1990-07-01 &&& 1990-07-…
plotPrevalence(prev, ylim = c(0, NA), options = list(line = FALSE))
Using estimatePeriodPrevalence()
To calculate period prevalence by year (i.e. each period is a calendar year)
prev <- estimatePeriodPrevalence(
cdm = cdm,
denominatorTable = "denominator",
outcomeTable = "outcome",
interval = "Years"
)
prev %>%
glimpse()
#> Rows: 144
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "denominator_cohort_name &&& outcome_cohort_name", "d…
#> $ group_level <chr> "denominator_cohort_1 &&& cohort_1", "denominator_coh…
#> $ strata_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name <chr> "Denominator", "Outcome", "Outcome", "Outcome", "Outc…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "denominator_count", "outcome_count", "prevalence", "…
#> $ estimate_type <chr> "integer", "integer", "numeric", "numeric", "numeric"…
#> $ estimate_value <chr> "2523", "189", "0.074910820451843", "0.06527258851403…
#> $ additional_name <chr> "prevalence_start_date &&& prevalence_end_date", "pre…
#> $ additional_level <chr> "1990-01-01 &&& 1990-12-31", "1990-01-01 &&& 1990-12-…
plotPrevalence(prev)
To calculate period prevalence by month (i.e. each period is a calendar month)
prev <- estimatePeriodPrevalence(
cdm = cdm,
denominatorTable = "denominator",
outcomeTable = "outcome",
interval = "Months"
)
prev %>%
glimpse()
#> Rows: 1,244
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "denominator_cohort_name &&& outcome_cohort_name", "d…
#> $ group_level <chr> "denominator_cohort_1 &&& cohort_1", "denominator_coh…
#> $ strata_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name <chr> "Denominator", "Outcome", "Outcome", "Outcome", "Outc…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "denominator_count", "outcome_count", "prevalence", "…
#> $ estimate_type <chr> "integer", "integer", "numeric", "numeric", "numeric"…
#> $ estimate_value <chr> "2176", "94", "0.0431985294117647", "0.03543109553945…
#> $ additional_name <chr> "prevalence_start_date &&& prevalence_end_date", "pre…
#> $ additional_level <chr> "1990-01-01 &&& 1990-01-31", "1990-01-01 &&& 1990-01-…
plotPrevalence(prev, ylim = c(0, NA))
When using the estimatePeriodPrevalence() function, we can set the fullContribution parameter to decide whether individuals are required to be present in the database throughout the interval of interest in order to be included (fullContribution=TRUE). If not, individuals will only be required to be present for one day of the interval to contribute (fullContribution=FALSE), which would be specified like so:
prev <- estimatePeriodPrevalence(
cdm = cdm,
denominatorTable = "denominator",
outcomeTable = "outcome",
interval = "Months",
fullContribution = FALSE
)
prev %>%
glimpse()
#> Rows: 1,244
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "denominator_cohort_name &&& outcome_cohort_name", "d…
#> $ group_level <chr> "denominator_cohort_1 &&& cohort_1", "denominator_coh…
#> $ strata_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name <chr> "Denominator", "Outcome", "Outcome", "Outcome", "Outc…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "denominator_count", "outcome_count", "prevalence", "…
#> $ estimate_type <chr> "integer", "integer", "numeric", "numeric", "numeric"…
#> $ estimate_value <chr> "2176", "94", "0.0431985294117647", "0.03543109553945…
#> $ additional_name <chr> "prevalence_start_date &&& prevalence_end_date", "pre…
#> $ additional_level <chr> "1990-01-01 &&& 1990-01-31", "1990-01-01 &&& 1990-01-…
plotPrevalence(prev)
Stratified analyses
If we specified multiple denominator populations results will be returned for each. Here for example we define three age groups for denominator populations and get three sets of estimates back when estimating prevalence.
cdm <- generateDenominatorCohortSet(
cdm = cdm, name = "denominator_age_sex",
cohortDateRange = c(as.Date("1990-01-01"), as.Date("2009-12-31")),
ageGroup = list(c(0, 39),
c(41, 65),
c(66, 150)),
sex = "Both",
daysPriorObservation = 0
)
#> ℹ Creating denominator cohorts
#> ✔ Cohorts created in 0 min and 5 sec
prev <- estimatePeriodPrevalence(
cdm = cdm,
denominatorTable = "denominator_age_sex",
outcomeTable = "outcome"
)
#> Getting prevalence for analysis 1 of 3
#> Getting prevalence for analysis 2 of 3
#> Getting prevalence for analysis 3 of 3
#> Time taken: 0 mins and 1 secs
plotPrevalence(prev, facet = "denominator_age_group")
While we specify time-varying stratifications when defining our denominator populations, if we have time-invariant stratifications we can include these at the the estimation stage. To do this we first need to add a new column to our denominator cohort with our stratification variable. Here we´ll add an example stratification just to show the idea. Note, as well as getting stratified results we´ll also get overall results.
cdm$denominator <- cdm$denominator %>%
mutate(group = if_else(as.numeric(subject_id) < 500, "first", "second"))
prev <- estimatePeriodPrevalence(
cdm = cdm,
denominatorTable = "denominator",
outcomeTable = "outcome",
strata = "group"
)
#> Getting prevalence for analysis 1 of 1
#> Time taken: 0 mins and 1 secs
plotPrevalence(prev,
colour = c("group"))
We can also stratify on multiple variables at the same time.
cdm$denominator <- cdm$denominator %>%
mutate(group_1 = if_else(as.numeric(subject_id) < 1500, "first", "second"),
group_2 = if_else(as.numeric(subject_id) < 1000, "one", "two"))
prev <- estimatePeriodPrevalence(
cdm = cdm,
denominatorTable = "denominator",
outcomeTable = "outcome",
strata = list(c("group_1"), # for just group_1
c("group_2"), # for just group_2
c("group_1", "group_2")) # for group_1 and group_2
)
#> Getting prevalence for analysis 1 of 1
#> Time taken: 0 mins and 1 secs
plotPrevalence(prev,
facet = c("group_1", "group_2"))
Other parameters
In the examples above, we have used calculated prevalence by months and years, but it can be also calculated by weeks, months or for the entire time period observed (overall). In addition, the user can decide whether to include time intervals that are not fully captured in the database (e.g., having data up to June for the last study year when computing period prevalence rates). By default, incidence will only be estimated for those intervals where the database captures all the interval (completeDatabaseIntervals=TRUE).
Given that we can set estimatePointPrevalence()
and
estimatePeriorPrevalence()
to exclude individuals based on
certain parameters (e.g., fullContribution), it is important to note
that the denominator population used to compute prevalence rates might
differ from the one calculated with
generateDenominatorCohortSet()
. Along with our central
estimate, 95 % confidence intervals are calculated using the Wilson
Score method.
Attrition
estimatePointPrevalence()
and
estimatePeriorPrevalence()
will generate a table with point
and period prevalence rates for each of the time intervals studied and
for each combination of the parameters set, respectively. We can also
view attrition associated with performing the analysis.
prev <- estimatePeriodPrevalence(
cdm = cdm,
denominatorTable = "denominator",
outcomeTable = "outcome",
interval = "Years",
fullContribution = TRUE
)
tablePrevalenceAttrition(prev, settingsColumns = character())
Denominator cohort name | Outcome cohort name | Reason | Reason id |
Variable name
|
|||
---|---|---|---|---|---|---|---|
Number records | Number subjects | Excluded records | Excluded subjects | ||||
mock; - | |||||||
denominator_cohort_1 | cohort_1 | Starting population | 1 | 20,000 | 20,000 | - | - |
Missing year of birth | 2 | 20,000 | 20,000 | 0 | 0 | ||
Missing sex | 3 | 20,000 | 20,000 | 0 | 0 | ||
Cannot satisfy age criteria during the study period based on year of birth | 4 | 20,000 | 20,000 | 0 | 0 | ||
No observation time available during study period | 5 | 10,399 | 10,399 | 9,601 | 9,601 | ||
Doesn't satisfy age criteria during the study period | 6 | 10,399 | 10,399 | 0 | 0 | ||
Prior history requirement not fulfilled during study period | 7 | 10,399 | 10,399 | 0 | 0 | ||
No observation time available after applying age, prior observation and, if applicable, target criteria | 10 | 10,149 | 10,149 | 250 | 250 | ||
Starting analysis population | 11 | 10,149 | 10,149 | - | - | ||
Not observed during the complete database interval | 12 | 10,149 | 10,149 | 0 | 0 | ||
Do not satisfy full contribution requirement for an interval | 13 | 8,385 | 8,385 | 1,764 | 1,764 |