Tidy your summarised result object
<summarised_result> format
The <summarised_result> format is a standard output format defined in omopgenerics.
Because the output is standardised, multiple functions can export results in the same format and functionality can be built on top of it, as shown in the tables and plots vignettes. However, this standard output can sometimes be hard to manipulate for a custom analysis. visOmopResults contains tools to tidy your <summarised_result> object, and they are covered in this vignette.
Tidy <summarised_result>
visOmopResults defines the tidy method for <summarised_result> objects. This function does the following:
1. Split group, strata, and additional pairs into separate columns:
The <summarised_result> object has the following pair columns: group_name-group_level, strata_name-strata_level, and additional_name-additional_level. These pairs use the &&& separator to combine multiple fields. For example, to combine cohort_name and age_group in the group_name-group_level pair: group_name = "cohort_name &&& age_group" and group_level = "my_cohort &&& <40". By default, if there is no aggregation in the group_name-group_level pair: group_name = "overall" and group_level = "overall".
ORIGINAL FORMAT:
group_name | group_level |
---|---|
cohort_name | acetaminophen |
cohort_name &&& sex | acetaminophen &&& Female |
sex &&& age_group | Male &&& <40 |
The tidy format puts each of the values in its own column. This makes the result easier to manipulate, but the output is no longer standardised, because each <summarised_result> object will have a different number of columns with different names. Missing values are filled with the “overall” label.
TIDY FORMAT:
cohort_name | sex | age_group |
---|---|---|
acetaminophen | overall | overall |
acetaminophen | Female | overall |
overall | Male | <40 |
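The splitting step can be sketched in base R. This is only an illustration of the idea on the toy table above, not the code that visOmopResults runs internally; the helper splitPair() and the object names are hypothetical.

```r
# Toy data reproducing the ORIGINAL FORMAT table above.
x <- data.frame(
  group_name = c("cohort_name", "cohort_name &&& sex", "sex &&& age_group"),
  group_level = c("acetaminophen", "acetaminophen &&& Female", "Male &&& <40"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of visOmopResults): split one name-level
# pair on the " &&& " separator and fill missing fields with "overall".
splitPair <- function(name, level, sep = " &&& ", fill = "overall") {
  nameParts <- strsplit(name, sep, fixed = TRUE)
  levelParts <- strsplit(level, sep, fixed = TRUE)
  # every distinct field becomes a column; "overall" itself is not a field
  cols <- setdiff(unique(unlist(nameParts)), "overall")
  out <- matrix(
    fill, nrow = length(name), ncol = length(cols),
    dimnames = list(NULL, cols)
  )
  for (i in seq_along(name)) {
    keep <- nameParts[[i]] %in% cols
    out[i, nameParts[[i]][keep]] <- levelParts[[i]][keep]
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

tidyGroup <- splitPair(x$group_name, x$group_level)
tidyGroup
```

The resulting data frame has one column per field (cohort_name, sex, age_group), which is the layout shown in the TIDY FORMAT table.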
2. Add settings of the <summarised_result> object as columns:
Each <summarised_result> object has a settings attribute that relates the ‘result_id’ column to each distinct set of settings. The columns ‘result_type’, ‘package_name’ and ‘package_version’ are always present in settings, but there may be extra parameters depending on how the object was created. So in the <summarised_result> format we need to use the settings() function to see those variables:
ORIGINAL FORMAT:
settings:
result_id | my_setting | package_name |
---|---|---|
1 | TRUE | visOmopResults |
2 | FALSE | visOmopResults |
<summarised_result>:
result_id | cdm_name | ... | additional_name |
---|---|---|---|
1 | omop | ... | overall |
... | ... | ... | ... |
2 | omop | ... | overall |
... | ... | ... | ... |
In the tidy format the settings are added as columns, so their values are repeated across rows (there is only one row per result_id in settings, whereas there can be multiple rows per result_id in the <summarised_result> object). The column ‘result_id’ is dropped because it no longer provides any information. Again we lose standardisation (different objects can carry different settings), but we gain flexibility:
TIDY FORMAT:
cdm_name | ... | additional_name | my_setting | package_name |
---|---|---|---|---|
omop | ... | overall | TRUE | visOmopResults |
... | ... | ... | ... | ... |
omop | ... | overall | FALSE | visOmopResults |
... | ... | ... | ... | ... |
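Adding the settings amounts to a join on ‘result_id’. A minimal base R sketch with toy data frames follows; settingsTable, resultTable and tidySettings are hypothetical names, and this is not the code that tidy() or addSettings() run internally.

```r
# Toy stand-in for the settings attribute: one row per result_id.
settingsTable <- data.frame(
  result_id = c(1L, 2L),
  my_setting = c(TRUE, FALSE),
  package_name = "visOmopResults",
  stringsAsFactors = FALSE
)

# Toy stand-in for the result rows: several rows per result_id.
resultTable <- data.frame(
  result_id = c(1L, 1L, 2L, 2L),
  cdm_name = "omop",
  additional_name = "overall",
  stringsAsFactors = FALSE
)

# Join the settings onto every result row, then drop the key column,
# which carries no information once the settings are columns.
tidySettings <- merge(resultTable, settingsTable, by = "result_id")
tidySettings$result_id <- NULL
tidySettings
```

Each setting value is repeated once per result row that shared its result_id, which is the repetition described above.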
3. Pivot estimates as columns:
In the <summarised_result> format estimates are displayed in 3 columns:
- ‘estimate_name’ indicates the name of the estimate.
- ‘estimate_type’ indicates the type of the estimate (needed because all values are stored as character). Possible values are: numeric, integer, date, character, proportion, percentage, logical.
- ‘estimate_value’ contains the value of the estimate as <character>.
ORIGINAL FORMAT:
variable_name | estimate_name | estimate_type | estimate_value |
---|---|---|---|
number individuals | count | integer | 100 |
age | mean | numeric | 50.3 |
age | sd | numeric | 20.7 |
In the tidy format we pivot the estimates, creating a new column for each ‘estimate_name’ value. Each column is cast to its ‘estimate_type’. If the same estimate_name has more than one estimate_type, the column is not cast and is kept as character (a warning is thrown). Missing data are populated with NA.
TIDY FORMAT:
variable_name | count | mean | sd |
---|---|---|---|
number individuals | 100 | NA | NA |
age | NA | 50.3 | 20.7 |
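The pivot-and-cast step can be sketched in base R on the toy table above. This is an illustration under simplifying assumptions (only the integer and numeric types are cast, and no warning is emitted for mixed types), not the package implementation; long and wide are hypothetical names.

```r
# Toy data reproducing the ORIGINAL FORMAT table above.
long <- data.frame(
  variable_name = c("number individuals", "age", "age"),
  estimate_name = c("count", "mean", "sd"),
  estimate_type = c("integer", "numeric", "numeric"),
  estimate_value = c("100", "50.3", "20.7"),
  stringsAsFactors = FALSE
)

# One output row per variable, one new column per estimate_name.
wide <- data.frame(
  variable_name = unique(long$variable_name),
  stringsAsFactors = FALSE
)
for (nm in unique(long$estimate_name)) {
  rows <- long[long$estimate_name == nm, ]
  # look up this estimate for every variable; unmatched variables become NA
  value <- rows$estimate_value[match(wide$variable_name, rows$variable_name)]
  type <- unique(rows$estimate_type)
  # cast only when the estimate has a single type handled by this sketch;
  # otherwise keep the values as character
  wide[[nm]] <- if (length(type) != 1) {
    value
  } else if (type == "integer") {
    as.integer(value)
  } else if (type == "numeric") {
    as.numeric(value)
  } else {
    value
  }
}
wide
```

Here count ends up as an integer column and mean and sd as numeric columns, with NA wherever a variable has no such estimate.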
Example
Let’s see a simple example with some toy data:
library(visOmopResults)
result <- mockSummarisedResult()
result |>
  tidy()
#> # A tibble: 72 × 13
#> cdm_name cohort_name age_group sex variable_name variable_level count
#> <chr> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 mock cohort1 overall overall number subjects NA 9200055
#> 2 mock cohort1 <40 Male number subjects NA 4007202
#> 3 mock cohort1 >=40 Male number subjects NA 2131727
#> 4 mock cohort1 <40 Female number subjects NA 6717668
#> 5 mock cohort1 >=40 Female number subjects NA 586141
#> 6 mock cohort1 overall Male number subjects NA 9970691
#> 7 mock cohort1 overall Female number subjects NA 1490355
#> 8 mock cohort1 <40 overall number subjects NA 5185566
#> 9 mock cohort1 >=40 overall number subjects NA 8461201
#> 10 mock cohort2 overall overall number subjects NA 7182697
#> # ℹ 62 more rows
#> # ℹ 6 more variables: mean <dbl>, sd <dbl>, percentage <dbl>,
#> # result_type <chr>, package_name <chr>, package_version <chr>
Customise your tidy summarised_result
We have several functions to customise the tidy version of the <summarised_result> object.
Split
The split functions can also be used independently:
- splitGroup() only splits the group_name-group_level pair of columns.
- splitStrata() only splits the strata_name-strata_level pair of columns.
- splitAdditional() only splits the additional_name-additional_level pair of columns.
There is also the function splitAll(), which splits every x_name-x_level pair found in the data.
splitAll(result)
#> # A tibble: 126 × 10
#> result_id cdm_name cohort_name age_group sex variable_name variable_level
#> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 mock cohort1 overall overall number subje… NA
#> 2 1 mock cohort1 <40 Male number subje… NA
#> 3 1 mock cohort1 >=40 Male number subje… NA
#> 4 1 mock cohort1 <40 Female number subje… NA
#> 5 1 mock cohort1 >=40 Female number subje… NA
#> 6 1 mock cohort1 overall Male number subje… NA
#> 7 1 mock cohort1 overall Female number subje… NA
#> 8 1 mock cohort1 <40 overall number subje… NA
#> 9 1 mock cohort1 >=40 overall number subje… NA
#> 10 1 mock cohort2 overall overall number subje… NA
#> # ℹ 116 more rows
#> # ℹ 3 more variables: estimate_name <chr>, estimate_type <chr>,
#> # estimate_value <chr>
Pivot estimates
pivotEstimates() can be used to pivot the variables that we are interested in.
The argument pivotEstimatesBy specifies which variables to pivot by. There are four options:
- NULL/character() to not pivot anything.
- c("estimate_name") to pivot only estimate_name.
- c("variable_level", "estimate_name") to pivot estimate_name and variable_level.
- c("variable_name", "variable_level", "estimate_name") to pivot estimate_name, variable_level and variable_name.
Note that variable_level can contain NA values; these are ignored when the new column names are built.
pivotEstimates(
  result,
  pivotEstimatesBy = c("variable_name", "variable_level", "estimate_name")
)
#> # A tibble: 18 × 15
#> result_id cdm_name group_name group_level strata_name strata_level
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 mock cohort_name cohort1 overall overall
#> 2 1 mock cohort_name cohort1 age_group &&& sex <40 &&& Male
#> 3 1 mock cohort_name cohort1 age_group &&& sex >=40 &&& Male
#> 4 1 mock cohort_name cohort1 age_group &&& sex <40 &&& Female
#> 5 1 mock cohort_name cohort1 age_group &&& sex >=40 &&& Female
#> 6 1 mock cohort_name cohort1 sex Male
#> 7 1 mock cohort_name cohort1 sex Female
#> 8 1 mock cohort_name cohort1 age_group <40
#> 9 1 mock cohort_name cohort1 age_group >=40
#> 10 1 mock cohort_name cohort2 overall overall
#> 11 1 mock cohort_name cohort2 age_group &&& sex <40 &&& Male
#> 12 1 mock cohort_name cohort2 age_group &&& sex >=40 &&& Male
#> 13 1 mock cohort_name cohort2 age_group &&& sex <40 &&& Female
#> 14 1 mock cohort_name cohort2 age_group &&& sex >=40 &&& Female
#> 15 1 mock cohort_name cohort2 sex Male
#> 16 1 mock cohort_name cohort2 sex Female
#> 17 1 mock cohort_name cohort2 age_group <40
#> 18 1 mock cohort_name cohort2 age_group >=40
#> # ℹ 9 more variables: additional_name <chr>, additional_level <chr>,
#> # `number subjects_count` <int>, age_mean <dbl>, age_sd <dbl>,
#> # Medications_Amoxiciline_count <int>,
#> # Medications_Amoxiciline_percentage <dbl>,
#> # Medications_Ibuprofen_count <int>, Medications_Ibuprofen_percentage <dbl>
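The column names in the output above look like the non-missing pivot values joined with an underscore. A small base R sketch of that naming rule follows; the "_" separator is inferred from the printed output, and buildName() and keys are hypothetical names, not part of visOmopResults.

```r
# Three of the pivot key combinations that appear in the output above.
keys <- data.frame(
  variable_name = c("number subjects", "age", "Medications"),
  variable_level = c(NA, NA, "Amoxiciline"),
  estimate_name = c("count", "mean", "percentage"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join the non-missing parts of one row with "_".
buildName <- function(...) {
  parts <- c(...)
  paste(parts[!is.na(parts)], collapse = "_")
}

columnNames <- mapply(
  buildName,
  keys$variable_name, keys$variable_level, keys$estimate_name,
  USE.NAMES = FALSE
)
columnNames
```

Rows with an NA variable_level contribute only their variable_name and estimate_name to the name, as in age_mean.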
Add settings
addSettings() is used to add the settings that we want as new columns to our <summarised_result> object.
The settingsColumns argument is used to choose which settings to add.
addSettings(
  result,
  settingsColumns = "result_type"
)
#> # A tibble: 126 × 14
#> result_id cdm_name group_name group_level strata_name strata_level
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 mock cohort_name cohort1 overall overall
#> 2 1 mock cohort_name cohort1 age_group &&& sex <40 &&& Male
#> 3 1 mock cohort_name cohort1 age_group &&& sex >=40 &&& Male
#> 4 1 mock cohort_name cohort1 age_group &&& sex <40 &&& Female
#> 5 1 mock cohort_name cohort1 age_group &&& sex >=40 &&& Female
#> 6 1 mock cohort_name cohort1 sex Male
#> 7 1 mock cohort_name cohort1 sex Female
#> 8 1 mock cohort_name cohort1 age_group <40
#> 9 1 mock cohort_name cohort1 age_group >=40
#> 10 1 mock cohort_name cohort2 overall overall
#> # ℹ 116 more rows
#> # ℹ 8 more variables: variable_name <chr>, variable_level <chr>,
#> # estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> # additional_name <chr>, additional_level <chr>, result_type <chr>