The split and unite functions are complementary tools for manipulating dataframes in R. They are designed for summarised_result objects (see the R package omopgenerics), but they also work with dataframes of other classes.
summarised_result
First, let’s load relevant libraries and generate a mock summarised_result object to use in the following examples.
library(visOmopResults)
library(dplyr)
mock_sr <- mockSummarisedResult()
mock_sr |> glimpse()
#> Rows: 126
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_…
#> $ group_level <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ strata_name <chr> "overall", "age_group &&& sex", "age_group &&& sex", …
#> $ strata_level <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& …
#> $ variable_name <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value <chr> "807501", "8343330", "6007609", "1572084", "73994", "…
#> $ additional_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
A summarised_result contains three pairs of name-level columns, which are the targets of the unite and split functions: the group columns, which typically hold information about cohorts; the strata columns, which hold the stratification applied within each group; and the additional columns, which hold any further information not covered by group and strata.
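To see which names each of the three pairs currently holds, one option is to list the distinct combinations of the "name" columns. This is a minimal sketch using only dplyr's distinct() on the mock object; the variable name_pairs is introduced here for illustration.

```r
library(visOmopResults)
library(dplyr)

mock_sr <- mockSummarisedResult()

# One row per distinct combination of the three "name" columns
name_pairs <- mock_sr |>
  distinct(group_name, strata_name, additional_name)
name_pairs
```

Any name containing "&&&" in this listing holds more than one variable in a single pair.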
Split functions
The split functions pivot a “name” column (e.g. group_name): each distinct name becomes a column in the dataframe, and its values are taken from the corresponding “level” column (e.g. group_level).
splitGroup(), splitStrata(), and splitAdditional()
For instance, the splitGroup function targets the group_name-group_level pair, as seen below.
mock_sr |> splitGroup() |> glimpse()
#> Rows: 126
#> Columns: 12
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ cohort_name <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ strata_name <chr> "overall", "age_group &&& sex", "age_group &&& sex", …
#> $ strata_level <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& …
#> $ variable_name <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value <chr> "807501", "8343330", "6007609", "1572084", "73994", "…
#> $ additional_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
Similarly, splitStrata splits the strata_name and strata_level columns, while splitAdditional splits the additional name-level pair. Finally, the function splitAll splits group, strata, and additional at once.
Note that after using splitStrata on our summarised_result object, we no longer have a strata_name-strata_level pair; instead, we have two new columns corresponding to the stratifications, age_group and sex.
mock_sr |> splitStrata() |> glimpse()
#> Rows: 126
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_…
#> $ group_level <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ age_group <chr> "overall", "<40", ">=40", "<40", ">=40", "overall", "…
#> $ sex <chr> "overall", "Male", "Male", "Female", "Female", "Male"…
#> $ variable_name <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value <chr> "807501", "8343330", "6007609", "1572084", "73994", "…
#> $ additional_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
mock_sr |> splitAdditional() |> glimpse()
#> Rows: 126
#> Columns: 11
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock",…
#> $ group_name <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_na…
#> $ group_level <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1", …
#> $ strata_name <chr> "overall", "age_group &&& sex", "age_group &&& sex", "a…
#> $ strata_level <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& Fe…
#> $ variable_name <chr> "number subjects", "number subjects", "number subjects"…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count", "…
#> $ estimate_type <chr> "integer", "integer", "integer", "integer", "integer", …
#> $ estimate_value <chr> "807501", "8343330", "6007609", "1572084", "73994", "46…
mock_sr |> splitAll() |> glimpse()
#> Rows: 126
#> Columns: 10
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock",…
#> $ cohort_name <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1", …
#> $ age_group <chr> "overall", "<40", ">=40", "<40", ">=40", "overall", "ov…
#> $ sex <chr> "overall", "Male", "Male", "Female", "Female", "Male", …
#> $ variable_name <chr> "number subjects", "number subjects", "number subjects"…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count", "…
#> $ estimate_type <chr> "integer", "integer", "integer", "integer", "integer", …
#> $ estimate_value <chr> "807501", "8343330", "6007609", "1572084", "73994", "46…
!! Keyword: &&&
Looking at the results above, observe that the splitting was done not only by the values in the “name” column, but also on the keyword “&&&” within those values. That is, “age_group &&& sex” was split into the age_group and sex columns, instead of generating a single column called “age_group &&& sex”.
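The role of the “&&&” keyword can be illustrated with a few lines of base R. This is only a conceptual sketch, not the package's implementation: split_pair is a hypothetical helper written for this example, and it handles a single name-level pair rather than a whole dataframe.

```r
# Hypothetical helper, for illustration only: split one name-level pair
# on the " &&& " separator and return a named character vector.
split_pair <- function(name, level, sep = " &&& ") {
  keys <- strsplit(name, sep, fixed = TRUE)[[1]]
  values <- strsplit(level, sep, fixed = TRUE)[[1]]
  # Each name must have exactly one matching level
  stopifnot(length(keys) == length(values))
  setNames(values, keys)
}

split_pair("age_group &&& sex", "<40 &&& Male")
#> age_group       sex 
#>     "<40"    "Male"
```

Each element of the result corresponds to one of the columns that a split function would create for that row.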
Unite functions
The unite functions are the complement of the split ones. They generate name-level pair columns from selected columns of a dataframe.
uniteGroup(), uniteStrata(), and uniteAdditional()
To work with summarised_result objects, we have the uniteGroup, uniteStrata, and uniteAdditional functions, which generate the group, strata, and additional name-level columns, respectively, from a given set of columns. For instance, in the following example we create the group_name and group_level columns:
to_unite_group <- tibble(
  denominator_cohort_name = c("general_population", "older_than_60", "younger_than_60"),
  outcome_cohort_name = c("stroke", "stroke", "stroke")
)
to_unite_group |>
  uniteGroup(cols = c("denominator_cohort_name", "outcome_cohort_name"))
#> # A tibble: 3 × 2
#> group_name group_level
#> <chr> <chr>
#> 1 denominator_cohort_name &&& outcome_cohort_name general_population &&& stroke
#> 2 denominator_cohort_name &&& outcome_cohort_name older_than_60 &&& stroke
#> 3 denominator_cohort_name &&& outcome_cohort_name younger_than_60 &&& stroke
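The same pattern should apply to the additional pair. The sketch below assumes uniteAdditional accepts the cols argument in the same way as uniteGroup; the column names time_window and table_name and their values are hypothetical, chosen only for illustration.

```r
library(visOmopResults)
library(dplyr)

# Hypothetical columns, for illustration only
to_unite_additional <- tibble(
  time_window = c("0 to 30", "31 to 365"),
  table_name = c("drug_exposure", "drug_exposure")
)

# Collapse both columns into the additional_name-additional_level pair
united_additional <- to_unite_additional |>
  uniteAdditional(cols = c("time_window", "table_name"))
united_additional
```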
Apart from the argument specifying the columns to unite (cols), there is the argument ignore, which defaults to ignore = c(NA, "overall"). Values listed in ignore are dropped when building the name-level pair. For example, if we do not ignore any value in this case, the NA values appear in the output:
to_unite_strata <- tibble(
  age = c(NA, ">40", "<=40", NA, NA, NA, NA, NA, ">40", "<=40"),
  sex = c(NA, NA, NA, "F", "M", NA, NA, NA, "F", "M"),
  region = c(NA, NA, NA, NA, NA, "North", "South", "Center", NA, NA)
)
to_unite_strata |>
  uniteStrata(cols = c("age", "sex", "region"),
              ignore = character())
#> # A tibble: 10 × 2
#> strata_name strata_level
#> <chr> <chr>
#> 1 age &&& sex &&& region NA &&& NA &&& NA
#> 2 age &&& sex &&& region >40 &&& NA &&& NA
#> 3 age &&& sex &&& region <=40 &&& NA &&& NA
#> 4 age &&& sex &&& region NA &&& F &&& NA
#> 5 age &&& sex &&& region NA &&& M &&& NA
#> 6 age &&& sex &&& region NA &&& NA &&& North
#> 7 age &&& sex &&& region NA &&& NA &&& South
#> 8 age &&& sex &&& region NA &&& NA &&& Center
#> 9 age &&& sex &&& region >40 &&& F &&& NA
#> 10 age &&& sex &&& region <=40 &&& M &&& NA
With the default (ignore = c(NA, "overall")), only the names and levels of non-NA values are returned, and rows where all values are NA are given “overall”.
to_unite_strata |>
  uniteStrata(cols = c("age", "sex", "region"))
#> # A tibble: 10 × 2
#> strata_name strata_level
#> <chr> <chr>
#> 1 overall overall
#> 2 age >40
#> 3 age <=40
#> 4 sex F
#> 5 sex M
#> 6 region North
#> 7 region South
#> 8 region Center
#> 9 age &&& sex >40 &&& F
#> 10 age &&& sex <=40 &&& M
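Because split and unite are complementary, uniting a set of columns and then splitting the resulting pair should bring back one column per stratum. The following is a minimal sketch of that round trip on a small tibble; the object names united and round_trip are introduced here for illustration. Note that the round trip is not expected to be exact: as in the splitStrata output shown earlier, ignored values come back as "overall" rather than NA.

```r
library(visOmopResults)
library(dplyr)

to_unite_strata <- tibble(
  age = c(NA, ">40", "<=40"),
  sex = c(NA, "F", "M")
)

# Build the strata_name-strata_level pair (NA values are ignored by default)
united <- to_unite_strata |>
  uniteStrata(cols = c("age", "sex"))

# Splitting the united pair gives back one column per stratum
round_trip <- united |>
  splitStrata()
round_trip
```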