split and unite functions • visOmopResults

Split and unite are complementary sets of functions for manipulating data frames in R. They are designed for summarised_result objects (see the R package omopgenerics), but they also work on data frames of other classes.

summarised_result

First, let’s load relevant libraries and generate a mock summarised_result object to use in the following examples.

library(visOmopResults)
library(dplyr)
mock_sr <- mockSummarisedResult()
mock_sr |> glimpse()
#> Rows: 126
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name         <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name       <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_…
#> $ group_level      <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ strata_name      <chr> "overall", "age_group &&& sex", "age_group &&& sex", …
#> $ strata_level     <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& …
#> $ variable_name    <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name    <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value   <chr> "807501", "8343330", "6007609", "1572084", "73994", "…
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…

A summarised_result contains three pairs of name-level columns, which are the targets of the unite and split functions: the group columns, which typically contain information about cohorts; the strata columns, which describe the stratification within each group; and the additional columns, which hold further information not covered by group and strata.

Split functions

The split functions pivot a “name” column (e.g. group_name): each distinct value of that column becomes a new column in the data frame, filled with the corresponding values of the “level” column (e.g. group_level).
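To make the pivot concrete, here is a minimal base R sketch of the idea on a toy data frame. The object names (toy, split_toy) are hypothetical, and this only illustrates the concept; it is not how visOmopResults implements the split functions.

```r
# Toy data: a single name-level pair, shaped like the group columns of mock_sr.
toy <- data.frame(
  group_name     = c("cohort_name", "cohort_name"),
  group_level    = c("cohort1", "cohort2"),
  estimate_value = c("10", "20"),
  stringsAsFactors = FALSE
)

# Start from the columns that are not part of the name-level pair.
split_toy <- toy[, "estimate_value", drop = FALSE]

# Each distinct value of the "name" column becomes a new column,
# filled with the matching value of the "level" column.
for (col in unique(toy$group_name)) {
  split_toy[[col]] <- ifelse(toy$group_name == col, toy$group_level, NA_character_)
}

print(split_toy)
```

The group_name and group_level columns are gone, and a cohort_name column holds the values that used to live in group_level.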

splitGroup(), splitStrata(), and splitAdditional()

For instance, the splitGroup function will target the group_name-group_level columns as seen below.

mock_sr |> splitGroup() |> glimpse()
#> Rows: 126
#> Columns: 12
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name         <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ cohort_name      <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ strata_name      <chr> "overall", "age_group &&& sex", "age_group &&& sex", …
#> $ strata_level     <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& …
#> $ variable_name    <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name    <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value   <chr> "807501", "8343330", "6007609", "1572084", "73994", "…
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…

Similarly, splitStrata will split the strata_name and strata_level columns, while splitAdditional will split the additional name-level pair. Finally, splitAll will split group, strata, and additional at once. Note that after using splitStrata on our summarised_result object, we no longer have a strata_name-strata_level pair; instead we have two new columns corresponding to the stratifications, age_group and sex.

mock_sr |> splitStrata() |> glimpse()
#> Rows: 126
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name         <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name       <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_…
#> $ group_level      <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ age_group        <chr> "overall", "<40", ">=40", "<40", ">=40", "overall", "…
#> $ sex              <chr> "overall", "Male", "Male", "Female", "Female", "Male"…
#> $ variable_name    <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name    <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value   <chr> "807501", "8343330", "6007609", "1572084", "73994", "…
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
mock_sr |> splitAdditional() |> glimpse()
#> Rows: 126
#> Columns: 11
#> $ result_id      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ cdm_name       <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock",…
#> $ group_name     <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_na…
#> $ group_level    <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1", …
#> $ strata_name    <chr> "overall", "age_group &&& sex", "age_group &&& sex", "a…
#> $ strata_level   <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& Fe…
#> $ variable_name  <chr> "number subjects", "number subjects", "number subjects"…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ estimate_name  <chr> "count", "count", "count", "count", "count", "count", "…
#> $ estimate_type  <chr> "integer", "integer", "integer", "integer", "integer", …
#> $ estimate_value <chr> "807501", "8343330", "6007609", "1572084", "73994", "46…
mock_sr |> splitAll() |> glimpse()
#> Rows: 126
#> Columns: 10
#> $ result_id      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ cdm_name       <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock",…
#> $ cohort_name    <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1", …
#> $ age_group      <chr> "overall", "<40", ">=40", "<40", ">=40", "overall", "ov…
#> $ sex            <chr> "overall", "Male", "Male", "Female", "Female", "Male", …
#> $ variable_name  <chr> "number subjects", "number subjects", "number subjects"…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ estimate_name  <chr> "count", "count", "count", "count", "count", "count", "…
#> $ estimate_type  <chr> "integer", "integer", "integer", "integer", "integer", …
#> $ estimate_value <chr> "807501", "8343330", "6007609", "1572084", "73994", "46…

!! Keyword: &&&

Looking at the results above, observe that the splitting was done not only by the values in the “name” column but also within values containing the keyword “&&&”. That is, “age_group &&& sex” was split into the age_group and sex columns, instead of generating a single column called “age_group &&& sex”.
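The role of the keyword can be sketched with base R's strsplit. This is only an illustration of how a combined name and its level line up position by position, assuming the separator is the literal string " &&& " as it appears in the output; the variable names are hypothetical and the package may implement this differently.

```r
# One strata_name / strata_level pair taken from the mock output.
strata_name  <- "age_group &&& sex"
strata_level <- "<40 &&& Male"

# fixed = TRUE treats the separator as a literal string, not a regular expression.
new_columns <- strsplit(strata_name,  " &&& ", fixed = TRUE)[[1]]
new_values  <- strsplit(strata_level, " &&& ", fixed = TRUE)[[1]]

# The i-th piece of the name is paired with the i-th piece of the level.
paired <- setNames(new_values, new_columns)
print(paired)
```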

Unite functions

The unite functions are the complement of the split ones. They generate name-level column pairs from the targeted columns of a data frame.

uniteGroup(), uniteStrata(), and uniteAdditional()

To work with summarised_result objects, we have the uniteGroup, uniteStrata, and uniteAdditional functions which will generate the group, strata, and additional name-level columns respectively from a given set of columns. For instance, in the following example we want to create the group_name and group_level columns:

to_unite_group <- tibble(
  denominator_cohort_name = c("general_population", "older_than_60", "younger_than_60"),
  outcome_cohort_name = c("stroke", "stroke", "stroke")
)

to_unite_group |>
  uniteGroup(cols = c("denominator_cohort_name", "outcome_cohort_name"))
#> # A tibble: 3 × 2
#>   group_name                                      group_level                  
#>   <chr>                                           <chr>                        
#> 1 denominator_cohort_name &&& outcome_cohort_name general_population &&& stroke
#> 2 denominator_cohort_name &&& outcome_cohort_name older_than_60 &&& stroke     
#> 3 denominator_cohort_name &&& outcome_cohort_name younger_than_60 &&& stroke
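For intuition, the same result can be reproduced with a few lines of base R. This is a simplified sketch that assumes every column is united on every row (no values ignored); the object name united is hypothetical and this is not the package's implementation.

```r
to_unite_group <- data.frame(
  denominator_cohort_name = c("general_population", "older_than_60", "younger_than_60"),
  outcome_cohort_name     = c("stroke", "stroke", "stroke"),
  stringsAsFactors = FALSE
)
cols <- c("denominator_cohort_name", "outcome_cohort_name")

united <- data.frame(
  # The column names are joined once and repeated on every row.
  group_name = rep(paste(cols, collapse = " &&& "), nrow(to_unite_group)),
  # The values are joined row by row, in the same order as the names.
  group_level = do.call(
    paste,
    c(unname(as.list(to_unite_group[cols])), sep = " &&& ")
  ),
  stringsAsFactors = FALSE
)

print(united)
```

Keeping names and levels in the same order is what allows a later split to match each level back to its column.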

Apart from the cols argument, which selects the columns to unite, there is the ignore argument, which defaults to ignore = c(NA, "overall"). Levels listed in ignore are dropped when uniting. For example, if we do not ignore them in this case, the NA values appear in the output:

to_unite_strata <- tibble(
    age = c(NA, ">40", "<=40", NA, NA, NA, NA, NA, ">40", "<=40"),
    sex = c(NA, NA, NA, "F", "M", NA, NA, NA, "F", "M"),
    region = c(NA, NA, NA, NA, NA, "North", "South", "Center", NA, NA)
  )

to_unite_strata |>
  uniteStrata(cols = c("age", "sex", "region"),
              ignore = character())
#> # A tibble: 10 × 2
#>    strata_name            strata_level        
#>    <chr>                  <chr>               
#>  1 age &&& sex &&& region NA &&& NA &&& NA    
#>  2 age &&& sex &&& region >40 &&& NA &&& NA   
#>  3 age &&& sex &&& region <=40 &&& NA &&& NA  
#>  4 age &&& sex &&& region NA &&& F &&& NA     
#>  5 age &&& sex &&& region NA &&& M &&& NA     
#>  6 age &&& sex &&& region NA &&& NA &&& North 
#>  7 age &&& sex &&& region NA &&& NA &&& South 
#>  8 age &&& sex &&& region NA &&& NA &&& Center
#>  9 age &&& sex &&& region >40 &&& F &&& NA    
#> 10 age &&& sex &&& region <=40 &&& M &&& NA

By default (ignore = c(NA, "overall")), only the names and levels of non-NA values are returned, and rows where all values are NA are set to “overall”.

to_unite_strata |>
  uniteStrata(cols = c("age", "sex", "region"))
#> # A tibble: 10 × 2
#>    strata_name strata_level
#>    <chr>       <chr>       
#>  1 overall     overall     
#>  2 age         >40         
#>  3 age         <=40        
#>  4 sex         F           
#>  5 sex         M           
#>  6 region      North       
#>  7 region      South       
#>  8 region      Center      
#>  9 age &&& sex >40 &&& F   
#> 10 age &&& sex <=40 &&& M
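The effect of ignore on a single row can be sketched in base R as follows. The helper unite_row is hypothetical and written only to mirror the behaviour shown in the two outputs above; it is not part of visOmopResults and the package may implement this differently.

```r
# Hypothetical helper: unite one row of named values into a name-level pair.
unite_row <- function(values, ignore = c(NA, "overall"), sep = " &&& ") {
  # Keep only the values that are not listed in `ignore`.
  keep <- !(values %in% ignore)
  # If nothing is left, fall back to the "overall" placeholder.
  if (!any(keep)) {
    return(c(name = "overall", level = "overall"))
  }
  c(
    name  = paste(names(values)[keep], collapse = sep),
    level = paste(values[keep], collapse = sep)
  )
}

# Row 9 of to_unite_strata.
row_9 <- c(age = ">40", sex = "F", region = NA)

default_result <- unite_row(row_9)                        # region is dropped
keep_all       <- unite_row(row_9, ignore = character())  # NA is kept as text

print(default_result)
print(keep_all)
```

With the default, the NA region is dropped from both the name and the level, which is why row 9 reads age &&& sex rather than age &&& sex &&& region.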