Skip to contents

This function provides DBMS independent syntax for quantile estimation. Some database systems do not have a quantile function. The SQL generated by summarizeQuantile2 should work on all supported database systems. This function can be added to a dplyr pipeline and adds an additional query to the input. No computation is triggered by summarizeQuantile2 if the input is a tbl reference to a database table.

Usage

summariseQuantile2(.data, x, probs, nameSuffix = "{x}")

Arguments

.data

lazy data frame backed by a database query created by dplyr::tbl().

x

A string vector of column names whose sample quantiles are wanted.

probs

A numeric vector of probabilities with values in [0,1].

nameSuffix

A single character character string, evaluated by glue::glue() that is appended to numerical quantile value as a column name part.

Value

A lazy query with quantile calculation added. The result (after computation) will have one row per combination of grouping variables and one column for every variable/quantile combination. (see examples)

Details

Implemented quantiles estimation algorithm returns values analogous to quantile{stats} with argument type = 1. See discussion in Hyndman and Fan (1996). Results differ from PERCENTILE_CONT natively implemented in various DBMS, where returned values are equal to quantile{stats} with default argument type = 7

[Experimental]

Examples

if (FALSE) { # \dontrun{
con <- DBI::dbConnect(duckdb::duckdb())
mtcars_tbl <- dplyr::copy_to(con, mtcars, name = "tmp", overwrite = TRUE, temporary = TRUE)

# quantiles for a single column
mtcars_tbl %>%
  dplyr::group_by(cyl) %>%
  dplyr::mutate(mean = mean(mpg, na.rm = TRUE)) %>%
  summariseQuantile2("mpg", probs = c(0, 0.2, 0.4, 0.6, 0.8, 1),  nameSuffix = "quant") %>%
  dplyr::collect()

#> cyl  p0_quant p20_quant p40_quant p60_quant p80_quant p100_quant
#>   6      17.8      18.1      19.2      21        21         21.4
#>   8      10.4      13.3      15        15.5      17.3       19.2
#>   4      21.4      22.8      24.4      27.3      30.4       33.9

# multiple columns
mtcars_tbl %>%
  dplyr::group_by(cyl) %>%
  dplyr::mutate(mean = mean(mpg, na.rm = TRUE)) %>%
  summariseQuantile2(c("mpg", "hp", "wt"), probs = c(0.2, 0.8),  nameSuffix = "{x}_quant") %>%
  dplyr::collect()

#>  cyl p20_mpg_quant p80_mpg_quant p20_hp_quant p80_hp_quant p20_wt_quant p80_wt_quant
#>    4          22.8          30.4           65           97         1.84         2.78
#>    6          18.1          21            110          123         2.77         3.44
#>    8          13.3          17.3          175          245         3.44         5.25

DBI::dbDisconnect(con, shutdown = TRUE)
} # }