This function provides DBMS independent syntax for quantile estimation. Some
database systems do not have a quantile
function. The SQL generated by
summarizeQuantile2
should work on all supported database systems. This function
can be added to a dplyr pipeline and adds an additional query to the input.
No computation is triggered by summarizeQuantile2
if the input is a tbl
reference to a database table.
Arguments
- .data
lazy data frame backed by a database query created by
dplyr::tbl()
.- x
A string vector of column names whose sample quantiles are wanted.
- probs
A numeric vector of probabilities with values in [0,1].
- nameSuffix
A single character character string, evaluated by
glue::glue()
that is appended to numerical quantile value as a column name part.
Value
A lazy query with quantile calculation added. The result (after computation) will have one row per combination of grouping variables and one column for every variable/quantile combination. (see examples)
Details
Implemented quantiles estimation algorithm returns values analogous to
quantile{stats}
with argument type = 1
.
See discussion in Hyndman and Fan (1996).
Results differ from PERCENTILE_CONT
natively implemented in various DBMS,
where returned values are equal to quantile{stats}
with default argument type = 7
Examples
if (FALSE) { # \dontrun{
con <- DBI::dbConnect(duckdb::duckdb())
mtcars_tbl <- dplyr::copy_to(con, mtcars, name = "tmp", overwrite = TRUE, temporary = TRUE)
# quantiles for a single column
mtcars_tbl %>%
dplyr::group_by(cyl) %>%
dplyr::mutate(mean = mean(mpg, na.rm = TRUE)) %>%
summariseQuantile2("mpg", probs = c(0, 0.2, 0.4, 0.6, 0.8, 1), nameSuffix = "quant") %>%
dplyr::collect()
#> cyl p0_quant p20_quant p40_quant p60_quant p80_quant p100_quant
#> 6 17.8 18.1 19.2 21 21 21.4
#> 8 10.4 13.3 15 15.5 17.3 19.2
#> 4 21.4 22.8 24.4 27.3 30.4 33.9
# multiple columns
mtcars_tbl %>%
dplyr::group_by(cyl) %>%
dplyr::mutate(mean = mean(mpg, na.rm = TRUE)) %>%
summariseQuantile2(c("mpg", "hp", "wt"), probs = c(0.2, 0.8), nameSuffix = "{x}_quant") %>%
dplyr::collect()
#> cyl p20_mpg_quant p80_mpg_quant p20_hp_quant p80_hp_quant p20_wt_quant p80_wt_quant
#> 4 22.8 30.4 65 97 1.84 2.78
#> 6 18.1 21 110 123 2.77 3.44
#> 8 13.3 17.3 175 245 3.44 5.25
DBI::dbDisconnect(con, shutdown = TRUE)
} # }