Optimize unique and count group numbers operations #5740

Yunuuuu · 2023-11-14T05:08:10Z

For larger dataset, it appears that dplyr outperforms when counting the unique number by group. This conclusion is supported by a visual comparison available at the following source: https://stackoverflow.com/questions/12840294/counting-unique-distinct-values-by-group-in-a-data-frame/77478140.

I also compared these operations in large dataset, it seems dplyr_unique_then_count does best

df <- mtcars
dt_32M <- data.table::rbindlist(replicate(1e6, mtcars, simplify = FALSE))
df_32M <- as.data.frame(dt_32M)
bench::mark(
  base_32M = aggregate(hp ~ cyl, df_32M, function(x) length(unique(x))),
  dplyr_length_unique = dplyr::summarise(df_32M,
    count = length(unique(hp)),
    .by = cyl
  ),
  dplyr_n_distinct = dplyr::summarise(df_32M,
    count = dplyr::n_distinct(hp),
    .by = cyl
  ),
  dplyr_unique_then_count = dplyr::summarise(
    dplyr::distinct(df_32M, hp, cyl),
    count = dplyr::n(),
    .by = cyl
  ),
  dt_length_unique = dt_32M[, .(count = length(unique(hp))), keyby = cyl],
  dt_uniqueN = dt_32M[, .(count = data.table::uniqueN(hp)), keyby = cyl],
  dt_unique_then_count = {
    dt <- unique(dt_32M, by = c("hp", "cyl"), cols = character())
    dt[, list(count = .N), keyby = "cyl"]
  },
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 7 × 6
#>   expression                   min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base_32M                   13.1s    13.1s    0.0762    8.98GB    0.229
#> 2 dplyr_length_unique      695.3ms  695.3ms    1.44      1.28GB    1.44 
#> 3 dplyr_n_distinct         655.6ms  655.6ms    1.53      1.28GB    0    
#> 4 dplyr_unique_then_count  375.2ms  375.2ms    2.66    378.36MB    0    
#> 5 dt_length_unique         508.3ms  508.3ms    1.97    777.89MB    1.97 
#> 6 dt_uniqueN               549.4ms  549.4ms    1.82    457.83MB    0    
#> 7 dt_unique_then_count     469.3ms  473.4ms    2.11     122.2MB    0

^{Created on 2023-11-14 with reprex v2.0.2}

Let us delve into this matter and begin by examining the unique operations. For un-keyed data.table, it appears that dplyr::distinct is a superior choice. There are instances when preserving the original data is imperative; hence, I opt to use unique directly rather than employing setkey initially, as the latter modifies the original data.

df <- mtcars
dt_32M <- data.table::rbindlist(replicate(1e6, mtcars, simplify = FALSE))
df_32M <- as.data.frame(dt_32M)
dt2_32M <- data.table::copy(dt_32M)
bench::mark(
  dplyr_unique = dplyr::distinct(df_32M, hp, cyl),
  dt_unique = unique(dt_32M, by = c("hp", "cyl"), cols = character()),
  dt_unique2 = {
    data.table::setkeyv(dt2_32M, c("hp", "cyl"))
    unique(dt2_32M, by = c("hp", "cyl"), cols = character())
  },
  dt_unique3 = {
    data <- data.table::copy(dt_32M)
    data.table::setkeyv(data, c("hp", "cyl"))
    unique(data, by = c("hp", "cyl"), cols = character())
  },
  dt_unique4 = {
    data <- dt_32M[, c("hp", "cyl")]
    data.table::setkeyv(data, c("hp", "cyl"))
    unique(data, by = c("hp", "cyl"), cols = character())
  },
  check = FALSE,
  # we set max_iterations = 1L, since `setkeyv` in `dt_unique2` will not do
  # something expensive for the same keyed data.table.
  max_iterations = 1L
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 5 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr_unique 373.15ms 373.15ms     2.68   380.94MB    0
#> 2 dt_unique    471.91ms 471.91ms     2.12   122.19MB    0
#> 3 dt_unique2   358.92ms 358.92ms     2.79   518.92MB    0
#> 4 dt_unique3      2.01s    2.01s     0.498    3.13GB    0
#> 5 dt_unique4      1.09s    1.09s     0.920 1007.16MB    0.920

^{Created on 2023-11-14 with reprex v2.0.2}

Next, let us proceed to compare the count operations, specifically dplyr::n() versus data.table::.N. It should be noted that while the execution time of data.table_n is shorter, the memory allocation is larger.

df <- mtcars
dt_32M <- data.table::rbindlist(replicate(1e6, mtcars, simplify = FALSE))
df_32M <- as.data.frame(dt_32M)
bench::mark(
  dplyr_n = dplyr::summarise(df_32M, dplyr::n(), .by = cyl),
  data.table_n = dt_32M[, .N, keyby = "cyl"],
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr_n         344ms    367ms      2.72     627MB     0
#> 2 data.table_n    296ms    296ms      3.37     855MB     3.37

^{Created on 2023-11-14 with reprex v2.0.2}
~

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16)
#>  os       Ubuntu 22.04.3 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       Asia/Shanghai
#>  date     2023-11-14
#>  pandoc   2.9.2.1 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bench         1.1.3   2023-05-04 [1] CRAN (R 4.3.1)
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.1)
#>  data.table    1.14.9  2023-11-03 [1] Github (Rdatatable/data.table@e6076b0)
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
#>  dplyr         1.1.2   2023-04-20 [1] CRAN (R 4.3.1)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.1)
#>  fansi         1.0.4   2023-01-22 [1] CRAN (R 4.3.1)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
#>  fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.1)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.1)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.1)
#>  htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.3.1)
#>  knitr         1.43    2023-05-25 [1] CRAN (R 4.3.1)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.1)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.1)
#>  profmem       0.6.0   2020-12-13 [1] CRAN (R 4.3.1)
#>  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.3.1)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.1)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.1)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.3.1)
#>  R.utils       2.12.2  2022-11-11 [1] CRAN (R 4.3.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.3.1)
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.1)
#>  rmarkdown     2.23    2023-07-01 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
#>  styler        1.10.1  2023-06-05 [1] CRAN (R 4.3.1)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.3.1)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.1)
#>  utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.1)
#>  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.1)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.1)
#>  xfun          0.39    2023-04-20 [1] CRAN (R 4.3.1)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.1)
#> 
#>  [1] /home/yun/Rlibrary/4.3
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

The text was updated successfully, but these errors were encountered:

MichaelChirico · 2023-11-14T05:20:53Z

Thanks for the report. Please note that uniqueN() has a by= argument:

data.table::uniqueN(dt_32M, by = "cyl")

#1120 tracks the request to optimize this in more complex queries with GForce.

Yunuuuu · 2023-11-14T05:55:25Z

Thank you for the clarification. For the purpose of counting the unique number of hp grouped by cyl the function uniqueN may not be the most suitable option as it only counts the unique number of values, not by groups. Can unique be optimized for unkeyed data.table, given that it does not seem to perform better than distinct?

jangorecki · 2023-11-14T07:50:40Z

As Michael pointed out, your issue is asking for guniqueN, which is already filled in #1120. Please subscribe to issue there.

tdhock · 2023-11-14T19:43:33Z

@DorisAmoakohene @Anirban166 please consider this for performance testing.

jangorecki closed this as completed Nov 14, 2023

jangorecki added the duplicate label Nov 14, 2023

tdhock added benchmark performance labels Nov 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize unique and count group numbers operations #5740

Optimize unique and count group numbers operations #5740

Yunuuuu commented Nov 14, 2023 •

edited

Loading

MichaelChirico commented Nov 14, 2023

Yunuuuu commented Nov 14, 2023

jangorecki commented Nov 14, 2023

tdhock commented Nov 14, 2023

Optimize unique and count group numbers operations #5740

Optimize unique and count group numbers operations #5740

Comments

Yunuuuu commented Nov 14, 2023 • edited Loading

MichaelChirico commented Nov 14, 2023

Yunuuuu commented Nov 14, 2023

jangorecki commented Nov 14, 2023

tdhock commented Nov 14, 2023

Yunuuuu commented Nov 14, 2023 •

edited

Loading