Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize unique and count group numbers operations #5740

Closed
Yunuuuu opened this issue Nov 14, 2023 · 4 comments
Closed

Optimize unique and count group numbers operations #5740

Yunuuuu opened this issue Nov 14, 2023 · 4 comments

Comments

@Yunuuuu
Copy link

Yunuuuu commented Nov 14, 2023

For larger dataset, it appears that dplyr outperforms when counting the unique number by group. This conclusion is supported by a visual comparison available at the following source: https://stackoverflow.com/questions/12840294/counting-unique-distinct-values-by-group-in-a-data-frame/77478140.

image

I also compared these operations in large dataset, it seems dplyr_unique_then_count does best

df <- mtcars
dt_32M <- data.table::rbindlist(replicate(1e6, mtcars, simplify = FALSE))
df_32M <- as.data.frame(dt_32M)
bench::mark(
  base_32M = aggregate(hp ~ cyl, df_32M, function(x) length(unique(x))),
  dplyr_length_unique = dplyr::summarise(df_32M,
    count = length(unique(hp)),
    .by = cyl
  ),
  dplyr_n_distinct = dplyr::summarise(df_32M,
    count = dplyr::n_distinct(hp),
    .by = cyl
  ),
  dplyr_unique_then_count = dplyr::summarise(
    dplyr::distinct(df_32M, hp, cyl),
    count = dplyr::n(),
    .by = cyl
  ),
  dt_length_unique = dt_32M[, .(count = length(unique(hp))), keyby = cyl],
  dt_uniqueN = dt_32M[, .(count = data.table::uniqueN(hp)), keyby = cyl],
  dt_unique_then_count = {
    dt <- unique(dt_32M, by = c("hp", "cyl"), cols = character())
    dt[, list(count = .N), keyby = "cyl"]
  },
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 7 × 6
#>   expression                   min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base_32M                   13.1s    13.1s    0.0762    8.98GB    0.229
#> 2 dplyr_length_unique      695.3ms  695.3ms    1.44      1.28GB    1.44 
#> 3 dplyr_n_distinct         655.6ms  655.6ms    1.53      1.28GB    0    
#> 4 dplyr_unique_then_count  375.2ms  375.2ms    2.66    378.36MB    0    
#> 5 dt_length_unique         508.3ms  508.3ms    1.97    777.89MB    1.97 
#> 6 dt_uniqueN               549.4ms  549.4ms    1.82    457.83MB    0    
#> 7 dt_unique_then_count     469.3ms  473.4ms    2.11     122.2MB    0

Created on 2023-11-14 with reprex v2.0.2

Let us delve into this matter and begin by examining the unique operations. For un-keyed data.table, it appears that dplyr::distinct is a superior choice. There are instances when preserving the original data is imperative; hence, I opt to use unique directly rather than employing setkey initially, as the latter modifies the original data.

df <- mtcars
dt_32M <- data.table::rbindlist(replicate(1e6, mtcars, simplify = FALSE))
df_32M <- as.data.frame(dt_32M)
dt2_32M <- data.table::copy(dt_32M)
bench::mark(
  dplyr_unique = dplyr::distinct(df_32M, hp, cyl),
  dt_unique = unique(dt_32M, by = c("hp", "cyl"), cols = character()),
  dt_unique2 = {
    data.table::setkeyv(dt2_32M, c("hp", "cyl"))
    unique(dt2_32M, by = c("hp", "cyl"), cols = character())
  },
  dt_unique3 = {
    data <- data.table::copy(dt_32M)
    data.table::setkeyv(data, c("hp", "cyl"))
    unique(data, by = c("hp", "cyl"), cols = character())
  },
  dt_unique4 = {
    data <- dt_32M[, c("hp", "cyl")]
    data.table::setkeyv(data, c("hp", "cyl"))
    unique(data, by = c("hp", "cyl"), cols = character())
  },
  check = FALSE,
  # we set max_iterations = 1L, since `setkeyv` in `dt_unique2` will not do
  # something expensive for the same keyed data.table.
  max_iterations = 1L
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 5 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr_unique 373.15ms 373.15ms     2.68   380.94MB    0
#> 2 dt_unique    471.91ms 471.91ms     2.12   122.19MB    0
#> 3 dt_unique2   358.92ms 358.92ms     2.79   518.92MB    0
#> 4 dt_unique3      2.01s    2.01s     0.498    3.13GB    0
#> 5 dt_unique4      1.09s    1.09s     0.920 1007.16MB    0.920

Created on 2023-11-14 with reprex v2.0.2

Next, let us proceed to compare the count operations, specifically dplyr::n() versus data.table::.N. It should be noted that while the execution time of data.table_n is shorter, the memory allocation is larger.

df <- mtcars
dt_32M <- data.table::rbindlist(replicate(1e6, mtcars, simplify = FALSE))
df_32M <- as.data.frame(dt_32M)
bench::mark(
  dplyr_n = dplyr::summarise(df_32M, dplyr::n(), .by = cyl),
  data.table_n = dt_32M[, .N, keyby = "cyl"],
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr_n         344ms    367ms      2.72     627MB     0
#> 2 data.table_n    296ms    296ms      3.37     855MB     3.37

Created on 2023-11-14 with reprex v2.0.2
~

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16)
#>  os       Ubuntu 22.04.3 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       Asia/Shanghai
#>  date     2023-11-14
#>  pandoc   2.9.2.1 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bench         1.1.3   2023-05-04 [1] CRAN (R 4.3.1)
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.1)
#>  data.table    1.14.9  2023-11-03 [1] Github (Rdatatable/data.table@e6076b0)
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
#>  dplyr         1.1.2   2023-04-20 [1] CRAN (R 4.3.1)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.1)
#>  fansi         1.0.4   2023-01-22 [1] CRAN (R 4.3.1)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
#>  fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.1)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.1)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.1)
#>  htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.3.1)
#>  knitr         1.43    2023-05-25 [1] CRAN (R 4.3.1)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.1)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.1)
#>  profmem       0.6.0   2020-12-13 [1] CRAN (R 4.3.1)
#>  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.3.1)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.1)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.1)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.3.1)
#>  R.utils       2.12.2  2022-11-11 [1] CRAN (R 4.3.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.3.1)
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.1)
#>  rmarkdown     2.23    2023-07-01 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
#>  styler        1.10.1  2023-06-05 [1] CRAN (R 4.3.1)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.3.1)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.1)
#>  utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.1)
#>  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.1)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.1)
#>  xfun          0.39    2023-04-20 [1] CRAN (R 4.3.1)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.1)
#> 
#>  [1] /home/yun/Rlibrary/4.3
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@MichaelChirico
Copy link
Member

Thanks for the report. Please note that uniqueN() has a by= argument:

data.table::uniqueN(dt_32M, by = "cyl")

#1120 tracks the request to optimize this in more complex queries with GForce.

@Yunuuuu
Copy link
Author

Yunuuuu commented Nov 14, 2023

Thank you for the clarification. For the purpose of counting the unique number of hp grouped by cyl the function uniqueN may not be the most suitable option as it only counts the unique number of values, not by groups. Can unique be optimized for unkeyed data.table, given that it does not seem to perform better than distinct?

@jangorecki
Copy link
Member

As Michael pointed out, your issue is asking for guniqueN, which is already filled in #1120. Please subscribe to issue there.

@tdhock
Copy link
Member

tdhock commented Nov 14, 2023

@DorisAmoakohene @Anirban166 please consider this for performance testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants