Let's start with the `unique` operations. For an un-keyed data.table, `dplyr::distinct` appears to be the better choice. When preserving the original data matters, I use `unique` directly rather than calling `setkey` first, since `setkey` modifies the original data.table.
```r
df <- mtcars
dt_32M <- data.table::rbindlist(replicate(1e6, mtcars, simplify = FALSE))
df_32M <- as.data.frame(dt_32M)
dt2_32M <- data.table::copy(dt_32M)

bench::mark(
  dplyr_unique = dplyr::distinct(df_32M, hp, cyl),
  dt_unique = unique(dt_32M, by = c("hp", "cyl"), cols = character()),
  dt_unique2 = {
    data.table::setkeyv(dt2_32M, c("hp", "cyl"))
    unique(dt2_32M, by = c("hp", "cyl"), cols = character())
  },
  dt_unique3 = {
    data <- data.table::copy(dt_32M)
    data.table::setkeyv(data, c("hp", "cyl"))
    unique(data, by = c("hp", "cyl"), cols = character())
  },
  dt_unique4 = {
    data <- dt_32M[, c("hp", "cyl")]
    data.table::setkeyv(data, c("hp", "cyl"))
    unique(data, by = c("hp", "cyl"), cols = character())
  },
  check = FALSE,
  # we set max_iterations = 1L, since `setkeyv` in `dt_unique2` will not do
  # something expensive for the same keyed data.table.
  max_iterations = 1L
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 5 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr_unique 373.15ms 373.15ms     2.68   380.94MB    0
#> 2 dt_unique    471.91ms 471.91ms     2.12   122.19MB    0
#> 3 dt_unique2   358.92ms 358.92ms     2.79   518.92MB    0
#> 4 dt_unique3      2.01s    2.01s     0.498     3.13GB   0
#> 5 dt_unique4      1.09s    1.09s     0.920 1007.16MB    0.920
```
Next, let's compare the count operations: `dplyr::n()` versus `data.table::.N`. Note that while `data.table_n` has a shorter execution time, its memory allocation is larger.
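The count benchmark itself isn't shown in this chunk; a minimal sketch of the two idioms, run on the small `mtcars` data rather than the 32M-row table (variable names here are illustrative):

```r
library(data.table)

dt <- as.data.table(mtcars)

# dplyr: count() is shorthand for group_by() + summarise(n = n())
dplyr_counts <- dplyr::count(mtcars, cyl)

# data.table: the special symbol .N gives the number of rows in each group
dt_counts <- dt[, .N, by = cyl]

dt_counts
```

Both return one row per `cyl` group with the group size.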
Thank you for the clarification. For counting the unique number of `hp` grouped by `cyl`, `uniqueN` on its own may not be the most suitable option, as it counts unique values over the whole column rather than by group. Can `unique` be optimized for an un-keyed data.table? It does not seem to perform better than `distinct`.
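For reference, `uniqueN` can still be computed per group by placing it in `j` together with `by =`; a small sketch on `mtcars`:

```r
library(data.table)

dt <- as.data.table(mtcars)

# On its own, uniqueN() counts distinct values over the whole column
n_overall <- uniqueN(dt$hp)

# Inside j with by =, it counts distinct hp values within each cyl group
n_by_group <- dt[, .(n_hp = uniqueN(hp)), by = cyl]

n_by_group
```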
For larger datasets, dplyr appears to outperform when counting unique values by group. This is supported by a visual comparison at the following source: https://stackoverflow.com/questions/12840294/counting-unique-distinct-values-by-group-in-a-data-frame/77478140.
I also compared these operations on a large dataset; `dplyr_unique_then_count` seems to do best.

Created on 2023-11-14 with reprex v2.0.2
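A sketch of the distinct-then-count idiom (the `dplyr_unique_then_count` name comes from the linked benchmark; I'm assuming it refers to this pattern), again on the small `mtcars` data:

```r
library(dplyr)

# Drop duplicate (cyl, hp) pairs first, then count the remaining
# rows per cyl group: this yields the number of unique hp per cyl
res <- mtcars |>
  distinct(cyl, hp) |>
  count(cyl, name = "n_unique_hp")

res
```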