implement guniqueN #1120

Open
jangorecki opened this issue Apr 16, 2015 · 11 comments
Labels: benchmark, enhancement, GForce (issues relating to optimized grouping calculations), performance, top request (one of our most-requested issues)

Comments

@jangorecki
Member

Tested on the most recent data.table. Not always, but quite often, uniqueN is slower than length(unique())...

library(data.table)
library(microbenchmark)
N <- 1e6
DT <- data.table(x = sample(1e5,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 85.58602 85.58602 85.58602 85.58602 85.58602 85.58602     1
#                          DT[, uniqueN(x), y] 92.71877 92.71877 92.71877 92.71877 92.71877 92.71877     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 97.51024 97.51024 97.51024 97.51024 97.51024 97.51024     1
N <- 1e7
DT <- data.table(x = sample(1e5,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr       min        lq      mean    median        uq       max neval
#                   DT[, length(unique(x)), y] 1642.5212 1642.5212 1642.5212 1642.5212 1642.5212 1642.5212     1
#                          DT[, uniqueN(x), y]  843.0670  843.0670  843.0670  843.0670  843.0670  843.0670     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  804.7881  804.7881  804.7881  804.7881  804.7881  804.7881     1
N <- 1e7
DT <- data.table(x = sample(1e6,N,TRUE), y = sample(1e5,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: seconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 3.025365 3.025365 3.025365 3.025365 3.025365 3.025365     1
#                          DT[, uniqueN(x), y] 4.734323 4.734323 4.734323 4.734323 4.734323 4.734323     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 5.905721 5.905721 5.905721 5.905721 5.905721 5.905721     1
N <- 1e7
DT <- data.table(x = sample(1e3,N,TRUE), y = sample(1e5,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: seconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 2.906589 2.906589 2.906589 2.906589 2.906589 2.906589     1
#                          DT[, uniqueN(x), y] 4.731925 4.731925 4.731925 4.731925 4.731925 4.731925     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 7.084020 7.084020 7.084020 7.084020 7.084020 7.084020     1
N <- 1e7
DT <- data.table(x = sample(1e6,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 1331.244 1331.244 1331.244 1331.244 1331.244 1331.244     1
#                          DT[, uniqueN(x), y]  998.040  998.040  998.040  998.040  998.040  998.040     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 1096.867 1096.867 1096.867 1096.867 1096.867 1096.867     1
N <- 1e7
DT <- data.table(x = sample(letters,N,TRUE), y = sample(letters[1:10],N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr       min        lq      mean    median        uq       max neval
#                   DT[, length(unique(x)), y] 1304.4865 1304.4865 1304.4865 1304.4865 1304.4865 1304.4865     1
#                          DT[, uniqueN(x), y]  573.8628  573.8628  573.8628  573.8628  573.8628  573.8628     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  528.3269  528.3269  528.3269  528.3269  528.3269  528.3269     1
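For comparison, the grouped distinct count can also be computed without calling any per-group function at all, by first collapsing to unique (y, x) pairs and then counting rows per y. A minimal sketch on the same shape of data as the first benchmark above (timings will of course vary):

```r
library(data.table)

set.seed(1)
N <- 1e6
DT <- data.table(x = sample(1e5, N, TRUE), y = sample(1e2, N, TRUE))

# per-group call: uniqueN() is evaluated once for every value of y
res1 <- DT[, .(ux = uniqueN(x)), keyby = y]

# single pass: collapse to distinct (y, x) pairs, then count rows per y
res2 <- DT[, .N, keyby = .(y, x)][, .(ux = .N), keyby = y]

stopifnot(identical(res1$ux, res2$ux))
```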

Related SO: http://stackoverflow.com/a/29684533/2490497

R version 3.1.3 (2015-03-09)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_DK.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.5     microbenchmark_1.4-2

loaded via a namespace (and not attached):
 [1] bitops_1.0-6     chron_2.3-45     colorspace_1.2-4 devtools_1.7.0   digest_0.6.8     evaluate_0.5.5   formatR_1.0      ggplot2_1.0.0    grid_3.1.3      
[10] gtable_0.1.2     httr_0.6.1       knitr_1.8        MASS_7.3-37      munsell_0.4.2    plyr_1.8.1       proto_0.3-10     Rcpp_0.11.4      RCurl_1.95-4.5  
[19] reshape2_1.4.1   scales_0.2.4     stringr_0.6.2    tools_3.1.3    
@ben519

ben519 commented Nov 6, 2016

Came looking for this. I run into this issue a lot, most recently a case that was unbearably slow. My case looks more like this:

dt <- data.table(
  A=sample(100000, 1000000, replace=TRUE), 
  B=sample(100000, 1000000, replace=TRUE), 
  C=sample(1000000, 1000000, replace=TRUE)
)

# slow
system.time(result1 <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
#    user  system elapsed 
#  12.132   0.038  12.178 

# fast
system.time(result2 <- dt[, list(1), keyby=list(A, B, C)][, list(UniqueCs=.N), keyby=list(A, B)])
#    user  system elapsed 
#   0.374   0.013   0.387 

I'd expect uniqueN to take about as long as aggregating by its argument.
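The fast variant could be wrapped in a small helper so the call site reads like the uniqueN version (countDistinctBy is a hypothetical name, not a data.table function):

```r
library(data.table)

# hypothetical helper: distinct count of one column per group,
# implemented as dedup-by-key followed by .N (the fast idiom above)
countDistinctBy <- function(dt, col, by) {
  dt[, .N, keyby = c(by, col)][, .(UniqueCs = .N), keyby = by]
}

set.seed(1)
dt <- data.table(
  A = sample(1000, 100000, replace = TRUE),
  B = sample(1000, 100000, replace = TRUE),
  C = sample(100000, 100000, replace = TRUE)
)
res <- countDistinctBy(dt, "C", c("A", "B"))
```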

@MichaelChirico
Member

Confirming timings of @ben519...

Ran on 1.9.6:

system.time(result <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
#    user  system elapsed 
#   8.032   0.004   8.029 
system.time(result <- dt[, list(UniqueCs=.N), keyby=list(A, B, C)])
#    user  system elapsed 
#   0.496   0.004   0.498 

Ran on 1.9.7:

system.time(result <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
#    user  system elapsed 
#  11.764   0.488   9.706 
system.time(result <- dt[, list(UniqueCs=.N), keyby=list(A, B, C)])
#    user  system elapsed 
#   0.100   0.008   0.109 

(I missed his edit, but the difference is marginal)

@MichaelChirico
Member

Update if improved:

ben519/mltools#10

@MichaelChirico
Member

An ideal case where uniqueN is much faster than the alternatives (list, i.e. non-scalar, input here):

https://stackoverflow.com/a/53890905/3576984

@sindribaldur

sindribaldur commented Feb 14, 2019

When used with a by argument and many groups, uniqueN()'s slowness is very bad:

irisdt <- setDT(iris[sample(1:150, size = 10000, replace = TRUE), ])
irisdt[, Sepal.Width := Sepal.Width + sample(0:50, size = 10000, replace = TRUE)]
irisdt[, Sepal.Length := Sepal.Width + sample(0:5000, size = 10000, replace = TRUE)]


microbenchmark::microbenchmark(
  irisdt[, uniqueN(Sepal.Width), Sepal.Length],
  irisdt[, length(unique(Sepal.Width)), Sepal.Length],
  times = 2
)
Unit: milliseconds
                                                expr        min         lq       mean     median         uq        max neval cld
        irisdt[, uniqueN(Sepal.Width), Sepal.Length] 3592.42280 3592.42280 3592.65173 3592.65173 3592.88065 3592.88065     2   b
 irisdt[, length(unique(Sepal.Width)), Sepal.Length]   73.84953   73.84953   79.74312   79.74312   85.63672   85.63672     2  a 

@MichaelChirico
Member

MichaelChirico commented Feb 14, 2019

Using unique is the best way to go:

irisdt <- setDT(iris[sample(1:150, size = 10000, replace = TRUE), ])
irisdt[, Sepal.Width := Sepal.Width + sample(0:50, size = 10000, replace = TRUE)]
irisdt[, Sepal.Length := Sepal.Width + sample(0:5000, size = 10000, replace = TRUE)]


microbenchmark::microbenchmark(
  irisdt[, uniqueN(Sepal.Width), Sepal.Length],
  irisdt[, length(unique(Sepal.Width)), Sepal.Length],
  unique(irisdt, by = c('Sepal.Length', 'Sepal.Width'))[ , .N, by = Sepal.Length],
  times = 100
)
# Unit: milliseconds
#                                                                            expr        min         lq
#                                    irisdt[, uniqueN(Sepal.Width), Sepal.Length] 235.857762 284.023470
#                             irisdt[, length(unique(Sepal.Width)), Sepal.Length]  56.797016  70.049278
#  unique(irisdt, by = c("Sepal.Length", "Sepal.Width"))[, .N, by = Sepal.Length]   4.076486   4.652738
#        mean     median         uq       max neval
#  370.566000 328.691682 392.016670 968.17539   100
#   73.354643  72.797490  74.845989 130.11590   100
#    5.489569   4.801915   5.080524  55.50387   100

@DavidArenburg
Member

DavidArenburg commented Feb 14, 2019

Or setkey(irisdt, Sepal.Width, Sepal.Length); irisdt[, .N, by = .(Sepal.Width, Sepal.Length)][, .N, by = Sepal.Length], which will be faster than unique by about 30% and about 8x faster than length(unique()).

But this seems unrelated to the fact that uniqueN is about 70x slower than length(unique()).
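Spelled out on the data from the earlier comments, the keyed variant suggested here looks like this (a sketch; absolute timings depend on the machine):

```r
library(data.table)

set.seed(1)
irisdt <- setDT(iris[sample(1:150, size = 10000, replace = TRUE), ])
irisdt[, Sepal.Width := Sepal.Width + sample(0:50, size = 10000, replace = TRUE)]
irisdt[, Sepal.Length := Sepal.Width + sample(0:5000, size = 10000, replace = TRUE)]

setkey(irisdt, Sepal.Width, Sepal.Length)
# count distinct Sepal.Width per Sepal.Length via two grouped passes
res <- irisdt[, .N, by = .(Sepal.Width, Sepal.Length)][, .N, by = Sepal.Length]
```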

@sindribaldur

Not sure about GitHub etiquette... should I reply? Anyway, I just wanted to point out that uniqueN() performs particularly badly in this setting. That would be fine, except one has come to expect data.table to outperform almost anything in almost any setting, so maybe there is an issue here? My actual application is somewhat different, but I'm doing fine using uniqueN2 <- function(x) length(unique(x)), which also does much better than dplyr::n_distinct().

@MichaelChirico
Member

MichaelChirico commented Feb 14, 2019 via email

@jangorecki
Member Author

jangorecki commented Mar 14, 2019

Related: #3395, #3438
The root of this problem is that uniqueN is called once for every group. uniqueN calls forder, which is multithreaded, so a new team of OpenMP threads has to be formed for each group. This will be resolved by implementing a guniqueN function.
Additionally, we could force calls in j that are not GForce functions to run single-threaded (at least our own, by locally setting DTthreads to 1) @mattdowle. That would "resolve" this and similar problems, though it might still result in slower performance when there are very few, big groups.
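The local-thread-setting idea can already be approximated from user code by pinning data.table to one thread around the grouped call (a sketch; whether it actually helps depends on the number and size of groups):

```r
library(data.table)

set.seed(1)
DT <- data.table(x = sample(1e5, 1e6, TRUE), y = sample(1e4, 1e6, TRUE))

old <- getDTthreads()
setDTthreads(1)  # avoid forming an OpenMP thread team inside every group
res <- DT[, uniqueN(x), by = y]
setDTthreads(old)  # restore the previous thread count
```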

@jangorecki jangorecki changed the title uniqueN slower than length(unique()) implement guniqueN Mar 15, 2019
@jangorecki jangorecki added the GForce issues relating to optimized grouping calculations (GForce) label Mar 15, 2019
@jangorecki
Member Author

Another case where setting threads to 1 would probably help is the new fifelse function: 93cc9ab

@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@jangorecki jangorecki removed this from the 1.16.0 milestone Nov 6, 2023
@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024