implement guniqueN #1120

Open
jangorecki opened this issue Apr 16, 2015 · 11 comments
Labels: benchmark, enhancement, GForce (issues relating to optimized grouping calculations), performance, top request (one of our most-requested issues)

Comments

@jangorecki
Member

Tested on the most recent data.table. Not always, but quite often, uniqueN is slower than length(unique())...

library(data.table)
library(microbenchmark)
N <- 1e6
DT <- data.table(x = sample(1e5,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 85.58602 85.58602 85.58602 85.58602 85.58602 85.58602     1
#                          DT[, uniqueN(x), y] 92.71877 92.71877 92.71877 92.71877 92.71877 92.71877     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 97.51024 97.51024 97.51024 97.51024 97.51024 97.51024     1
N <- 1e7
DT <- data.table(x = sample(1e5,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr       min        lq      mean    median        uq       max neval
#                   DT[, length(unique(x)), y] 1642.5212 1642.5212 1642.5212 1642.5212 1642.5212 1642.5212     1
#                          DT[, uniqueN(x), y]  843.0670  843.0670  843.0670  843.0670  843.0670  843.0670     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  804.7881  804.7881  804.7881  804.7881  804.7881  804.7881     1
N <- 1e7
DT <- data.table(x = sample(1e6,N,TRUE), y = sample(1e5,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: seconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 3.025365 3.025365 3.025365 3.025365 3.025365 3.025365     1
#                          DT[, uniqueN(x), y] 4.734323 4.734323 4.734323 4.734323 4.734323 4.734323     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 5.905721 5.905721 5.905721 5.905721 5.905721 5.905721     1
N <- 1e7
DT <- data.table(x = sample(1e3,N,TRUE), y = sample(1e5,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: seconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 2.906589 2.906589 2.906589 2.906589 2.906589 2.906589     1
#                          DT[, uniqueN(x), y] 4.731925 4.731925 4.731925 4.731925 4.731925 4.731925     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 7.084020 7.084020 7.084020 7.084020 7.084020 7.084020     1
N <- 1e7
DT <- data.table(x = sample(1e6,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 1331.244 1331.244 1331.244 1331.244 1331.244 1331.244     1
#                          DT[, uniqueN(x), y]  998.040  998.040  998.040  998.040  998.040  998.040     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 1096.867 1096.867 1096.867 1096.867 1096.867 1096.867     1
N <- 1e7
DT <- data.table(x = sample(letters,N,TRUE), y = sample(letters[1:10],N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr       min        lq      mean    median        uq       max neval
#                   DT[, length(unique(x)), y] 1304.4865 1304.4865 1304.4865 1304.4865 1304.4865 1304.4865     1
#                          DT[, uniqueN(x), y]  573.8628  573.8628  573.8628  573.8628  573.8628  573.8628     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  528.3269  528.3269  528.3269  528.3269  528.3269  528.3269     1
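For comparison, the grouped distinct count can also be computed without calling any per-group function at all, by first collapsing to unique (y, x) pairs and then counting rows per y. A minimal sketch on the same shape of data as the first benchmark above (timings will of course vary):

```r
library(data.table)

set.seed(1)
N <- 1e6
DT <- data.table(x = sample(1e5, N, TRUE), y = sample(1e2, N, TRUE))

# per-group call: uniqueN() is evaluated once for every value of y
res1 <- DT[, .(ux = uniqueN(x)), keyby = y]

# single pass: collapse to distinct (y, x) pairs, then count rows per y
res2 <- DT[, .N, keyby = .(y, x)][, .(ux = .N), keyby = y]

stopifnot(identical(res1$ux, res2$ux))
```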

Related SO: http://stackoverflow.com/a/29684533/2490497

R version 3.1.3 (2015-03-09)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_DK.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.5     microbenchmark_1.4-2

loaded via a namespace (and not attached):
 [1] bitops_1.0-6     chron_2.3-45     colorspace_1.2-4 devtools_1.7.0   digest_0.6.8     evaluate_0.5.5   formatR_1.0      ggplot2_1.0.0    grid_3.1.3      
[10] gtable_0.1.2     httr_0.6.1       knitr_1.8        MASS_7.3-37      munsell_0.4.2    plyr_1.8.1       proto_0.3-10     Rcpp_0.11.4      RCurl_1.95-4.5  
[19] reshape2_1.4.1   scales_0.2.4     stringr_0.6.2    tools_3.1.3    
@ben519

ben519 commented Nov 6, 2016

Came looking for this. I run into this issue a lot, most recently a case that was unbearably slow. My case looks more like this:

dt <- data.table(
  A=sample(100000, 1000000, replace=TRUE), 
  B=sample(100000, 1000000, replace=TRUE), 
  C=sample(1000000, 1000000, replace=TRUE)
)

# slow
system.time(result1 <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
#    user  system elapsed 
#  12.132   0.038  12.178 

# fast
system.time(result2 <- dt[, list(1), keyby=list(A, B, C)][, list(UniqueCs=.N), keyby=list(A, B)])
#    user  system elapsed 
#   0.374   0.013   0.387 

I'd expect uniqueN to take about as long as aggregating by its argument.
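The fast variant could be wrapped in a small helper so the call site reads like the uniqueN version (countDistinctBy is a hypothetical name, not a data.table function):

```r
library(data.table)

# hypothetical helper: distinct count of one column per group,
# implemented as dedup-by-key followed by .N (the fast idiom above)
countDistinctBy <- function(dt, col, by) {
  dt[, .N, keyby = c(by, col)][, .(UniqueCs = .N), keyby = by]
}

set.seed(1)
dt <- data.table(
  A = sample(1000, 100000, replace = TRUE),
  B = sample(1000, 100000, replace = TRUE),
  C = sample(100000, 100000, replace = TRUE)
)
res <- countDistinctBy(dt, "C", c("A", "B"))
```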

@MichaelChirico
Member

Confirming timings of @ben519...

Ran on 1.9.6:

system.time(result <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
#    user  system elapsed 
#   8.032   0.004   8.029 
system.time(result <- dt[, list(UniqueCs=.N), keyby=list(A, B, C)])
#    user  system elapsed 
#   0.496   0.004   0.498 

Ran on 1.9.7:

system.time(result <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
#    user  system elapsed 
#  11.764   0.488   9.706 
system.time(result <- dt[, list(UniqueCs=.N), keyby=list(A, B, C)])
#    user  system elapsed 
#   0.100   0.008   0.109 

(I missed his edit, but the difference is marginal)

@MichaelChirico
Member

Update if improved:

ben519/mltools#10

@MichaelChirico
Member

An ideal case where uniqueN is much faster than the alternatives (list, i.e. non-scalar, input here):

https://stackoverflow.com/a/53890905/3576984

@sindribaldur

sindribaldur commented Feb 14, 2019

When used with a by argument and many groups, uniqueN()'s slowness is very bad:

irisdt <- setDT(iris[sample(1:150, size = 10000, replace = TRUE), ])
irisdt[, Sepal.Width := Sepal.Width + sample(0:50, size = 10000, replace = TRUE)]
irisdt[, Sepal.Length := Sepal.Width + sample(0:5000, size = 10000, replace = TRUE)]


microbenchmark::microbenchmark(
  irisdt[, uniqueN(Sepal.Width), Sepal.Length],
  irisdt[, length(unique(Sepal.Width)), Sepal.Length],
  times = 2
)
Unit: milliseconds
                                                expr        min         lq       mean     median         uq        max neval cld
        irisdt[, uniqueN(Sepal.Width), Sepal.Length] 3592.42280 3592.42280 3592.65173 3592.65173 3592.88065 3592.88065     2   b
 irisdt[, length(unique(Sepal.Width)), Sepal.Length]   73.84953   73.84953   79.74312   79.74312   85.63672   85.63672     2  a 

@MichaelChirico
Member

MichaelChirico commented Feb 14, 2019

Using unique is the best way to go:

irisdt <- setDT(iris[sample(1:150, size = 10000, replace = TRUE), ])
irisdt[, Sepal.Width := Sepal.Width + sample(0:50, size = 10000, replace = TRUE)]
irisdt[, Sepal.Length := Sepal.Width + sample(0:5000, size = 10000, replace = TRUE)]


microbenchmark::microbenchmark(
  irisdt[, uniqueN(Sepal.Width), Sepal.Length],
  irisdt[, length(unique(Sepal.Width)), Sepal.Length],
  unique(irisdt, by = c('Sepal.Length', 'Sepal.Width'))[ , .N, by = Sepal.Length],
  times = 100
)
# Unit: milliseconds
#                                                                            expr        min         lq
#                                    irisdt[, uniqueN(Sepal.Width), Sepal.Length] 235.857762 284.023470
#                             irisdt[, length(unique(Sepal.Width)), Sepal.Length]  56.797016  70.049278
#  unique(irisdt, by = c("Sepal.Length", "Sepal.Width"))[, .N, by = Sepal.Length]   4.076486   4.652738
#        mean     median         uq       max neval
#  370.566000 328.691682 392.016670 968.17539   100
#   73.354643  72.797490  74.845989 130.11590   100
#    5.489569   4.801915   5.080524  55.50387   100

@DavidArenburg
Member

DavidArenburg commented Feb 14, 2019

Or setkey(irisdt, Sepal.Width, Sepal.Length); irisdt[, .N, by = .(Sepal.Width, Sepal.Length)][, .N, by = Sepal.Length], which will be faster than unique by about 30% and about 8x faster than length(unique()).

But this seems unrelated to the fact that uniqueN is about 70x slower than length(unique()).
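Spelled out on the data from the earlier comments, the keyed variant suggested here looks like this (a sketch; absolute timings depend on the machine):

```r
library(data.table)

set.seed(1)
irisdt <- setDT(iris[sample(1:150, size = 10000, replace = TRUE), ])
irisdt[, Sepal.Width := Sepal.Width + sample(0:50, size = 10000, replace = TRUE)]
irisdt[, Sepal.Length := Sepal.Width + sample(0:5000, size = 10000, replace = TRUE)]

setkey(irisdt, Sepal.Width, Sepal.Length)
# count distinct Sepal.Width per Sepal.Length via two grouped passes
res <- irisdt[, .N, by = .(Sepal.Width, Sepal.Length)][, .N, by = Sepal.Length]
```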

@sindribaldur

Not sure about GitHub etiquette... should I reply? Anyway, I just wanted to point out that uniqueN() performs particularly badly in this setting. That would be fine, except one has come to expect data.table to outperform almost anything in almost any setting, so maybe there is an issue here? My actual application is somewhat different, but I'm doing fine using uniqueN2 <- function(x) length(unique(x)), which also does much better than dplyr::n_distinct().

@MichaelChirico
Member

MichaelChirico commented Feb 14, 2019 via email

@jangorecki
Member Author

jangorecki commented Mar 14, 2019

Related: #3395, #3438
The root of this problem is that uniqueN is called once for every group. uniqueN calls forder, which is multithreaded, so a new team of OpenMP threads has to be formed for each group. This will be resolved by implementing a guniqueN function.
Additionally, we could force calls in j that are not GForce functions to run single-threaded (at least our own, by locally setting DTthreads to 1) @mattdowle. That would "resolve" this and similar problems, though it might still result in slower performance when there are very few, big groups.
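The local-thread-setting idea can already be approximated from user code by pinning data.table to one thread around the grouped call (a sketch; whether it actually helps depends on the number and size of groups):

```r
library(data.table)

set.seed(1)
DT <- data.table(x = sample(1e5, 1e6, TRUE), y = sample(1e4, 1e6, TRUE))

old <- getDTthreads()
setDTthreads(1)  # avoid forming an OpenMP thread team inside every group
res <- DT[, uniqueN(x), by = y]
setDTthreads(old)  # restore the previous thread count
```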

@jangorecki jangorecki changed the title uniqueN slower than length(unique()) implement guniqueN Mar 15, 2019
@jangorecki jangorecki added the GForce issues relating to optimized grouping calculations (GForce) label Mar 15, 2019
@jangorecki
Member Author

Another case where setting threads to 1 would probably help is the new fifelse function: 93cc9ab

@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@jangorecki jangorecki removed this from the 1.16.0 milestone Nov 6, 2023
@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024