parallelise subset in [.data.table operator #2951

Closed
jangorecki opened this issue Jun 24, 2018 · 2 comments · Fixed by #3210

@jangorecki
Member

The internal, non-exported C routine CsubsetDT was parallelised quite a long time ago, but there is no API that lets users call it. I am sure we benefit from it internally, but [.data.table itself should make good use of it, and currently it does not. Using data from #1660:

getDTthreads()
#[1] 20
system.time(ans <- .Call("CsubsetDT", x, ix, 1:cols))
#   user  system elapsed 
# 12.611   0.708   4.528 
setDTthreads(1)
system.time(ans <- .Call("CsubsetDT", x, ix, 1:cols))
#   user  system elapsed 
# 10.948   0.648  11.596 
system.time(x[ix])
#   user  system elapsed 
# 12.042   0.648  12.689 
setDTthreads(20)
system.time(x[ix])
#   user  system elapsed 
# 12.035   0.608  12.643 
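
The comparison above calls an internal symbol and uses the data from #1660, which is not reproduced here. As a rough, smaller-scale sketch that uses only the exported `setDTthreads()` and `getDTthreads()`, something like the following could be used to check how `x[ix]` scales with threads on a given installation. The table `x`, the index `ix`, and the sizes are hypothetical stand-ins, not the #1660 data, and the timings will depend on the machine and the data.table version.

```r
library(data.table)

# Hypothetical stand-in data; sizes are arbitrary and much smaller than #1660.
set.seed(1)
n  <- 1e6L
x  <- data.table(a = sample.int(n), b = runif(n), c = sample(letters, n, TRUE))
ix <- sample.int(n, n %/% 2L)

old <- getDTthreads()          # remember the current setting

setDTthreads(1L)               # single-threaded subset
t_single <- system.time(ans1 <- x[ix])[["elapsed"]]

setDTthreads(old)              # restore the previous thread count
t_multi <- system.time(ansn <- x[ix])[["elapsed"]]

# The result must not depend on the number of threads.
stopifnot(isTRUE(all.equal(ans1, ansn)))

c(single = t_single, multi = t_multi)
```

If the two elapsed times are close to each other even on large inputs, the `[` path is probably not benefiting from the parallel subset, which is the situation this issue describes.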

mattdowle added this to the 1.12.0 milestone on Dec 1, 2018

@mattdowle
Member

mattdowle commented Dec 12, 2018

In dev, as of #3170, subsetting is now parallel within each column, so [.data.table already benefits because it goes through repeated calls to CsubsetVector. However, that path checks and rechecks the index for every column, which can be avoided. It is still worth having [.data.table call CsubsetDT, but the benefit is smaller than before (see below). Dev is also faster than v1.11.8 simply because v1.11.8 checked for out-of-bounds values and handled NA even when it did not need to.
The timings are highly sensitive to whether any character columns are present, because character columns have to go through the write barrier. Factor columns are much faster.
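
To see the character-versus-factor effect in isolation, a minimal sketch along these lines could be used. The names `DT_chr`, `DT_fct`, and `i`, and the size `n`, are hypothetical; the full benchmark below uses a much larger table, and no particular timings are implied here.

```r
library(data.table)

set.seed(1)
n   <- 1e6L                                    # arbitrary, far smaller than the 1e8 benchmark
ids <- sample(sprintf("id%03d", 1:100), n, TRUE)

DT_chr <- data.table(id = ids)                 # character column
DT_fct <- data.table(id = as.factor(ids))      # same values stored as a factor

i <- which(sample(5L, n, TRUE) > 3L)           # roughly 40% of the rows

t_chr <- system.time(r_chr <- DT_chr[i])[["elapsed"]]
t_fct <- system.time(r_fct <- DT_fct[i])[["elapsed"]]

# Both subsets select the same values; only the storage type differs.
stopifnot(identical(as.character(r_fct$id), r_chr$id))

c(character = t_chr, factor = t_fct)
```

At this size the gap may be too small to measure reliably; increasing `n` should make any difference easier to see.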

N=1e8; K=100
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # any character profile will do
  id2 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # any character profile will do
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  sample(5, N, TRUE),                          # int in range [1,5]
  v4 =  sample(5, N, TRUE),                          # int in range [1,5]
  v5 =  sample(5, N, TRUE),                          # int in range [1,5]
  v6 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
x = which(DT$v1 > 3)
length(x)/nrow(DT)  # select 40% of the rows

                                                      #  dev     v1.11.8
                                                      #  ---     ---
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  1.6     1.9       seconds
system.time(DT[x])                                    #  1.7     4.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  1.9     2.7
system.time(DT[x])                                    #  2.0     4.9

DT[,id1:=as.factor(id1)]
DT[,id2:=as.factor(id2)]

setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  0.4     0.6
system.time(DT[x])                                    #  0.6     3.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  0.8     1.5
system.time(DT[x])                                    #  1.0     3.6

So, as of dev now, calling CsubsetDT from [.data.table should reduce the (8-thread, 1-thread) timings from (1.7, 2.0) to (1.6, 1.9) with character columns and from (0.6, 1.0) to (0.4, 0.8) with factor columns. Those differences should grow as the number of columns increases. It is also good to call CsubsetDT from [.data.table to reduce the lines of R code and to route more through a central place.

@mattdowle
Member

Now that #3210 is merged, DT[x] takes the same time as a direct call to CsubsetDT:

                                                      #  dev     v1.11.8
                                                      #  ---     ---
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  1.6     1.9       seconds
system.time(DT[x])                                    #  1.6     4.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  1.9     2.7
system.time(DT[x])                                    #  1.9     4.9

DT[,id1:=as.factor(id1)]
DT[,id2:=as.factor(id2)]

setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  0.4     0.6
system.time(DT[x])                                    #  0.4     3.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  0.8     1.5
system.time(DT[x])                                    #  0.8     3.6
