parallelise subset in `[.data.table` operator #2951

jangorecki · 2018-06-24T06:04:40Z

The internal, non-exported CsubsetDT routine was parallelised quite a long time ago, but there is no API for users to call it. I am sure we benefit from it internally, but [.data.table should make good use of it too, and currently it does not. Using data from #1660:

getDTthreads()
#[1] 20
system.time(ans <- .Call("CsubsetDT", x, ix, 1:cols))
#   user  system elapsed 
# 12.611   0.708   4.528 
setDTthreads(1)
system.time(ans <- .Call("CsubsetDT", x, ix, 1:cols))
#   user  system elapsed 
# 10.948   0.648  11.596 
system.time(x[ix])
#   user  system elapsed 
# 12.042   0.648  12.689 
setDTthreads(20)
system.time(x[ix])
#   user  system elapsed 
# 12.035   0.608  12.643
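
For anyone who wants to repeat the comparison without the data from #1660, here is a minimal sketch that uses only the exported API (`setDTthreads()` and `[.data.table`). The data set, its size and the column names are made up for illustration; they are not the data behind the timings above, so the numbers will differ.

```r
library(data.table)

# Hypothetical small data set (assumption); the timings above used the data from #1660.
set.seed(1)
n  <- 1e6
x  <- data.table(a = sample(n), b = runif(n), c = sample(letters, n, TRUE))
ix <- sample(n, n / 2)

old <- setDTthreads(1)                          # returns the previous setting invisibly
t1  <- system.time(ans1 <- x[ix])[["elapsed"]]  # single-threaded subset
setDTthreads(old)                               # restore the previous thread setting
tN  <- system.time(ansN <- x[ix])[["elapsed"]]  # subset with the restored setting

cat("1 thread:", t1, " restored threads:", tN, "\n")
```

Whether the two timings differ depends on the installed data.table version and on how many threads are available.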


mattdowle · 2018-12-12T22:51:42Z

In dev as of #3170, subsetting is now parallel within each column, so [.data.table already benefits because it goes via repeated calls to CsubsetVector. However, the index is checked and rechecked for each column, which can be avoided. It is still worth having [.data.table call CsubsetDT, but the benefit isn't as large as before (see below). There is also a speedup in dev vs v1.11.8 simply because v1.11.8 was checking for out-of-bounds and dealing with NA even when it didn't need to.
The timings are highly sensitive to whether any character columns are present, because character columns have to go via the write barrier. Factor columns are much faster.

N=1e8; K=100
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # any character profile will do
  id2 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # any character profile will do
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  sample(5, N, TRUE),                          # int in range [1,5]
  v4 =  sample(5, N, TRUE),                          # int in range [1,5]
  v5 =  sample(5, N, TRUE),                          # int in range [1,5]
  v6 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
x = which(DT$v1 > 3)
length(x)/nrow(DT)  # select 40% of the rows

                                                      #  dev     v1.11.8
                                                      #  ---     ---
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  1.6     1.9       seconds
system.time(DT[x])                                    #  1.7     4.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  1.9     2.7
system.time(DT[x])                                    #  2.0     4.9

DT[,id1:=as.factor(id1)]
DT[,id2:=as.factor(id2)]

setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  0.4     0.6
system.time(DT[x])                                    #  0.6     3.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  0.8     1.5
system.time(DT[x])                                    #  1.0     3.6

So as of current dev, calling CsubsetDT from [.data.table should reduce the DT[x] timings for (8 threads, 1 thread) from (1.7, 2.0) to (1.6, 1.9) seconds with character columns and from (0.6, 1.0) to (0.4, 0.8) seconds with factor columns. Those differences should grow as the number of columns increases. It is also good to call CsubsetDT from [.data.table to reduce the lines of R code and have more going through one central place.
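
The character vs factor effect can be sketched at a much smaller scale using only the exported `[.data.table`. This is an illustration under assumed sizes (N = 1e6 and two columns), not the benchmark above, and the size of the gap will vary by machine and data.table version.

```r
library(data.table)

set.seed(1)
N <- 1e6; K <- 100                                # much smaller than the 1e8 rows above (assumption)
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),  # character column
  v1  = sample(5, N, TRUE)                        # int in range [1,5]
)
x <- which(DT$v1 > 3)

# Subset while id1 is character (goes via the write barrier)
t_chr <- system.time(ans_chr <- DT[x])[["elapsed"]]

# Convert to factor by reference, then repeat the same subset
DT[, id1 := as.factor(id1)]
t_fct <- system.time(ans_fct <- DT[x])[["elapsed"]]

cat("character:", t_chr, " factor:", t_fct, "\n")
```

Both subsets select the same rows; only the storage type of `id1` differs between the two timings.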

mattdowle · 2018-12-13T03:21:21Z

Now with #3210 merged, DT[x] is as fast as calling CsubsetDT directly:

                                                      #  dev     v1.11.8
                                                      #  ---     ---
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  1.6     1.9       seconds
system.time(DT[x])                                    #  1.6     4.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  1.9     2.7
system.time(DT[x])                                    #  1.9     4.9

DT[,id1:=as.factor(id1)]
DT[,id2:=as.factor(id2)]

setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  0.4     0.6
system.time(DT[x])                                    #  0.4     3.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))    #  0.8     1.5
system.time(DT[x])                                    #  0.8     3.6

jangorecki added the openmp label Jun 24, 2018

mattdowle added this to the 1.12.0 milestone Dec 1, 2018

mattdowle mentioned this issue Dec 13, 2018

DT[i] now calls parallel CsubsetDT #3210

Merged

mattdowle closed this as completed in #3210 Dec 13, 2018

mattdowle mentioned this issue Dec 13, 2018

DT[i, cols] should call CsubsetDT #3212

Closed
