parallelise subset in [.data.table operator #2951
In dev as of #3170, subsetting is now parallel within column. So:
library(data.table)
N=1e8; K=100
set.seed(1)
DT <- data.table(
id1 = sample(sprintf("id%03d",1:K), N, TRUE), # any character profile will do
id2 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # any character profile will do
v1 = sample(5, N, TRUE), # int in range [1,5]
v2 = sample(5, N, TRUE), # int in range [1,5]
v3 = sample(5, N, TRUE), # int in range [1,5]
v4 = sample(5, N, TRUE), # int in range [1,5]
v5 = sample(5, N, TRUE), # int in range [1,5]
v6 = sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
x = which(DT$v1 > 3)
length(x)/nrow(DT) # select 40% of the rows
# dev v1.11.8
# --- ---
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT))) # 1.6 1.9 seconds
system.time(DT[x]) # 1.7 4.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT))) # 1.9 2.7
system.time(DT[x]) # 2.0 4.9
DT[,id1:=as.factor(id1)]
DT[,id2:=as.factor(id2)]
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT))) # 0.4 0.6
system.time(DT[x]) # 0.6 3.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT))) # 0.8 1.5
system.time(DT[x]) # 1.0 3.6
So as of dev now, using
Now with #3210 merged,
# dev v1.11.8
# --- ---
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT))) # 1.6 1.9 seconds
system.time(DT[x]) # 1.6 4.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT))) # 1.9 2.7
system.time(DT[x]) # 1.9 4.9
DT[,id1:=as.factor(id1)]
DT[,id2:=as.factor(id2)]
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT))) # 0.4 0.6
system.time(DT[x]) # 0.4 3.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT))) # 0.8 1.5
system.time(DT[x]) # 0.8 3.6
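In both runs above, the factor columns subset faster than the character ones. One thing that can be checked directly is that a factor is stored as an integer vector of codes plus a levels attribute, so subsetting it copies integers and leaves the levels untouched. A minimal sketch on a tiny table follows. The table `small` and its sizes are made up for illustration, and whether this fully accounts for the timing gap is an assumption rather than something stated in the thread.

```r
library(data.table)

set.seed(1)
# Tiny stand-in for the benchmark table (hypothetical name and size)
small <- data.table(id1 = sample(sprintf("id%03d", 1:100), 1000, TRUE))
type_chr <- typeof(small$id1)    # storage type while id1 is character

small[, id1 := as.factor(id1)]
type_fct <- typeof(small$id1)    # storage type once id1 is a factor

# Subset by row numbers, as in the benchmark
x <- which(as.integer(small$id1) <= 50L)
sub <- small[x]

# The levels vector is carried over whole; only the integer codes are subset
same_levels <- identical(levels(sub$id1), levels(small$id1))
```

Here `type_chr` is "character" and `type_fct` is "integer", and `same_levels` is TRUE because the subset does not drop unused levels.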
The internal, non-exported CsubsetDT call was parallelised quite a long time ago, but there is no API for users to call it. I am sure we benefit from it internally, but it should also be put to good use in [.data.table, which currently it is not. Using data from #1660:
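As a scaled-down sketch of the kind of comparison this issue asks for, the following uses only exported functions (`setDTthreads` and `[`) and does not touch the internal call. The helper name `subset_with_threads`, the smaller `N`, and the reduced column set are assumptions for illustration and are not the benchmark from the thread. At this size the timings mostly measure overhead, so the aim is to show the pattern, not to reproduce the numbers.

```r
library(data.table)

# Much smaller than the N=1e8 used in the thread, so this runs in well under a second
N <- 1e5; K <- 100
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),
  v1  = sample(5, N, TRUE),
  v6  = sample(round(runif(100, max = 100), 4), N, TRUE)
)
rows <- which(DT$v1 > 3)

# Hypothetical helper: subset under a given thread count, then restore the
# previous setting (setDTthreads returns the old value invisibly)
subset_with_threads <- function(DT, rows, threads) {
  old <- setDTthreads(threads)
  on.exit(setDTthreads(old))
  elapsed <- system.time(res <- DT[rows])[["elapsed"]]
  list(result = res, elapsed = elapsed)
}

r2 <- subset_with_threads(DT, rows, 2)
r1 <- subset_with_threads(DT, rows, 1)

# The thread count should change speed only, never the result
same <- isTRUE(all.equal(r1$result, r2$result))
```

Comparing `r1$elapsed` with `r2$elapsed` on a larger `N` gives a rough view of how much `[` gains from extra threads on a given machine.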