I found a serious issue: when a column contains millions of non-ASCII strings in a non-UTF-8 encoding (see the example below), calling setkey() on that column makes data.table extremely slow and eventually throws the error 'translateCharUTF8' must be called on a CHARSXP.
After that error, every subsequent data.table call fails with another error: Internal error: savetl_init checks failed.
I will investigate and report more details later. Hopefully I can file a PR to fix this.
Example
NOTE: you have to run this on a Windows machine with GB2312 as the default encoding (i.e., a Simplified Chinese Windows machine); otherwise it won't reproduce. Also, if it doesn't fail the first time, try running it twice. I've tried this on several machines in my office, and I'm quite confident it's reproducible.
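A minimal sketch of the kind of reproduction described (hypothetical; the exact strings and sizes are assumptions, only the setkey() call on millions of native-encoded non-ASCII strings comes from the report):

```r
library(data.table)

# Assumes a GB2312 (CP936) locale, e.g. a Simplified Chinese Windows machine.
# Build millions of distinct non-ASCII strings in the native, non-UTF-8 encoding.
n <- 1e7
x <- enc2native(paste0("\u4e2d\u6587", seq_len(n)))  # "中文" prefix plus an index

dt <- data.table(v = x)
setkey(dt, v)  # extremely slow, then errors:
               # "'translateCharUTF8' must be called on a CHARSXP"
```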
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] digest_0.6.15 crayon_1.3.4 withr_2.1.1 rprojroot_1.3-2 assertthat_0.2.0
[6] R6_2.2.2 backports_1.1.2 magrittr_1.5 rlang_0.2.0 cli_1.0.0
[11] rstudioapi_0.7.0-9000 testthat_1.0.2.9000 devtools_1.13.3.9000 desc_1.1.1 tools_3.4.3
[16] pkgload_0.0.0.9000 yaml_2.1.16 compiler_3.4.3 pkgbuild_0.0.0.9000 memoise_1.1.0
[21] usethis_1.3.0
UPDATES
I'm quite confident now that ENC2UTF8 is very slow for millions of strings, but I'm still not sure whether this is what causes the issue. Moreover, it's hard to understand why it's slow, because R itself seems to implement enc2utf8 in a similar way. EDIT: enc2utf8() also takes a long time (17s) to convert 1e7 strings, so the slowness itself is not the issue.
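A minimal sketch of that timing check (the string content is an assumption; the 1e7 count and the ~17s figure come from the note above):

```r
# Time enc2utf8() on 1e7 non-ASCII strings held in the native encoding.
x <- enc2native(paste0("\u4e2d\u6587", seq_len(1e7)))
system.time(enc2utf8(x))  # ~17s reported on the author's machine
```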
I suspect it's related to savetl() and savetl_end(). I'm not familiar with how the global string pool works in R, but my guess is that the UTF-8 CHARSXPs created by data.table get released when gc() runs (which would explain why it only occurs when the number of strings is large). If a CHARSXP gets released and savetl_end() then tries to modify the truelength of a no-longer-existing CHARSXP...
It should be related to GC and SEXPs; it may need PROTECT... Basically confirmed...
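If that hypothesis is right, gctorture() should make the bug reproducible with a tiny input instead of millions of strings (a sketch, assuming the same GB2312 locale; not taken from the original report):

```r
library(data.table)

# gctorture(TRUE) runs a garbage collection at every allocation, so an
# unprotected SEXP is collected almost immediately; missing-PROTECT bugs
# then surface with tiny inputs.
dt <- data.table(v = enc2native(paste0("\u4e2d\u6587", 1:100)))
gctorture(TRUE)
setkey(dt, v)   # should fail here if a PROTECT is missing in the sort path
gctorture(FALSE)
```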
After hours of investigating, I think it's highly likely that some protection is missing in csort_pre(). ENC2UTF8 creates a new R object that can be collected during garbage collection. This affects the global string pool as well, leading to an incorrect ustr_n count.
related to #2566