Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix the bug when keys contain non UTF8 strings #2566

Merged
merged 9 commits into from
Jan 15, 2018

Conversation

shrektan
Copy link
Member

@shrektan shrektan commented Jan 13, 2018

Closes #2462. #1826 is the older issue...

For strings, data.table compare their values in UTF8 encoding. However, due to missing two ENC2UTF8 in csort() and csort_pre(), the order that data.table creates actually depends on the encoding. On Windows, the fact that the default encoding is not UTF8 leads to some weird output when there're strings in keys.

Please tell me that if there's anything more need to be done, thanks.

Here's what we get now.

library(data.table)
dt <- data.table(
  x = c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失"),
  y = enc2utf8(c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失")),
  z = 1:5
)
setkey(dt, x)
dt[]
##                   x                y z
## 1:         价差收入         价差收入 3
## 2: 公允价值变动损益 公允价值变动损益 1
## 3:     其他业务支出     其他业务支出 4
## 4:         红利收入         红利收入 2
## 5:     资产减值损失     资产减值损失 5


dt[J("公允价值变动损益"), z]
## [1] 1


setkey(dt, y)
dt[]
##                   x                y z
## 1:         价差收入         价差收入 3
## 2: 公允价值变动损益 公允价值变动损益 1
## 3:     其他业务支出     其他业务支出 4
## 4:         红利收入         红利收入 2
## 5:     资产减值损失     资产减值损失 5


dt[J("公允价值变动损益"), z]
## [1] 1


sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C                                                   
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.10.5
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.1  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
##  [5] tools_3.4.1     htmltools_0.3.6 Rcpp_0.12.13    stringi_1.1.5  
##  [9] rmarkdown_1.8   knitr_1.17      stringr_1.2.0   digest_0.6.12  
## [13] evaluate_0.10.1

Without this commit, on a windows machine, the result would be different:

library(data.table)
dt <- data.table(
  x = c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失"),
  y = enc2utf8(c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失")),
  z = 1:5
)
setkey(dt, x)
dt[]
##                   x                y z
## 1: 公允价值变动损益 公允价值变动损益 1
## 2:         红利收入         红利收入 2
## 3:         价差收入         价差收入 3
## 4:     其他业务支出     其他业务支出 4
## 5:     资产减值损失     资产减值损失 5


dt[J("公允价值变动损益"), z]
## [1] NA


setkey(dt, y)
dt[]
##                   x                y z
## 1:         价差收入         价差收入 3
## 2: 公允价值变动损益 公允价值变动损益 1
## 3:     其他业务支出     其他业务支出 4
## 4:         红利收入         红利收入 2
## 5:     资产减值损失     资产减值损失 5'


dt[J("公允价值变动损益"), z]
## [1] 1


sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C                                                   
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.10.5
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.1  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
##  [5] tools_3.4.1     htmltools_0.3.6 Rcpp_0.12.13    stringi_1.1.5  
##  [9] rmarkdown_1.8   knitr_1.17      stringr_1.2.0   digest_0.6.12  
## [13] evaluate_0.10.1

@HughParsonage
Copy link
Member

Thank you for the pull request, but you haven't followed the contribution guidelines:

  1. Every new feature or bug fix must have one or more new tests in inst/tests/tests.Rraw.
  2. Unless the change is trivial (e.g. typo fix) there must be a new entry in NEWS.

Could you add these, making sure you follow the linked contribution guidelines, then push these changes? Once this is done, one of the maintainers will review the pull request.

@shrektan
Copy link
Member Author

@HughParsonage Sorry for the ignorances. Will do it ASAP, thanks.

@codecov-io
Copy link

codecov-io commented Jan 13, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@8394f2f). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff            @@
##             master   #2566   +/-   ##
========================================
  Coverage          ?   91.4%           
========================================
  Files             ?      63           
  Lines             ?   12101           
  Branches          ?       0           
========================================
  Hits              ?   11061           
  Misses            ?    1040           
  Partials          ?       0
Impacted Files Coverage Δ
src/forder.c 97.86% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8394f2f...da94a09. Read the comment docs.

@MichaelChirico
Copy link
Member

good catch 👍

@mattdowle
Copy link
Member

Thanks for the nice PR.
I see the new call to ENC2UTF8() inside a loop which could iterate many times (i.e. deep). I checked that call and saw it is Arun's nice wrapper and agree with his comment there that it should not impact performance due to that if which should short-circuit often.
The only thing I can think is that ENC2UTF8 could be inlined or made a macro. That way the if would benefit from branch prediction too when ENC2UTF8 is called inside a loop, as is the case here. Since if one item in a vector is ASCII (or not) it is very likely that all the items in the vector are ASCII (or not). Leaving that to branch prediction should work better rather than trying to code it manually and dealing with edge cases. With ENC2UTF8 being a function currently, each new function call on the stack restarts the branch predictor, iiuc.

@mattdowle
Copy link
Member

mattdowle commented Jan 15, 2018

Will the PR being from your personal branch, think I have to wait for you to accept my PR to you (why it's failing to pass checks). I went ahead and merged to master.

I've invited you to be a project member. This will allow you to create branch directly in project next time so we can all push to each other's PR branches. Welcome!

@mattdowle mattdowle merged commit da94a09 into Rdatatable:master Jan 15, 2018
@shrektan
Copy link
Member Author

@mattdowle , thanks for merging this and for this great package!

You should have been able to modify my code directly, since I filed this PR with allowing the author to edit my code.

The encoding issue is indeed tricky, especially on Windows. Another idea is simply to enforce all the nonASCII strings in data.table being UTF-8 encoded. However, there're drawbacks as well...

Your commit of making ENC2UTF8() into a Macro should be beneficial for the performance. The current performance should be good enough. To confirm that, I can make a simple example to compare the performance of master to version 1.9.6. Maybe later in this week.

@MichaelChirico
Copy link
Member

MichaelChirico commented Jan 16, 2018

Current build fails for me after @mattdowle 's update:

==> R CMD INSTALL --no-multiarch --with-keep.source data.table

* installing to library ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library’
* installing *source* package ‘data.table’ ...
** libs
/usr/local/opt/llvm/bin/clang -fopenmp -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/opt/gettext/lib -L/usr/local/opt/llvm/lib -Wl,-rpath,/usr/local/opt/llvm/lib -o data.table.so assign.o between.o bmerge.o chmatch.o dogroups.o fastmean.o fcast.o fmelt.o forder.o frank.o fread.o freadR.o fsort.o fwrite.o fwriteR.o gsumm.o ijoin.o init.o inrange.o nqrecreateindices.o openmp-utils.o quickselect.o rbindlist.o reorder.o shift.o subset.o transpose.o uniqlist.o vecseq.o wrappers.o -fopenmp -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
mv data.table.so datatable.so
if [ "" != "Windows_NT" ] && [ `uname -s` = 'Darwin' ]; then install_name_tool -id datatable.so datatable.so; fi
installing to /Library/Frameworks/R.framework/Versions/3.4/Resources/library/data.table/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
Error: package or namespace load failed for ‘data.table’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/Library/Frameworks/R.framework/Versions/3.4/Resources/library/data.table/libs/datatable.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.4/Resources/library/data.table/libs/datatable.so, 6): Symbol not found: _ENC2UTF8
  Referenced from: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/data.table/libs/datatable.so
  Expected in: flat namespace
 in /Library/Frameworks/R.framework/Versions/3.4/Resources/library/data.table/libs/datatable.so
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/data.table’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/data.table’

Exited with status 1.

(this is doing Install & Restart within the .Rproj in RStudio; standard dev installation of install.packages('data.table', type = 'source', repos = 'http://Rdatatable.github.io/data.table') still works...)

Also

R CMD build data.table
R CMD INSTALL data.table_1.10.5.tar.gz

Works as expected. So something about the options in RStudio's build&reload routine is going awry? But I've never seen this type of issue before...

@shrektan
Copy link
Member Author

@MichaelChirico Just tried and it's ok for me on a Windows machine~

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
encoding issues related to Encoding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG REPORT] Query fails if the key contains Chinese Character on Windows7
5 participants