[FEA] Refactor ORC's dictionary encoding using cuCo #10495

Closed
devavret opened this issue Mar 23, 2022 · 5 comments
Labels: cuIO, libcudf, Performance

Comments

@devavret
Contributor

The ORC writer applies dictionary encoding to string columns using a custom hash-map-like structure. That structure is no longer needed, since we can use cuCo's hash maps instead.

Using #8476 as a template, we should be able to refactor and clean up the dictionary encoding code that currently resides in `src/io/orc/dict_enc.cu`.
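
For readers unfamiliar with the technique, here is a minimal host-side sketch of hash-map-based dictionary encoding. It is only a CPU analogue: it uses `std::unordered_map` rather than a cuCo map, and the names `DictEncoded` and `dict_encode` are hypothetical, not taken from libcudf.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type: the unique values plus one index per input row.
struct DictEncoded {
  std::vector<std::string> dictionary;  // unique values, in first-seen order
  std::vector<std::uint32_t> indices;   // dictionary index for each row
};

// Assumption: a plain host hash map stands in for the GPU hash map.
inline DictEncoded dict_encode(const std::vector<std::string>& column) {
  DictEncoded out;
  std::unordered_map<std::string, std::uint32_t> index_of;
  out.indices.reserve(column.size());
  for (const auto& value : column) {
    // Guard against overflowing the 32-bit index type used in this sketch.
    assert(out.dictionary.size() < std::numeric_limits<std::uint32_t>::max());
    const auto next = static_cast<std::uint32_t>(out.dictionary.size());
    // try_emplace inserts only if the value has not been seen before.
    const auto [it, inserted] = index_of.try_emplace(value, next);
    if (inserted) out.dictionary.push_back(value);
    out.indices.push_back(it->second);
  }
  return out;
}
```

Each distinct string is stored once in `dictionary`, and every row is replaced by a small integer, which is what makes the encoding compact for columns with many repeated values.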

@devavret devavret added feature request New feature or request Needs Triage Need team to review and classify labels Mar 23, 2022
@GregoryKimball GregoryKimball added the cuIO cuIO issue label Mar 23, 2022
@PointKernel PointKernel self-assigned this Mar 23, 2022
@github-actions

github-actions bot commented May 5, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@PointKernel
Member

Still relevant

@github-actions

github-actions bot commented Jun 4, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball GregoryKimball added Performance Performance related issue tech debt and removed Needs Triage Need team to review and classify feature request New feature or request labels Jun 28, 2022
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed inactive-90d labels Apr 2, 2023
@vuule vuule assigned vuule and unassigned PointKernel Jun 29, 2023
rapids-bot bot pushed a commit that referenced this issue Jul 14, 2023
…3580)

Issue #13326, #10495

This PR reimplements the creation of stripe dictionaries in the ORC writer to eliminate row group size limitations.
The new implementation uses `cuco::static_map` in a way that's very similar to the Parquet writer.

The PR brings large performance gains because the per-column, per-stripe sorting, which invoked hundreds of Thrust calls, is now removed.
Also verified that the original row group size limit (2^16) for dictionary encoding is removed, so dictionary encoding can now be applied to large lists of strings.
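
As a rough illustration of per-stripe dictionaries built in a single hash-map pass with no sorting, here is a host-only sketch. It is not the PR's code: `std::unordered_map` stands in for `cuco::static_map`, the names `StripeDictionary` and `build_stripe_dictionaries` are hypothetical, and the fixed `rows_per_stripe` split is an assumption made for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-stripe result: unique values and per-row indices.
struct StripeDictionary {
  std::vector<std::string> entries;    // unique values, in first-seen order
  std::vector<std::uint32_t> indices;  // one entry index per row of the stripe
};

// Assumption: stripes are fixed-size row ranges of a single string column.
inline std::vector<StripeDictionary> build_stripe_dictionaries(
    const std::vector<std::string>& column, std::size_t rows_per_stripe) {
  assert(rows_per_stripe > 0);
  std::vector<StripeDictionary> stripes;
  for (std::size_t begin = 0; begin < column.size(); begin += rows_per_stripe) {
    const std::size_t end = std::min(begin + rows_per_stripe, column.size());
    StripeDictionary stripe;
    // A fresh map per stripe: dictionaries are not shared across stripes.
    std::unordered_map<std::string, std::uint32_t> index_of;
    for (std::size_t row = begin; row < end; ++row) {
      const auto next = static_cast<std::uint32_t>(stripe.entries.size());
      const auto [it, inserted] = index_of.try_emplace(column[row], next);
      if (inserted) stripe.entries.push_back(column[row]);
      stripe.indices.push_back(it->second);
    }
    stripes.push_back(std::move(stripe));
  }
  return stripes;
}
```

Because this sketch uses 32-bit indices and does one insert-or-find per row, a stripe with more than 2^16 rows or distinct values needs no special handling here.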

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13580
@GregoryKimball
Contributor

Closed by #13580
