Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove support for skip_rows / num_rows options in the parquet reader. #11503

Merged

Conversation

nvdbaranec
Copy link
Contributor

@nvdbaranec nvdbaranec commented Aug 9, 2022

Removes support for skip_rows / num_rows options in the parquet reader. Users retain control of what gets read via row groups.

Did some before/after benchmarking. As expected, this doesn't change much except for a minor boost in list reading (due to simplification of the preprocessing step). Most of the ways the row bounds affected the code was in the page setup process (making it slippery to think through the logic) and didn't do much in the actual process of decoding. A selection of before/after benchmarks (all input files ~512 MB)

ParquetRead/integral_buffer_input/29/1000/32/0/1/manual_time
Before:  bytes_per_second=31.4564G/s
After:   bytes_per_second=31.58G/s

ParquetRead/floats_buffer_input/31/1000/32/0/1/manual_time
Before:  bytes_per_second=49.2819G/s
After:   bytes_per_second=49.7408G/s

ParquetRead/string_file_input/23/1000/32/0/0/manual_time
Before:  bytes_per_second=24.634G/s
After:   bytes_per_second=24.6563G/s

ParquetRead/string_buffer_input/23/0/1/0/1/manual_time
Before:  bytes_per_second=5.03313G/s
After:   bytes_per_second=5.03535G/s

ParquetRead/list_buffer_input/24/0/1/1/1/manual_time
Before:  bytes_per_second=1.11488G/s
After:   bytes_per_second=1.31447G/s

@nvdbaranec nvdbaranec added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function breaking Breaking change labels Aug 9, 2022
@nvdbaranec nvdbaranec requested review from a team as code owners August 9, 2022 22:36
@nvdbaranec nvdbaranec added the cuIO cuIO issue label Aug 9, 2022
@vuule
Copy link
Contributor

vuule commented Aug 9, 2022

Please post the impact on the reader benchmarks.

Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like these kinds of changes!

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Python binding changes seem a cleanup leftover.

python/cudf/cudf/_lib/parquet.pyx Outdated Show resolved Hide resolved
@github-actions github-actions bot removed the Python Affects Python cuDF API. label Aug 11, 2022
@nvdbaranec
Copy link
Contributor Author

Just realized the CI failure was because of actual compile errors in the benchmarks. Fixed.

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Aug 11, 2022
@galipremsagar
Copy link
Contributor

@gpucibot merge

@codecov
Copy link

codecov bot commented Aug 11, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@fea0bda). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.10   #11503   +/-   ##
===============================================
  Coverage                ?   86.48%           
===============================================
  Files                   ?      145           
  Lines                   ?    22840           
  Branches                ?        0           
===============================================
  Hits                    ?    19753           
  Misses                  ?     3087           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants