Add skiprows and nrows to parquet reader #16214

Merged: 30 commits into rapidsai:branch-24.10 on Aug 1, 2024

Conversation

@lithomas1 (Contributor) commented Jul 8, 2024:

Description

closes #15144

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@lithomas1 lithomas1 added feature request New feature or request non-breaking Non-breaking change labels Jul 8, 2024
@github-actions github-actions bot added Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Jul 8, 2024
@lithomas1 lithomas1 marked this pull request as ready for review July 8, 2024 17:22
@lithomas1 lithomas1 requested a review from a team as a code owner July 8, 2024 17:22
@lithomas1 lithomas1 marked this pull request as draft July 8, 2024 17:25
@lithomas1 lithomas1 marked this pull request as ready for review July 8, 2024 17:59
Comment on lines 845 to 850
# TODO: is this still right?
# Also, do we still care?
# partition_keys uses pyarrow dataset
# (which we can't use anymore after pyarrow is gone)
nrows=nrows,
skip_rows=skip_rows,
Contributor:

No, I think this is wrong. These dfs are concatenated vertically, so after reading each df, one should do:

nrows = max(nrows - len(df), 0)

Updating skip_rows is more complicated because if you skipped all the rows in a file then you can't know if you should reduce skip_rows to zero or to skip_rows - num_rows_in_file.
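
For concreteness, a minimal sketch of the nrows bookkeeping being described, assuming a hypothetical per-file reader read_one_file (skip_rows is deliberately left out, for the reason above):

def read_files_with_nrows(paths, read_one_file, nrows):
    # Illustrative only: the per-file frames are concatenated vertically,
    # so shrink nrows by the number of rows actually read from each file.
    dfs = []
    for path in paths:
        if nrows == 0:
            break
        df = read_one_file(path, nrows=nrows)
        nrows = max(nrows - len(df), 0)
        dfs.append(df)
    return dfs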

Contributor Author (@lithomas1):

Ah, so the nrows and skiprows are not per-file.

That makes sense, thanks for the clarification.

@wence- (Contributor) Jul 9, 2024:

Multiple files are in this regard, I believe, an implementation detail.

Contributor:

But let's cc @rjzamora to confirm

Member:

It feels a bit ugly/inefficient to use nrows and/or skiprows when we are reading from a partitioned dataset, but you are right that we wouldn't want to pass these parameters "as provided" to _read_parquet.

The "efficient" solution would probably have us read the row-count metadata for each element of key_paths up front. This way we could avoid reading unnecessary files/rows altogether. Of course, if the user passes in a filter, we would need to read all data with the filter and perform the row-trimming after the fact.

which we can't use anymore after pyarrow is gone

My understanding is that the pyarrow-removal effort only applies to the cython/c++ level. We are still allowed to depend on pyarrow at the python level (removing pyarrow would be a nightmare for dask-cudf at this point).

Contributor Author (@lithomas1):

It feels a bit ugly/inefficient to use nrows and/or skiprows when we are reading from a partitioned dataset, but you are right that we wouldn't want to pass these parameters "as provided" to _read_parquet.

The "efficient" solution would probably have us read the row-count metadata for each element of key_paths up front. This way we could avoid reading unnecessary files/rows altogether.

I think this makes sense. I'm not too familiar with the parquet code/the format in general, but do we just call read_parquet_metadata on each file to get the row counts? Then we iterate through partitions until we satisfy nrows/skiprows.
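
A rough sketch of that up-front pass (ignoring the filter case discussed just below), assuming pyarrow remains available at the Python level and using its footer metadata for the per-file row counts; the function name and return shape here are illustrative, not the actual implementation:

import pyarrow.parquet as pq

def plan_partition_reads(key_paths, skip_rows, nrows):
    # Illustrative only: work out, per file, how many rows to skip and read,
    # so files that fall entirely inside the skipped region are never read.
    plan = []
    for path in key_paths:
        rows_in_file = pq.ParquetFile(path).metadata.num_rows
        if skip_rows >= rows_in_file:
            skip_rows -= rows_in_file  # this whole file is skipped
            continue
        available = rows_in_file - skip_rows
        take = available if nrows is None else min(nrows, available)
        plan.append((path, skip_rows, take))
        skip_rows = 0
        if nrows is not None:
            nrows -= take
            if nrows == 0:
                break
    return plan  # list of (path, per-file skip_rows, per-file nrows)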

Of course, if the user passes in a filter, we would need to read all data with the filter and perform the row-trimming after the fact.

I think it might be possible to optimize further since row group row counts should be available for us.
(but we'd still have to do filtering post read)

At any rate, I think for now it's probably best to punt on this since we don't need this for anything at the moment.
(We can revisit if it turns out e.g. polars/someone else needs this)

At any rate, I think I might punt on this for now (since it doesn't look like making nrows/skiprows work for a partitioned dataset is high priority), and the rest of this PR is blocking nrows/skiprows support in the polars executor.

which we can't use anymore after pyarrow is gone

My understanding is that the pyarrow-removal effort only applies to the cython/c++ level. We are still allowed to depend on pyarrow at the python level (removing pyarrow would be a nightmare for dask-cudf at this point).

👍

Member:

At any rate, I think I might punt on this for now (since it doesn't look like making nrows/skiprows work for a partitioned dataset is high priority), and the rest of this PR is blocking nrows/skiprows support in the polars executor.

Yes - I don't see any reason to spend time optimizing nrows/skiprows for partitioned datasets. We can always revisit if necessary.

@mhaseeb123 (Member) commented Jul 11, 2024:

Hi @lithomas1, thank you for working on this PR. If it's not too much overhead, do you mind adding num_rows and skip_rows to the ParquetReader as well and closing #16249?

@lithomas1 (Contributor Author):

Hi @lithomas1, thank you for working on this PR. If it's not too much overhead, do you mind adding num_rows and skip_rows to the ParquetReader as well and closing #16249?

Sure, will give it a shot.

@lithomas1 (Contributor Author):

@mhaseeb123

I'm hitting a bug in the chunked parquet reader for skip_rows > 0, so I don't think I can make further progress on this.
#16273

@mhaseeb123 (Member) commented Jul 12, 2024:

@mhaseeb123

I'm hitting a bug in the chunked parquet reader for skip_rows > 0, so I don't think I can make further progress on this. #16273

Thanks for working on this. If you are hitting #16186, then please don't discard your changes. The bug should go away once #16195 merges. If you like, you can pull in the changes from it and retest on your end, but we can also wait until it merges to go ahead with this.

@lithomas1 (Contributor Author):

Thanks for working on this. If you are hitting #16186, then please don't discard your changes. The bug should go away once #16195 merges. If you like, you can pull in the changes from it and retest on your end, but we can also wait until it merges to go ahead with this.

Thanks for the quick fix!

(I can't believe I missed that PR. The auto assigner assigned me to review that one too 😅 )

I currently have my changes stashed away in another branch, and I'll wait for your PR to land to merge that branch here.

@github-actions github-actions bot removed the pylibcudf Issues specific to the pylibcudf package label Jul 22, 2024
@vyasr (Contributor) commented Jul 25, 2024:

Is this PR blocked on resolving #16186 or is there a partial version that we want to land in the interim before that issue is fully resolved?

@mhaseeb123 (Member):

Is this PR blocked on resolving #16186 or is there a partial version that we want to land in the interim before that issue is fully resolved?

Not blocked. We can merge this without adding bindings for num_rows and skip_rows to the chunked PQ reader.

@mhaseeb123 (Member) left a review comment:

Just minor changes needed and should be good to go!

@@ -362,6 +376,8 @@ cpdef read_parquet(filepaths_or_buffers, columns=None, row_groups=None,
filters,
convert_strings_to_categories = False,
use_pandas_metadata = use_pandas_metadata,
skip_rows = skip_rows,
num_rows = nrows,
@mhaseeb123 (Member) Jul 25, 2024:

Let's be consistent with either num_rows or nrows across the files. @galipremsagar I can't find the same option in pyarrow.read_table or pd.read_parquet, so I am not sure what should be preferred here. If it's arbitrary, my vote would be num_rows, to be consistent with the C++ counterpart, but it's not a blocker.

Contributor Author (@lithomas1):

Yeah not sure which is better.

nrows would be consistent with read_csv, and num_rows would be consistent with libcudf.

@mhaseeb123 (Member) Jul 25, 2024:

Let's go with nrows then and move the PR forward to merge! 🙂

@lithomas1 lithomas1 changed the base branch from branch-24.08 to branch-24.10 July 25, 2024 20:08
@mhaseeb123 (Member) left a review comment:

As per the conversation, please use nrows consistently on the Python side except when passing to libcudf. Looks good otherwise!

@github-actions github-actions bot added the pylibcudf Issues specific to the pylibcudf package label Jul 29, 2024
@mhaseeb123 (Member):

Some tests are still failing with the following log. It looks like some more nrows/num_rows replacements are needed.

FAILED io/test_parquet.py::test_read_parquet_basic[0-binary_source_or_sink1-nrows_skiprows0-columns1] - TypeError: read_parquet() got an unexpected keyword argument 'num_rows'

@mhaseeb123 (Member):

Note that nrows only needs to be used on the Python side (read_parquet()); the Cython layer may keep using num_rows if that is in line with the other readers.
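
In other words, the layering being suggested is roughly the following; both signatures are illustrative stand-ins, not the actual cuDF code:

def _cython_read_parquet(filepaths, skip_rows=0, num_rows=None):
    # Stand-in for the Cython/libcudf-facing reader, which keeps the num_rows name.
    raise NotImplementedError

def read_parquet(filepaths, nrows=None, skip_rows=0):
    # Python-facing reader: expose nrows (consistent with read_csv) and translate
    # it to num_rows only at the lower-layer boundary.
    return _cython_read_parquet(filepaths, skip_rows=skip_rows, num_rows=nrows)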

@wence- (Contributor) left a review comment:

(Will need #16442 in to get past the cudf-polars test failures)

Resolved review threads: python/cudf/cudf/tests/test_parquet.py (outdated, 2 threads); python/cudf/cudf/utils/ioutils.py
@lithomas1 (Contributor Author):

/merge

@github-actions github-actions bot added the cudf.polars Issues specific to cudf.polars label Aug 1, 2024
@rapids-bot rapids-bot bot merged commit 9d0c57a into rapidsai:branch-24.10 Aug 1, 2024
81 checks passed
@lithomas1 lithomas1 deleted the parquet-nrows branch August 1, 2024 17:59
Labels
cudf.polars (Issues specific to cudf.polars), feature request (New feature or request), non-breaking (Non-breaking change), pylibcudf (Issues specific to the pylibcudf package), Python (Affects Python cuDF API)
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FEA] Add python bindings in the parquet reader for num_rows/skiprows
6 participants