Add skiprows and nrows to parquet reader #16214

Merged: 30 commits into rapidsai:branch-24.10 on Aug 1, 2024

Conversation

@lithomas1 (Contributor) commented Jul 8, 2024:

Description

closes #15144

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@lithomas1 lithomas1 added feature request New feature or request non-breaking Non-breaking change labels Jul 8, 2024
@github-actions github-actions bot added Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Jul 8, 2024
@lithomas1 lithomas1 marked this pull request as ready for review July 8, 2024 17:22
@lithomas1 lithomas1 requested a review from a team as a code owner July 8, 2024 17:22
@lithomas1 lithomas1 marked this pull request as draft July 8, 2024 17:25
@lithomas1 lithomas1 marked this pull request as ready for review July 8, 2024 17:59
Comment on lines 845 to 850
# TODO: is this still right?
# Also, do we still care?
# partition_keys uses pyarrow dataset
# (which we can't use anymore after pyarrow is gone)
nrows=nrows,
skip_rows=skip_rows,
Contributor:

No, I think this is wrong. These dfs are concatenated vertically, so after reading each df, one should do:

nrows = max(nrows - len(df), 0)

Updating skip_rows is more complicated because if you skipped all the rows in a file then you can't know if you should reduce skip_rows to zero or to skip_rows - num_rows_in_file.
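
For concreteness, a minimal sketch of the nrows bookkeeping being described, assuming a hypothetical per-file reader read_one_file (skip_rows is deliberately left out, for the reason above):

def read_files_with_nrows(paths, read_one_file, nrows):
    # Illustrative only: the per-file frames are concatenated vertically,
    # so shrink nrows by the number of rows actually read from each file.
    dfs = []
    for path in paths:
        if nrows == 0:
            break
        df = read_one_file(path, nrows=nrows)
        nrows = max(nrows - len(df), 0)
        dfs.append(df)
    return dfs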

Contributor Author (@lithomas1):

Ah, so the nrows and skiprows are not per-file.

That makes sense, thanks for the clarification.

@wence- (Contributor) Jul 9, 2024:

Multiple files are in this regard, I believe, an implementation detail.

Contributor:

But let's cc @rjzamora to confirm

Member:

It feels a bit ugly/inefficient to use nrows and/or skiprows when we are reading from a partitioned dataset, but you are right that we wouldn't want to pass these parameters "as provided" to _read_parquet.

The "efficient" solution would probably have us read the row-count metadata for each element of key_paths up front. This way we could avoid reading unnecessary files/rows altogether. Of course, if the user passes in a filter, we would need to read all data with the filter and perform the row-trimming after the fact.

which we can't use anymore after pyarrow is gone

My understanding is that the pyarrow-removal effort only applies to the cython/c++ level. We are still allowed to depend on pyarrow at the python level (removing pyarrow would be a nightmare for dask-cudf at this point).

Contributor Author (@lithomas1):

It feels a bit ugly/inefficient to use nrows and/or skiprows when we are reading from a partitioned dataset, but you are right that we wouldn't want to pass these parameters "as provided" to _read_parquet.

The "efficient" solution would probably have us read the row-count metadata for each element of key_paths up front. This way we could avoid reading unnecessary files/rows altogether.

I think this makes sense. I'm not too familiar with the parquet code/the format in general, but do we just call read_parquet_metadata on each file to get the row counts? Then we iterate through partitions until we satisfy nrows/skiprows.
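
A rough sketch of that up-front pass (ignoring the filter case discussed just below), assuming pyarrow remains available at the Python level and using its footer metadata for the per-file row counts; the function name and return shape here are illustrative, not the actual implementation:

import pyarrow.parquet as pq

def plan_partition_reads(key_paths, skip_rows, nrows):
    # Illustrative only: work out, per file, how many rows to skip and read,
    # so files that fall entirely inside the skipped region are never read.
    plan = []
    for path in key_paths:
        rows_in_file = pq.ParquetFile(path).metadata.num_rows
        if skip_rows >= rows_in_file:
            skip_rows -= rows_in_file  # this whole file is skipped
            continue
        available = rows_in_file - skip_rows
        take = available if nrows is None else min(nrows, available)
        plan.append((path, skip_rows, take))
        skip_rows = 0
        if nrows is not None:
            nrows -= take
            if nrows == 0:
                break
    return plan  # list of (path, per-file skip_rows, per-file nrows)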

Of course, if the user passes in a filter, we would need to read all data with the filter and perform the row-trimming after the fact.

I think it might be possible to optimize further since row group row counts should be available for us.
(but we'd still have to do filtering post read)

At any rate, I think for now it's probably best to punt on this since we don't need this for anything at the moment.
(We can revisit if it turns out e.g. polars/someone else needs this)

At any rate, I think I might punt on this for now (since it doesn't look like making nrows/skiprows work for a partitioned dataset is high priority), and the rest of this PR is blocking nrows/skiprows support in the polars executor.

which we can't use anymore after pyarrow is gone

My understanding is that the pyarrow-removal effort only applies to the cython/c++ level. We are still allowed to depend on pyarrow at the python level (removing pyarrow would be a nightmare for dask-cudf at this point).

👍

Member:

At any rate, I think I might punt on this for now (since it doesn't look like making nrows/skiprows work for a partitioned dataset is high priority), and the rest of this PR is blocking nrows/skiprows support in the polars executor.

Yes - I don't see any reason to spend time optimizing nrows/skiprows for partitioned datasets. We can always revisit if necessary.

@mhaseeb123 (Member) commented Jul 11, 2024:

Hi @lithomas1, thank you for working on this PR. If it's not too much overhead, do you mind adding num_rows and skip_rows to the ParquetReader as well and closing #16249?

@lithomas1 (Contributor Author):

Hi @lithomas1, thank you for working on this PR. If it's not too much overhead, do you mind adding num_rows and skip_rows to the ParquetReader as well and closing #16249?

Sure, will give it a shot.

@lithomas1 (Contributor Author):

@mhaseeb123

I'm hitting a bug in the chunked parquet reader for skip_rows > 0, so I don't think I can make further progress on this.
#16273

@mhaseeb123 (Member) commented Jul 12, 2024:

@mhaseeb123

I'm hitting a bug in the chunked parquet reader for skip_rows > 0, so I don't think I can make further progress on this. #16273

Thanks for working on this. If you are hitting #16186, then please don't discard your changes. The bug should go away once #16195 merges. If you like, you can pull in the changes from it and retest on your end, but we can also wait until it merges to go ahead with this.

@lithomas1 (Contributor Author):

Thanks for working on this. If you are hitting #16186, then please don't discard your changes. The bug should go away once #16195 merges. If you like, you can pull in the changes from it and retest on your end, but we can also wait until it merges to go ahead with this.

Thanks for the quick fix!

(I can't believe I missed that PR. The auto assigner assigned me to review that one too 😅 )

I currently have my changes stashed away in another branch, and I'll wait for your PR to land to merge that branch here.

@github-actions github-actions bot removed the pylibcudf Issues specific to the pylibcudf package label Jul 22, 2024
@vyasr (Contributor) commented Jul 25, 2024:

Is this PR blocked on resolving #16186 or is there a partial version that we want to land in the interim before that issue is fully resolved?

@mhaseeb123 (Member):

Is this PR blocked on resolving #16186 or is there a partial version that we want to land in the interim before that issue is fully resolved?

Not blocked. We can merge this without adding bindings for num_rows and skip_rows to the chunked PQ reader.

@mhaseeb123 (Member) left a review comment:

Just minor changes needed and should be good to go!

@@ -362,6 +376,8 @@ cpdef read_parquet(filepaths_or_buffers, columns=None, row_groups=None,
filters,
convert_strings_to_categories = False,
use_pandas_metadata = use_pandas_metadata,
skip_rows = skip_rows,
num_rows = nrows,
@mhaseeb123 (Member) Jul 25, 2024:

Let's be consistent with either num_rows or nrows across the files. @galipremsagar I can't find the same option in pyarrow.read_table or pd.read_parquet, so I am not sure what should be preferred here. If it's arbitrary, my vote would be num_rows, to be consistent with the C++ counterpart, but it's not a blocker.

Contributor Author (@lithomas1):

Yeah not sure which is better.

nrows would be consistent with read_csv, and num_rows would be consistent with libcudf.

@mhaseeb123 (Member) Jul 25, 2024:

Let's go with nrows then and move the PR forward to merge! 🙂

@lithomas1 lithomas1 changed the base branch from branch-24.08 to branch-24.10 July 25, 2024 20:08
@mhaseeb123 (Member) left a review comment:

As per the conversation, please use nrows consistently on the Python side except when passing to libcudf. Looks good otherwise!

@github-actions github-actions bot added the pylibcudf Issues specific to the pylibcudf package label Jul 29, 2024
@mhaseeb123 (Member):

Some tests are still failing with the following log. It looks like some more nrows/num_rows replacements are needed.

FAILED io/test_parquet.py::test_read_parquet_basic[0-binary_source_or_sink1-nrows_skiprows0-columns1] - TypeError: read_parquet() got an unexpected keyword argument 'num_rows'

@mhaseeb123 (Member):

Note that nrows only needs to be used on the Python side (read_parquet()); the Cython layer may keep using num_rows if that is in line with the other readers.
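
In other words, the layering being suggested is roughly the following; both signatures are illustrative stand-ins, not the actual cuDF code:

def _cython_read_parquet(filepaths, skip_rows=0, num_rows=None):
    # Stand-in for the Cython/libcudf-facing reader, which keeps the num_rows name.
    raise NotImplementedError

def read_parquet(filepaths, nrows=None, skip_rows=0):
    # Python-facing reader: expose nrows (consistent with read_csv) and translate
    # it to num_rows only at the lower-layer boundary.
    return _cython_read_parquet(filepaths, skip_rows=skip_rows, num_rows=nrows)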

@wence- (Contributor) left a review comment:

(Will need #16442 in to get past the cudf-polars test failures)

Resolved review threads: python/cudf/cudf/tests/test_parquet.py (outdated, 2 threads); python/cudf/cudf/utils/ioutils.py
@lithomas1 (Contributor Author):

/merge

@github-actions github-actions bot added the cudf.polars Issues specific to cudf.polars label Aug 1, 2024
@rapids-bot rapids-bot bot merged commit 9d0c57a into rapidsai:branch-24.10 Aug 1, 2024
81 checks passed
@lithomas1 lithomas1 deleted the parquet-nrows branch August 1, 2024 17:59
Labels
cudf.polars (Issues specific to cudf.polars), feature request (New feature or request), non-breaking (Non-breaking change), pylibcudf (Issues specific to the pylibcudf package), Python (Affects Python cuDF API)
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FEA] Add python bindings in the parquet reader for num_rows/skiprows
6 participants