
GH-32566: [C++] Connect parquet to the new scan node #35889

Closed
wants to merge 13 commits

Conversation

@westonpace (Member) commented Jun 2, 2023

Rationale for this change

This ended up being considerably more of a change than just connecting parquet to the new scan node. In order to do this I had to refactor the scan node itself somewhat. It introduces the concept of scan tasks (or maybe scan streams would be a better name) to help clarify the concept of a row group (which I didn't have to worry about with CSV). I also introduced the staging area, which is a slightly different approach to sequencing that I think will be much simpler.

What changes are included in this PR?

The new scan node now supports the parquet format.

Are these changes tested?

Yes

Are there any user-facing changes?

There are breaking changes to the scan2 node but this feature hasn't really been released yet.

@westonpace (Member Author)

This is very much still a draft. There are still a lot of tests to add and some TODOs (column projection and row filtering) but I don't expect the overall structure to change too much if anyone wanted to take an early look.

@westonpace (Member Author)

On the bright side, I can now reach max parallelism with about 3GB of RAM, regardless of the size of row groups (and performance looks to be about 10% better, but it is still very early to say that).

… considerably. Now each fragment contains one or more scan tasks. Each scan task can yield a stream of batches. So, CSV, for example, is a single scan task that covers the entire file. Parquet, on the other hand, has a scan task per row group. This also makes explicit a lot of the logic that was implicit around sequencing and trying to figure out the correct batch index.
@mapleFU (Member) left a comment


So, would this prevent the use_threads deadlock?

@@ -16,10 +16,12 @@
// under the License.

#include "parquet/arrow/reader.h"
#include <sys/types.h>
Member


Would we really need this?

Member Author


No, thank you for noticing. I will clean this up soon.

}
if (!first) {
// TODO(weston): Test this case
return Status::Invalid("Unexpected empty row group");
Member


Hmmm, I guess a RowGroup can currently be empty. You can easily generate a case like this using Python's write_table:

  len(table) == 10000
  write_table(table, row_group_size=2000)

Member Author


Doesn't this create a table with 5 row groups? Why would this be an empty row group?

Member


  Status WriteTable(const Table& table, int64_t chunk_size) override {
    RETURN_NOT_OK(table.Validate());

    if (chunk_size <= 0 && table.num_rows() > 0) {
      return Status::Invalid("chunk size per row_group must be greater than 0");
    } else if (!table.schema()->Equals(*schema_, false)) {
      return Status::Invalid("table schema does not match this writer's. table:'",
                             table.schema()->ToString(), "' this:'", schema_->ToString(),
                             "'");
    } else if (chunk_size > this->properties().max_row_group_length()) {
      chunk_size = this->properties().max_row_group_length();
    }

    auto WriteRowGroup = [&](int64_t offset, int64_t size) {
      RETURN_NOT_OK(NewRowGroup(size));
      for (int i = 0; i < table.num_columns(); i++) {
        RETURN_NOT_OK(WriteColumnChunk(table.column(i), offset, size));
      }
      return Status::OK();
    };

    if (table.num_rows() == 0) {
      // Append a row group with 0 rows
      RETURN_NOT_OK_ELSE(WriteRowGroup(0, 0), PARQUET_IGNORE_NOT_OK(Close()));
      return Status::OK();
    }

    for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) {
      int64_t offset = chunk * chunk_size;
      RETURN_NOT_OK_ELSE(
          WriteRowGroup(offset, std::min(chunk_size, table.num_rows() - offset)),
          PARQUET_IGNORE_NOT_OK(Close()));
    }
    return Status::OK();
  }

It's from this code. It's easy to end up flushing a row group whose row count is 0.

Member Author


I manually created some empty row groups. It turns out this branch should be unreachable because, further up, we will have noticed that "remaining rows" is 0 and returned an end marker. I've updated this code and added a test case for this scenario in #36779.

Member


Okay, great!

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jul 18, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 18, 2023
@westonpace (Member Author)

So, would this prevent the use_threads deadlock?

Yes. Since this method is async, the caller can choose not to block. Previously we used GetRecordBatchGenerator, which is very similar to the method I ended up creating (ReadRowGroupsAsync). The difference is that GetRecordBatchGenerator ignores batch_size and ReadRowGroupsAsync does not. My hope is that, eventually, GetRecordBatchGenerator can be deprecated and removed.

@westonpace westonpace marked this pull request as ready for review July 18, 2023 23:16
@westonpace westonpace requested a review from wgtmac as a code owner July 18, 2023 23:16
@westonpace (Member Author)

I think the changes here are probably too extensive to expect review. I will be breaking this PR up into multiple PRs.

  • Adding a new ReadRowGroupsAsync to the file reader
  • Reworking the scan node
  • Adding parquet support

@westonpace westonpace marked this pull request as draft July 18, 2023 23:18
@westonpace westonpace removed the request for review from wgtmac July 18, 2023 23:18
@westonpace (Member Author)

westonpace commented Jul 19, 2023

@westonpace (Member Author)

The second PR is now available. Once the two pre-reqs merge I will undraft this.

@westonpace westonpace closed this Apr 16, 2024
Development

Successfully merging this pull request may close these issues.

[C++] Create fragment scanners for csv/parquet/orc/ipc
2 participants