GH-36778: [C++][Parquet] Add ReadRowGroupAsync/ReadRowGroupsAsync #36779

westonpace · 2023-07-19T22:30:05Z

Rationale for this change

The rationale is described in #36778

What changes are included in this PR?

New methods are added to parquet::arrow::FileReader which read the file asynchronously and respect the batch size property. In addition, these new methods are a bit simpler than GetRecordBatchGenerator as they are able to reuse a lot of the code in the synchronous methods.

Are these changes tested?

Yes, I've added new unit tests.

Are there any user-facing changes?

Yes, there are new methods available. There should be no breaking changes to any existing methods.

Closes: [C++][Parquet] Add an asynchronous version of ReadRowGroup/ReadRowGroups #36778

github-actions · 2023-07-19T22:30:33Z

⚠️ GitHub issue #36778 has been automatically assigned in GitHub to PR creator.

mapleFU

The rest looks OK to me.

cpp/src/parquet/arrow/reader.h

cpp/src/parquet/arrow/reader.cc

mapleFU · 2023-07-20T05:48:47Z

Is the logic for concatenating multiple batches from a row group implemented, or will it be implemented in the future?

cpp/src/parquet/arrow/reader.h

wgtmac · 2023-07-20T06:21:34Z

cpp/src/parquet/arrow/reader.cc

+    // do provide a batch size but even for a small batch size it is possible that a
+    // column has extremely large strings which don't fit in a single batch.
+    Future<std::vector<std::shared_ptr<ChunkedArray>>> chunked_arrays_fut =
+        ::arrow::internal::OptionalParallelForAsync(


The cpu_executor is not used here.

Nice catch, here it uses internal::GetCpuThreadPool()

@westonpace Do you want to fix this?

Yes, thank you. I have now fixed this in 5e38b00

cpp/src/parquet/arrow/arrow_reader_writer_test.cc

cpp/src/parquet/arrow/reader.cc

wgtmac · 2023-07-20T06:59:21Z

Is the logic for concatenating multiple batches from a row group implemented, or will it be implemented in the future?

I think the ColumnReader already supports this internally.

mapleFU · 2023-07-20T07:06:24Z

I think the ColumnReader already supports this internally.

Oh I see, ColumnReaderImpl::LoadBatch would load multiple batches. It tries its best to read the expected number of rows. Thanks

cpp/src/parquet/arrow/reader.cc

cpp/src/parquet/arrow/reader.h

westonpace · 2023-07-27T21:57:39Z

@wgtmac / @mapleFU I believe I have addressed your concerns and this is ready for another round of review. The CI failures appear to be unrelated.

pitrou · 2023-08-14T16:31:43Z

cpp/src/parquet/arrow/reader.h

+  /// \param allow_sliced_batches if false, an error is raised if a batch has too much
+  ///                             data for the given batch size.  If true, smaller
+  ///                             batches will be returned instead.
+  virtual AsyncBatchGenerator ReadRowGroupAsync(int i,


Is it necessary to expose ReadRowGroupAsync in addition to ReadRowGroupsAsync? One is a trivial call to the other...

I was attempting to maintain parity with the synchronous methods above. I only need one of these four methods and so if you'd prefer I'm happy to scope this down.

At least the single-row group ReadRowGroupAsync is a trivial redirect to the several-row groups variant, so removing those two would be fine.

Sounds good. I've removed the single-row variants.

pitrou · 2023-08-14T16:33:23Z

cpp/src/parquet/arrow/reader.h

+  /// \param row_groups indices of the row groups to read
+  /// \param cpu_executor an executor to use to run CPU tasks
+  /// \param allow_sliced_batches if false, an error is raised if a batch has too much
+  ///                             data for the given batch size.  If true, smaller


I don't understand what "a batch has too much data for the given batch size" means exactly. Do you mean "a row group has too much data for the given batch size"?

Also, why is it false by default? It seems allowing it should be the default behaviour.

I don't have strong feelings on this default. I will switch it.

I have switched the default to true
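
To illustrate the allow_sliced_batches semantics discussed in this thread, here is a toy sketch in plain C++. It is not the reader's code: BatchLengthsForRead and chunk_lengths are hypothetical names, and the sketch assumes that data which is too large for one batch shows up as more than one chunk for a single requested batch.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy model, not Arrow's implementation. `chunk_lengths` holds the row counts
// of the chunks that one column read produced for a single requested batch.
// More than one chunk means the data did not fit in a single batch.
std::vector<int64_t> BatchLengthsForRead(const std::vector<int64_t>& chunk_lengths,
                                         bool allow_sliced_batches) {
  assert(!chunk_lengths.empty());
  if (chunk_lengths.size() > 1 && !allow_sliced_batches) {
    // Mirrors the error path quoted in the diff: slicing is not allowed.
    throw std::invalid_argument("data was too large to fit in a single batch");
  }
  // With slicing allowed, each chunk becomes its own (smaller) batch.
  return chunk_lengths;
}
```

With the default switched to true, the oversized case yields several smaller batches instead of an error.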

pitrou · 2023-08-14T16:36:01Z

cpp/src/parquet/arrow/reader.h

+  ///
+  /// Note: When reading multiple row groups there is no guarantee you will get one
+  /// record batch per row group.  Data from multiple row groups could get combined into
+  /// a single batch.


Interesting, and I agree it's probably desirable. Is it a deviation from other APIs?

No. This is the same way that the synchronous APIs behave.
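
A small sketch of why batch boundaries need not line up with row-group boundaries. This is a toy model in plain C++, not the reader's code: PlanBatchLengths is a hypothetical helper, and it assumes the selected row groups are consumed as one continuous stream of rows.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model, not Arrow's implementation: treat the selected row groups as one
// continuous stream of rows and cut that stream into batches of `batch_size`.
std::vector<int64_t> PlanBatchLengths(const std::vector<int64_t>& row_group_lengths,
                                      int64_t batch_size) {
  assert(batch_size > 0);
  int64_t remaining_rows = 0;
  for (int64_t length : row_group_lengths) remaining_rows += length;

  std::vector<int64_t> batch_lengths;
  while (remaining_rows > 0) {
    const int64_t next = remaining_rows < batch_size ? remaining_rows : batch_size;
    batch_lengths.push_back(next);
    remaining_rows -= next;
  }
  return batch_lengths;
}
```

Under this model, row groups of 100 and 50 rows with a batch size of 64 give batches of 64, 64 and 22 rows, and the second batch mixes rows from both row groups.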

pitrou · 2023-08-14T16:43:43Z

cpp/src/parquet/arrow/reader.cc

+              std::shared_ptr<ChunkedArray> chunked_array;
+              ARROW_RETURN_NOT_OK(
+                  column_reader->NextBatch(rows_in_batch, &chunked_array));
+              return chunked_array;


Not necessary for this PR, but we'd probably like a Result returning variant of NextBatch.

I added a result-returning variant.

pitrou · 2023-08-14T16:46:41Z

cpp/src/parquet/arrow/reader.cc

+                    "The setting allow_sliced_batches is set to false and data was "
+                    "encountered that was too large to fit in a single batch.");
+              }
+              state->overflow.push(std::move(next_batch));


I suppose this makes the generator not async-reentrant, since operator() might be called from one thread while this callback runs on another thread?

Good point. I've added some lines to the method doc to make this explicit. There isn't much point in trying to make this async-reentrant since that would require parallel reading of a row group and we don't support that. It might, in theory, be possible, but I think most users get enough parallelism from multiple files / multiple row groups.

cpp/src/parquet/arrow/arrow_reader_writer_test.cc

pitrou · 2023-08-14T16:54:43Z

cpp/src/parquet/arrow/arrow_reader_writer_test.cc

+  }
+
+  // Eagerly free up memory
+  value.clear();


clear unfortunately leaves the vector capacity unchanged. Perhaps value = {} would work...

I put value in its own scope block.
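
For reference, a standalone sketch of the capacity point raised here. The function names are made up for illustration and the code is unrelated to the actual test; it only contrasts clear() with two ways of actually giving the buffer back.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// clear() destroys the elements but keeps the allocation, so the capacity
// that is returned here is still at least `n`.
std::size_t CapacityAfterClear(std::size_t n) {
  std::vector<std::string> value(n, "x");
  value.clear();
  assert(value.empty());
  return value.capacity();
}

// Swapping with an empty temporary hands the old buffer to the temporary,
// which releases it at the end of the statement.
std::size_t CapacityAfterSwapWithEmpty(std::size_t n) {
  std::vector<std::string> value(n, "x");
  std::vector<std::string>().swap(value);
  return value.capacity();
}

// The approach taken in this thread: give the vector its own scope block so
// that it is destroyed, buffer included, at the closing brace.
std::size_t SumLengthsInScope(std::size_t n) {
  std::size_t total = 0;
  {
    std::vector<std::string> value(n, "x");
    for (const std::string& s : value) total += s.size();
  }  // `value` and its allocation are gone here
  return total;
}
```

The scope block has the advantage of not relying on how any particular assignment or swap treats capacity.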

pitrou · 2023-08-18T19:42:38Z

@wgtmac Would you like to review this again?

wgtmac

Just some minor comments.

wgtmac · 2023-08-21T02:32:56Z

cpp/src/parquet/arrow/reader.h

@@ -249,6 +250,63 @@ class PARQUET_EXPORT FileReader {
  virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                        std::shared_ptr<::arrow::Table>* out) = 0;

+  using AsyncBatchGenerator =


cpp/src/arrow/util/async_generator_fwd.h has defined AsyncGenerator below, should we reuse it?

template <typename T> using AsyncGenerator = std::function<Future<T>()>;

cpp/src/parquet/arrow/reader.cc

wgtmac · 2023-08-21T03:39:37Z

cpp/src/parquet/arrow/reader.h

+  /// \param allow_sliced_batches if false, an error is raised if a batch has too much
+  ///                             data for the given batch size.  If true, smaller
+  ///                             batches will be returned instead.
+  virtual AsyncBatchGenerator ReadRowGroupsAsync(


It does not return arrow::Result, so it would be good for the docstring to describe the expected return value, especially when allow_sliced_batches is false.

Wouldn't a Future in Arrow already include a Status, so that it effectively contains an arrow::Result?

I will add something here.

Wouldn't a Future in Arrow already include a Status?

Yes. A Future in Arrow carries an implicit Status, so a type like Future<Result<T>> is generally invalid. There are a few places where we do return Result<Future<T>>. This typically indicates that something can fail quickly in the synchronous part (e.g. the thread pool is already shut down), while any failure in the asynchronous portion is communicated through the Future. However, I generally still prefer to just return Future<T> in these cases (there is a utility, DeferNotOk, which converts a Result<Future<T>> to a Future<T>).
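
The relationship between these types can be sketched with a toy model. Everything in the `toy` namespace below is a simplified stand-in written for this illustration (the toy Future is always already finished); it is not Arrow's Status, Result, Future or DeferNotOk, only the same idea in miniature.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

namespace toy {

// Simplified stand-in for a status: ok flag plus a message.
struct Status {
  bool ok = true;
  std::string message;
};

// Holds either a value or an error status.
template <typename T>
class Result {
 public:
  Result(T value) : storage_(std::move(value)) {}
  Result(Status error) : storage_(std::move(error)) {
    assert(!std::get<Status>(storage_).ok);  // a Result built from a Status must be an error
  }
  bool ok() const { return std::holds_alternative<T>(storage_); }
  const T& value() const {
    assert(ok());
    return std::get<T>(storage_);
  }
  Status status() const { return ok() ? Status{} : std::get<Status>(storage_); }

 private:
  std::variant<T, Status> storage_;
};

// Toy future that is always finished. It stores a Result<T>, so the error
// status travels inside the future and Future<Result<T>> would be redundant.
template <typename T>
class Future {
 public:
  static Future MakeFinished(Result<T> result) { return Future(std::move(result)); }
  const Result<T>& result() const { return result_; }
  Status status() const { return result_.status(); }

 private:
  explicit Future(Result<T> result) : result_(std::move(result)) {}
  Result<T> result_;
};

// Folds a synchronous failure into the future, so callers only handle Future<T>.
template <typename T>
Future<T> DeferNotOk(Result<Future<T>> maybe_future) {
  if (!maybe_future.ok()) return Future<T>::MakeFinished(maybe_future.status());
  return maybe_future.value();
}

}  // namespace toy
```

In this model a caller checks one place, the future's status, regardless of whether the failure happened before or after the work was scheduled.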

wgtmac · 2023-08-21T03:41:07Z

cpp/src/parquet/arrow/reader.h

@@ -316,6 +374,14 @@ class PARQUET_EXPORT ColumnReader {
  // the data available in the file.
  virtual ::arrow::Status NextBatch(int64_t batch_size,
                                    std::shared_ptr<::arrow::ChunkedArray>* out) = 0;
+
+  // Overload of NextBatch that returns a Result
+  virtual ::arrow::Result<std::shared_ptr<::arrow::ChunkedArray>> NextBatch(


Is it required to expose this?

I'll remove it.

wgtmac · 2023-08-21T04:57:40Z

cpp/src/parquet/arrow/arrow_reader_writer_test.cc

+    for (std::vector<int> columns : std::vector<std::vector<int>>{{}, {0}, {0, 1}}) {
+      ARROW_SCOPED_TRACE("# columns = ", columns.size());
+
+      for (int row_group_size : {128, 512, 1024, 2048}) {


nit: row_group_size -> batch_size

mapleFU

Rest LGTM!

mapleFU · 2023-08-21T05:17:24Z

cpp/src/parquet/arrow/reader.h

+  /// \param allow_sliced_batches if false, an error is raised if a batch has too much
+  ///                             data for the given batch size.  If true, smaller
+  ///                             batches will be returned instead.
+  virtual AsyncBatchGenerator ReadRowGroupsAsync(


Wouldn't a Future in Arrow already include a Status, so that it effectively contains an arrow::Result?

mapleFU · 2023-08-21T05:20:50Z

cpp/src/parquet/arrow/reader.cc

+      int64_t batch_size, ::arrow::internal::Executor* io_executor,
+      ::arrow::internal::Executor* cpu_executor) final {
+    Future<> load_fut = ::arrow::DeferNotOk(
+        io_executor->Submit([this, batch_size] { return LoadBatch(batch_size); }));


Would it matter that io_executor would consume some CPU?

LeafReader::LoadBatch might decompress pages, decode records, and parse definition/repetition levels. These are all CPU-intensive.

Good observation. In a perfect world we would do all of that on a CPU thread. This would help to keep context switches to a minimum. The only work that would happen on the I/O thread would be the RandomAccessFile call to read the file. However, that requires pushing async further into the parquet code base which would be a lot of work. It's not clear that the benefit would be significant enough to require the work.

Yes, I saw that in the comment; I'm just saying that it might hurt. Perfect I/O could be quite tricky and would need to wrap RangeCache and RandomAccessFile. This looks good to me.

Another point: it seems that LoadBatch only depends on the synchronous I/O API? So it would not cause a deadlock while LoadBatch is waiting for page I/O (which could happen when use_threads is combined with the previous scanner)?

mapleFU · 2023-08-21T05:22:10Z

cpp/src/parquet/arrow/reader.cc

+                std::shared_ptr<ChunkedArray> out;
+                RETURN_NOT_OK(BuildArray(batch_size, &out));
+                for (int x = 0; x < out->num_chunks(); x++) {
+                  RETURN_NOT_OK(out->chunk(x)->Validate());


(This is not related to this patch, but I want to ask: why is it necessary to Validate it here?)

I'm only guessing as I didn't write the original implementation but my guess is that, for security reasons, it is almost always required to validate because a malicious user could otherwise craft a parquet file that triggers buffer overflow. For example, they could store a list array where one of the offsets is way out of range.
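
As an illustration of the kind of check being described, here is a minimal standalone sketch of validating list offsets before they are used to index into child values. OffsetsAreValid is a hypothetical helper written for this example, not Arrow's Validate(), and it assumes 32-bit offsets and at least one offset entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy check, not Arrow's Validate(): a list array with n lists stores n + 1
// offsets into its child values. Offsets read from an untrusted file must be
// non-negative, non-decreasing, and must not point past the child values.
bool OffsetsAreValid(const std::vector<int32_t>& offsets, int64_t values_length) {
  assert(values_length >= 0);
  if (offsets.empty()) return false;  // this sketch requires at least one offset
  if (offsets.front() < 0) return false;
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    if (offsets[i] < offsets[i - 1]) return false;
  }
  // Because the offsets are non-decreasing, checking the last one is enough.
  return offsets.back() <= values_length;
}
```

An offset of 1000 into five child values is rejected here, which is the out-of-range case mentioned above.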

mapleFU · 2023-08-21T05:24:20Z

cpp/src/parquet/arrow/reader.cc

+              break;
+            }
+            if (first) {
+              if (!state->allow_sliced_batches) {


Should we add this in document?

I think this parameter (allow_sliced_batches) is documented. Am I misunderstanding?

Nope, that's my fault. The document looks good to me

mapleFU

Thanks!

mapleFU · 2023-08-24T19:04:50Z

cpp/src/parquet/arrow/reader.cc

+  auto generator_state = std::make_shared<AsyncBatchGeneratorState>();
+  generator_state->io_executor = reader_properties_.io_context().executor();
+  generator_state->cpu_executor = cpu_executor;
+  generator_state->use_threads = reader_properties_.use_threads();


One final question. It's not related to correctness, but it is a bit confusing. I've seen this comment:

/// This method ignores the use_threads property of the ArrowReaderProperties. It will
/// always create a task for each column. To run without threads you should use a
/// serial executor as the CPU executor.

So why is use_threads introduced here?

westonpace requested a review from wgtmac as a code owner July 19, 2023 22:30

westonpace mentioned this pull request Jul 19, 2023

GH-32566: [C++] Connect parquet to the new scan node #35889

Closed

github-actions bot added the Component: Parquet, Component: C++, and awaiting committer review labels Jul 19, 2023

westonpace requested a review from mapleFU July 20, 2023 05:10

mapleFU reviewed Jul 20, 2023

cpp/src/parquet/arrow/reader.h (outdated)

cpp/src/parquet/arrow/reader.h (outdated)

cpp/src/parquet/arrow/reader.h

cpp/src/parquet/arrow/reader.cc (outdated)

wgtmac reviewed Jul 20, 2023

github-actions bot added the awaiting changes label and removed the awaiting committer review label Jul 20, 2023

mapleFU reviewed Jul 20, 2023

cpp/src/parquet/arrow/reader.cc

cpp/src/parquet/arrow/reader.h

github-actions bot added the awaiting change review label and removed the awaiting changes label Jul 26, 2023

westonpace force-pushed the feature/parquet-read-row-groups-async branch from 7625dab to 5e38b00 Compare July 27, 2023 20:37

github-actions bot added the awaiting changes label and removed the awaiting change review label Jul 27, 2023

mapleFU approved these changes Jul 28, 2023

wgtmac approved these changes Jul 28, 2023

pitrou reviewed Aug 14, 2023

westonpace and others added 5 commits August 16, 2023 10:37

Adding a new asynchronous read method for parquet

a45ec0a

Added unit tests. Cleaned up comments

51b7e79

Apply suggestions from code review

17459b4

Co-authored-by: mwish <[email protected]> Co-authored-by: Gang Wu <[email protected]>

Addressed concerns from code review. Moved large string test into a t…

90f9578

…est that tested a few strings with very large values (instead of the existing test which tests many many small values) as this is the situation that actually triggers the error.

Addressing review comments

7d78ee9

westonpace force-pushed the feature/parquet-read-row-groups-async branch from 5e38b00 to 7d78ee9 Compare August 16, 2023 18:08

github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Aug 16, 2023

Realized the current implementation was actually entirely running on …

68e7b40

…the CPU thread pool. Updated to use the I/O thread pool for I/O

github-actions bot added the awaiting change review label and removed the awaiting changes label Aug 16, 2023

westonpace requested a review from pitrou August 18, 2023 14:34

wgtmac reviewed Aug 21, 2023

mapleFU reviewed Aug 21, 2023

Expand docs a little to make it more clear that use_threads is ignored

6b34304

github-actions bot added the awaiting changes label and removed the awaiting change review label Aug 24, 2023

Update cpp/src/parquet/arrow/reader.cc

5d2d9e9

Co-authored-by: Gang Wu <[email protected]>

github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Aug 24, 2023

mapleFU approved these changes Aug 24, 2023

mapleFU reviewed Aug 24, 2023

westonpace closed this Apr 16, 2024
