ARROW-17762: [C++] WIP: Add ordering information to Acero #14158

westonpace · 2022-09-16T23:29:09Z

No description provided.

github-actions · 2022-09-16T23:29:27Z

https://issues.apache.org/jira/browse/ARROW-17762

github-actions · 2022-09-16T23:29:28Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

westonpace · 2022-09-19T12:08:39Z

@rtpsw FYI, I think you were interested in my plans for ordered execution. This PR is based on the earlier proposal I sent to the ML. I plan to create an example fetch node that consumes the ordering information to do something useful today.

This PR is built on top of ARROW-17287 and so it is a little easier to look at just the diff between the two:

westonpace/arrow@feature/ARROW-17287--initial-exec-plan-scan-node...westonpace:arrow:feature/ARROW-17762--add-ordering

rtpsw · 2022-09-20T17:04:50Z

Thanks, @westonpace. I'm interested, though I will need a couple of days to get to this.

westonpace · 2022-09-20T18:18:12Z

@rtpsw I've added an example of a FetchNode that consumes the ordering information. This will hopefully give some idea of how to use the exec batch index.

> Thanks, @westonpace. I'm interested, though I will need a couple of days to get to this.

There is no rush. I probably won't get back to this myself for a while as I need to get #13782 (and numerous follow-ups) merged in.

icexelloss · 2022-09-29T18:33:33Z

I care about this work very much as well and hope to understand it better. If I remember correctly, the high-level idea is that there are nodes that require ordering (e.g., asof join), and if the input batches are out of order (indicated by batch index), the consumer node will cache/reorder the out-of-order batches before processing them?

westonpace · 2022-10-03T14:11:53Z

> I care about this work very much as well and hope to understand it better. If I remember correctly, the high-level idea is that there are nodes that require ordering (e.g., asof join), and if the input batches are out of order (indicated by batch index), the consumer node will cache/reorder the out-of-order batches before processing them?

Yes. If a node relies on ordering then it will resequence the batches before processing them. I try to take care to distinguish "reorder" from "resequence", as they are two rather different problems.

The first problem is when the input has no known ordering or is in a completely random order. In that case we must "reorder" which is "not streaming" and a "pipeline breaker" and requires us to cache all data in memory (or spill) in order to assign the order.

The second problem is when the input is mostly ordered but might be a bit noisy due to something like a parallel scan. In that case we already have a sequence number and we assume the sequence number is, generally, within some max delta from the correct ordering. In that case we only need to resequence (not reorder). This operation is "mostly streaming" and only sometimes a "pipeline breaker".
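To make the resequencing case concrete, here is a minimal sketch that uses only the standard library. `IndexedBatch`, `Resequencer`, and `DeliveryOrder` are hypothetical names invented for this illustration and are not the types in this PR; the sketch assumes that indices start at 0 and that each index arrives exactly once. Batches that arrive early are buffered, and a batch is released only once every earlier index has been delivered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an exec batch tagged with a sequence number.
struct IndexedBatch {
  int64_t index;        // position of the batch in the intended order
  std::string payload;  // placeholder for the batch's data
};

// Buffers early batches and releases them strictly in index order.
class Resequencer {
 public:
  // Accepts one batch and returns every batch that is now deliverable.
  std::vector<IndexedBatch> Push(IndexedBatch batch) {
    const int64_t index = batch.index;
    // Assumption: an index is never repeated or already delivered.
    assert(index >= next_index_);
    pending_.emplace(index, std::move(batch));

    std::vector<IndexedBatch> ready;
    // Drain the buffer for as long as the next expected index is present.
    for (auto it = pending_.find(next_index_); it != pending_.end();
         it = pending_.find(next_index_)) {
      ready.push_back(std::move(it->second));
      pending_.erase(it);
      ++next_index_;
    }
    return ready;
  }

  // Number of batches currently held back waiting for an earlier index.
  std::size_t buffered() const { return pending_.size(); }

 private:
  int64_t next_index_ = 0;
  std::map<int64_t, IndexedBatch> pending_;
};

// Feeds batches in arrival order and records the order they are delivered in.
std::vector<int64_t> DeliveryOrder(const std::vector<int64_t>& arrivals) {
  Resequencer resequencer;
  std::vector<int64_t> delivered;
  for (int64_t index : arrivals) {
    for (const IndexedBatch& batch : resequencer.Push({index, "batch"})) {
      delivered.push_back(batch.index);
    }
  }
  return delivered;
}
```

With arrivals `2, 0, 1, 3` the buffer never holds more than one batch, which is the "mostly streaming" behavior described above. If index 0 never arrived, nothing would be delivered and the buffer would grow without bound, which is the point at which resequencing turns into a pipeline breaker.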

zifengyu · 2022-11-24T03:09:29Z

This feature is exactly what we need to adopt Acero. I tried to add ExecBatch ordering and implemented the limit operator in our product. Here is what we saw in the tests.

1. It seems a little difficult to finish the node (and notify the downstream node) because the input and output batch counts are not the same. In our case, the finish may happen either when the limit number of rows has been reached or when the upstream node has finished producing (without reaching the limit). The former occurs in the queue's deliver task, while the latter occurs in FetchNode's InputFinished. We did not find an easy way to sync these two components, so we moved the queue inside the node and added a counter to track sent rows.
2. We also need an offset setting to skip the first few rows in the limit operator. Can this be included in FetchNode so that we can switch back to the Acero node in the future?
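The row-counter approach with an offset can be sketched as follows. This is only an illustration under stated assumptions, not the FetchNode from this PR: `FetchState` and `FetchAll` are hypothetical names, a batch is modeled as a `std::vector<int>` of rows, and batches are assumed to arrive already sequenced. Keeping both counters in one place means the two finish conditions (limit reached, or input exhausted first) are decided by the same state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-node state for an offset + limit ("fetch") operator.
class FetchState {
 public:
  FetchState(int64_t offset, int64_t limit)
      : to_skip_(offset), to_take_(limit) {
    // Assumption: negative offsets or limits are not meaningful here.
    assert(offset >= 0 && limit >= 0);
  }

  // Consumes the next batch in sequence and returns the rows to emit.
  std::vector<int> Consume(const std::vector<int>& batch) {
    const int64_t size = static_cast<int64_t>(batch.size());
    // First use up whatever remains of the offset.
    const int64_t skip = std::min(to_skip_, size);
    to_skip_ -= skip;
    // Then take rows until the limit is exhausted.
    const int64_t take = std::min(to_take_, size - skip);
    to_take_ -= take;
    return std::vector<int>(batch.begin() + skip, batch.begin() + skip + take);
  }

  // True once the limit number of rows has been emitted.
  bool LimitReached() const { return to_take_ == 0; }

 private:
  int64_t to_skip_;  // rows still to be skipped for the offset
  int64_t to_take_;  // rows still to be emitted before the limit is hit
};

// Drives FetchState over already-sequenced batches. The loop ends either
// because the limit was reached or because the input ran out first.
std::vector<int> FetchAll(const std::vector<std::vector<int>>& batches,
                          int64_t offset, int64_t limit) {
  FetchState state(offset, limit);
  std::vector<int> out;
  for (const auto& batch : batches) {
    if (state.LimitReached()) break;  // finish early, skip remaining input
    const std::vector<int> slice = state.Consume(batch);
    out.insert(out.end(), slice.begin(), slice.end());
  }
  return out;
}
```

For example, with batches `{1, 2, 3}`, `{4, 5}`, `{6, 7, 8}`, an offset of 2, and a limit of 4, the emitted rows are `3, 4, 5, 6`, and the last two rows of the third batch are never emitted.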

Anyway, this proposal is critical to our using Acero. We are looking forward to its release.

jorisvandenbossche · 2023-02-21T20:54:15Z

This is being closed in favor of other PRs as listed in the issue #32991

github-actions bot added the Component: C++ label Sep 16, 2022

westonpace added 9 commits September 16, 2022 16:43

ARROW-17287: Initial creation of a "scan node" which doesn't use async generators to do scanning. Formally defined a new scan options and interfaces for schema evolution.

aad3927

ARROW-17287: Lint

d06043b

ARROW-17287: Updated scanner benchmark to reflect latest changes

5fe3479

ARROW-17287: Forgot to add a name to a lock_guard

c60d927

ARROW-17287: nullptr -> NULLPTR in header file

468f3a6

ARROW-17287: Trying to work around duplicate symbol errors

3d5b259

ARROW-17287: Cleanup after rebase

f281aab

ARROW-17287: Working around duplicate symbol errors from expression_internal.h

5f65bce

ARROW-17762: Add ordering information to exec nodes and add an index to exec batches

467f26f

westonpace force-pushed the feature/ARROW-17762--add-ordering branch from caf4a9f to 467f26f on September 19, 2022 12:03

westonpace added 2 commits September 20, 2022 10:30

ARROW-17762: Added a fetch node demonstrating how to use batch index information

aad5646

ARROW-17762: Documented the API

6410f51

asfimport mentioned this pull request Nov 24, 2022

[C++] Add ordering information to exec batches #32991

Open

westonpace closed this Feb 21, 2023
