[C++] Add ordering information to exec batches #32991

asfimport · 2022-09-16T23:27:38Z

I proposed this ages ago in https://lists.apache.org/thread/xpn9gyrs6kqc3g9t8k4ts8dmy7yyxskq and am finally getting around to implementing it.

I propose adding ordering information to exec nodes (mostly for node validation) and indices to exec batches. This is a fundamental step toward letting nodes consume this ordering information to implement features such as ARROW-10883 and ARROW-16628. It can also replace the complicated batch enumeration in the current scanner to support in-order table reassembly.
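To illustrate how a per-batch index could support in-order reassembly, here is a minimal sketch. It is not Acero code: `IndexedBatch` and `BatchSequencer` are hypothetical names invented for this example, and the string payload stands in for real column data. The sketch assumes indices start at 0 and are dense, so a batch is released only once every earlier index has been seen.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an exec batch that carries a sequence index.
struct IndexedBatch {
  int64_t index;        // position of this batch in the source's order
  std::string payload;  // placeholder for the batch's column data
};

// Buffers batches that arrive early and releases them in index order.
class BatchSequencer {
 public:
  // Accepts a batch in any arrival order and returns the batches that
  // can now be emitted, sorted by index (possibly none).
  std::vector<IndexedBatch> Insert(IndexedBatch batch) {
    const int64_t index = batch.index;
    assert(index >= next_index_ && "batch index was already emitted");
    pending_.emplace(index, std::move(batch));

    std::vector<IndexedBatch> ready;
    auto it = pending_.find(next_index_);
    while (it != pending_.end()) {
      ready.push_back(std::move(it->second));
      pending_.erase(it);
      ++next_index_;
      it = pending_.find(next_index_);
    }
    return ready;
  }

 private:
  int64_t next_index_ = 0;                   // next index allowed to leave
  std::map<int64_t, IndexedBatch> pending_;  // early arrivals, keyed by index
};
```

With this shape, a multi-threaded producer can deliver batches 2, 1, 0 in that order and the consumer still sees 0, 1, 2; the cost is that early batches are buffered until the gap is filled.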

Reporter: Weston Pace / @westonpace
Assignee: Weston Pace / @westonpace
Watchers: Rok Mihevc / @rok

Related issues:

[C++][Dataset] Preserve order when writing dataset (is required by)

PRs and other links:

GitHub Pull Request #14158

Note: This issue was originally created as ARROW-17762. Please see the migration documentation for further details.

westonpace · 2023-02-06T23:54:20Z

I plan to start working on this task. However, it is quite large. I plan on breaking it up into the following tasks:

* [C++] Create a fetch node based on a batch index property #34059: Add batch index to ExecNode and create a fetch node (this will serve as an example of how a batch index can work)
* [C++] Allow scanner to assert an ordering and/or support implicit ordering #34698: Add batch index to the scanner
* [C++] Add the concept of "ordering" to an exec node, reject non-sensible plans #34136
* [C++] Add the concept of "ordering" to an exec node, reject non-sensible plans #34136: Update existing nodes to respect ordering and batch index (at this point "fetch" will be usable in larger plans)
* GH-32763: [C++] Add FromProto for fetch & sort #34651: Add bindings for fetch to the Substrait consumer
* GH-34248: [C++] Add an order_by node #34249: Add an order-by node which can modify an ordering
* Deprecate old DeclarationToXyz methods
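As a rough illustration of why a fetch (offset/limit) operation needs batches to be ordered, the sketch below computes which rows to keep from each batch once the batches are known in index order. This is an assumption-laden example, not the Acero fetch node: `BatchSlice` and `PlanFetch` are hypothetical names, and the function takes all row counts up front rather than streaming them.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of the rows to keep from one batch.
struct BatchSlice {
  int64_t batch_index;  // which batch the slice comes from
  int64_t start;        // first row to keep within that batch
  int64_t length;       // number of rows to keep
};

// Given the row counts of batches listed in index order, compute the
// slices that implement "skip `offset` rows, then take `limit` rows".
std::vector<BatchSlice> PlanFetch(const std::vector<int64_t>& rows_per_batch,
                                  int64_t offset, int64_t limit) {
  std::vector<BatchSlice> slices;
  int64_t to_skip = offset;  // rows still to be skipped
  int64_t to_take = limit;   // rows still to be emitted
  for (std::size_t i = 0; i < rows_per_batch.size() && to_take > 0; ++i) {
    const int64_t rows = rows_per_batch[i];
    const int64_t skipped = std::min(to_skip, rows);
    to_skip -= skipped;
    const int64_t taken = std::min(to_take, rows - skipped);
    if (taken > 0) {
      slices.push_back({static_cast<int64_t>(i), skipped, taken});
      to_take -= taken;
    }
  }
  return slices;
}
```

Without a batch index there is no well-defined "first `offset` rows", which is why the result of a fetch over unordered batches depends on arrival order.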

### Rationale for this change

See also #32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* #34941: Taking head/tail on unordered data, which has non-deterministic results but should still be possible for the case where the user just wants to see a slice of the result, any slice.
* #34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simplify further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* everything we do in a query should be represented in an ExecPlan.

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative.

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` with the `arrow.summarize.sort = TRUE` option set, the sorting is correctly printed.

It's unclear whether performance will change; running benchmarks would be good, but it's also not clear that our benchmarks cover all affected scenarios.

* Closes: #34437
* Closes: #31980
* Closes: #31982

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Nic Crane <[email protected]>


asfimport assigned westonpace Jan 11, 2023

asfimport mentioned this issue Jan 11, 2023

[C++][Dataset] Preserve order when writing dataset #26818

Open

westonpace mentioned this issue Feb 6, 2023

[C++] Create a fetch node based on a batch index property #34059

Closed

jorisvandenbossche mentioned this issue Feb 21, 2023

ARROW-17762: [C++] WIP: Add ordering information to Acero #14158

Closed

nealrichardson mentioned this issue Mar 3, 2023

[R] Compute lagged or leading values #34201

Open

nealrichardson mentioned this issue Mar 22, 2023

GH-34437: [R] Use FetchNode and OrderByNode #34685

Merged

thisisnic mentioned this issue Jul 24, 2023

[R] Any support for rolling windows functions? #36849

Open

gitmodimo mentioned this issue Oct 9, 2024

GH-41706: [C++][Acero] Enhance asof_join to work in multi-threaded execution by sequencing input #44083

Merged
