[C++] Support limit operation #31980

asfimport · 2022-05-23T14:52:22Z

Either an option to a SinkNode (TopK already takes a number of results to keep) or a streaming LimitNode that only lets N rows through.
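To make the second option concrete, here is a minimal sketch of what "a streaming LimitNode that only lets N rows through" could look like. It does not use the Arrow C++ API: `Batch` and `StreamingLimit` are hypothetical names invented for illustration, and a batch is modelled as a plain vector of integers rather than an `ExecBatch`.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for a record batch: a single column of row values.
using Batch = std::vector<int>;

// Hypothetical streaming limit: forwards rows until `limit` rows have passed.
class StreamingLimit {
 public:
  explicit StreamingLimit(std::size_t limit) : remaining_(limit) {}

  // Returns the (possibly truncated) batch to forward downstream,
  // or std::nullopt once the limit has already been reached.
  std::optional<Batch> Push(const Batch& batch) {
    if (remaining_ == 0) return std::nullopt;
    const std::size_t take = std::min(remaining_, batch.size());
    remaining_ -= take;
    return Batch(batch.begin(),
                 batch.begin() + static_cast<std::ptrdiff_t>(take));
  }

  // True once N rows have been let through; a real node would presumably
  // use this point to tell upstream nodes to stop producing.
  bool finished() const { return remaining_ == 0; }

 private:
  std::size_t remaining_;
};
```

Note that without an ordering upstream, which N rows pass through depends on batch arrival order, so the result is only deterministic for ordered input.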

Reporter: Neal Richardson / @nealrichardson

_Note: This issue was originally created as ARROW-16628. Please see the migration documentation for further details._

### Rationale for this change

See also #32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* #34941: Taking head/tail on unordered data, which has non-deterministic results but should still be possible when the user just wants to see a slice of the result, any slice
* #34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simplify further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* everything we do in a query should be represented in an ExecPlan.

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative.

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios.

* Closes: #34437
* Closes: #31980
* Closes: #31982

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Nic Crane <[email protected]>


nealrichardson mentioned this issue Apr 6, 2023

GH-34437: [R] Use FetchNode and OrderByNode #34685

Merged

thisisnic closed this as completed in #34685 Apr 11, 2023
