[C++] Support order by derived column #31982

asfimport · 2022-05-23T15:04:40Z

You can't do the equivalent of

SELECT x, y FROM table ORDER BY x * y ASC;

because sorting requires a named column AND because sorting is only done in a SinkNode. You can project to {x, y, x*y} then sort on x*y, but you can't then project back to {x, y} on the sorted data because that's a new ExecPlan and order is not preserved. In R we have to handle this outside of an ExecPlan.
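For illustration only, the semantics being asked for can be sketched in plain standard C++, with no Arrow APIs involved: compute the derived key, sort row indices by it, then gather only the original columns so the key never appears in the output. The `Table` struct and the `OrderByProduct` function are hypothetical names made up for this sketch. They are not part of Acero, and this is not how an ExecPlan node would be implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical two-column table used only for this sketch (not an Arrow type).
struct Table {
  std::vector<std::int64_t> x;
  std::vector<std::int64_t> y;
};

// Returns the rows of `in` ordered by x * y ascending.
// The output contains only x and y; the derived key is dropped.
Table OrderByProduct(const Table& in) {
  assert(in.x.size() == in.y.size());
  const std::size_t n = in.x.size();

  // Step 1: "project" the derived sort key x * y.
  std::vector<std::int64_t> key(n);
  for (std::size_t i = 0; i < n; ++i) {
    key[i] = in.x[i] * in.y[i];
  }

  // Step 2: sort row indices by the key (stable, so ties keep input order).
  std::vector<std::size_t> idx(n);
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::stable_sort(idx.begin(), idx.end(),
                   [&key](std::size_t a, std::size_t b) { return key[a] < key[b]; });

  // Step 3: "project back" to {x, y} by gathering rows in sorted order.
  Table out;
  out.x.reserve(n);
  out.y.reserve(n);
  for (std::size_t i : idx) {
    out.x.push_back(in.x[i]);
    out.y.push_back(in.y[i]);
  }
  return out;
}
```

The point of the sketch is that step 3 has to see the ordering produced by step 2. That is exactly what is lost when the final projection runs in a separate ExecPlan.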

Reporter: Neal Richardson / @nealrichardson

_Note: This issue was originally created as ARROW-16631. Please see the migration documentation for further details._

### Rationale for this change

See also apache#32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* apache#34941: Taking head/tail on unordered data, which has non-deterministic results but should still be possible when the user just wants to see a slice of the result, any slice
* apache#34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simplify further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* everything we do in a query should be represented in an ExecPlan.

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative.

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear whether performance will change; running benchmarks would be good, but it's also not clear that our benchmarks cover all affected scenarios.

* Closes: apache#34437
* Closes: apache#31980
* Closes: apache#31982

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

nealrichardson mentioned this issue Mar 22, 2023

GH-34437: [R] Use FetchNode and OrderByNode #34685

Merged

thisisnic closed this as completed in #34685 Apr 11, 2023

thisisnic closed this as completed in 47a602d Apr 11, 2023
