Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Support order by derived column #31982

Closed
asfimport opened this issue May 23, 2022 · 0 comments · Fixed by #34685
Closed

[C++] Support order by derived column #31982

asfimport opened this issue May 23, 2022 · 0 comments · Fixed by #34685

Comments

@asfimport
Copy link
Collaborator

You can't do the equivalent of

SELECT x, y FROM table ORDER BY x * y ASC;

because sorting requires a named column AND because sorting is only done in a SinkNode. You can project to {x, y, x*y} then sort on x*y, but you can't then project back to {x, y} on the sorted data because that's a new ExecPlan and order is not preserved. In R we have to handle this outside of an ExecPlan.

Reporter: Neal Richardson / @nealrichardson

Note: This issue was originally created as ARROW-16631. Please see the migration documentation for further details.

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
### Rationale for this change

See also apache#32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* apache#34941: Taking head/tail on unordered data, which has non-deterministic results but that should be possible, in the case where the user wants to see a slice of the result, any slice
* apache#34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simply further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* 
everything we do in a query should be represented in an ExecPlan. 

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative. 

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios. 

* Closes: apache#34437
* Closes: apache#31980
* Closes: apache#31982

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
### Rationale for this change

See also apache#32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* apache#34941: Taking head/tail on unordered data, which has non-deterministic results but that should be possible, in the case where the user wants to see a slice of the result, any slice
* apache#34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simply further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* 
everything we do in a query should be represented in an ExecPlan. 

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative. 

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios. 

* Closes: apache#34437
* Closes: apache#31980
* Closes: apache#31982

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
### Rationale for this change

See also apache#32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* apache#34941: Taking head/tail on unordered data, which has non-deterministic results but that should be possible, in the case where the user wants to see a slice of the result, any slice
* apache#34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simply further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* 
everything we do in a query should be represented in an ExecPlan. 

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative. 

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios. 

* Closes: apache#34437
* Closes: apache#31980
* Closes: apache#31982

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant