Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add ordering information to exec batches #32991

Open
asfimport opened this issue Sep 16, 2022 · 1 comment
Open

[C++] Add ordering information to exec batches #32991

asfimport opened this issue Sep 16, 2022 · 1 comment

Comments

@asfimport
Copy link
Collaborator

asfimport commented Sep 16, 2022

I proposed this ages ago in https://lists.apache.org/thread/xpn9gyrs6kqc3g9t8k4ts8dmy7yyxskq and am finally getting around to implementing it.

I propose to add ordering information to exec nodes (mostly for node validation) and indices to exec batches. This is a fundamental step to allow nodes to consume this ordering information to achieve features such as ARROW-10883 and ARROW-16628. It also can replace the complicated batch enumeration in the current scanner to support in-order table reassembly.

Reporter: Weston Pace / @westonpace
Assignee: Weston Pace / @westonpace
Watchers: Rok Mihevc / @rok

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-17762. Please see the migration documentation for further details.

@westonpace
Copy link
Member

westonpace commented Feb 6, 2023

I plan to start working on this task. However, it is quite large. I plan on breaking it up into the following tasks:

thisisnic pushed a commit that referenced this issue Apr 11, 2023
### Rationale for this change

See also #32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* #34941: Taking head/tail on unordered data, which has non-deterministic results but that should be possible, in the case where the user wants to see a slice of the result, any slice
* #34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simply further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* 
everything we do in a query should be represented in an ExecPlan. 

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative. 

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios. 

* Closes: #34437
* Closes: #31980
* Closes: #31982

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
### Rationale for this change

See also apache#32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* apache#34941: Taking head/tail on unordered data, which has non-deterministic results but that should be possible, in the case where the user wants to see a slice of the result, any slice
* apache#34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simply further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* 
everything we do in a query should be represented in an ExecPlan. 

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative. 

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios. 

* Closes: apache#34437
* Closes: apache#31980
* Closes: apache#31982

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
### Rationale for this change

See also apache#32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* apache#34941: Taking head/tail on unordered data, which has non-deterministic results but that should be possible, in the case where the user wants to see a slice of the result, any slice
* apache#34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simply further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* 
everything we do in a query should be represented in an ExecPlan. 

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative. 

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios. 

* Closes: apache#34437
* Closes: apache#31980
* Closes: apache#31982

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
### Rationale for this change

See also apache#32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* apache#34941: Taking head/tail on unordered data, which has non-deterministic results but that should be possible, in the case where the user wants to see a slice of the result, any slice
* apache#34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simply further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* 
everything we do in a query should be represented in an ExecPlan. 

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative. 

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios. 

* Closes: apache#34437
* Closes: apache#31980
* Closes: apache#31982

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants