ARROW-17762: [C++] WIP: Add ordering information to Acero #14158
…c generators to do scanning. Formally defined a new scan options and interfaces for schema evolution.
caf4a9f to 467f26f
@rtpsw FYI, I think you were interested in my plans for ordered execution. This PR is based on the earlier proposal I sent to the ML. I plan to create an example fetch node that consumes the ordering information to do something useful today. This PR is built on top of ARROW-17287, so it is a little easier to look at just the diff between the two:
Thanks, @westonpace. I'm interested, though I will need a couple of days to get to this.
@rtpsw I've added an example of a FetchNode that consumes the ordering information. This will hopefully give some idea of how to use the exec batch index.
There is no rush. I probably won't get back to this myself for a while as I need to get #13782 (and numerous follow-ups) merged in.
I care about this work very much as well and hope to understand it better. If I remember correctly, the high-level idea is that some nodes require ordering (e.g., asof join), and if the input batches arrive out of order (as indicated by the batch index), the consumer node will cache and reorder the out-of-order batches before processing them?
Yes. If a node relies on ordering then it will resequence the batches before processing them. I try to take care to use "reorder" and "resequence" as distinct terms because they describe two rather different problems. The first problem is when the input has no known ordering or is in a completely random order. In that case we must "reorder", which is "not streaming", is a "pipeline breaker", and requires us to cache all data in memory (or spill) in order to assign the order. The second problem is when the input is mostly ordered but might be a bit noisy due to something like a parallel scan. In that case we already have a sequence number and we assume the sequence number is, generally, within some max delta of the correct ordering. Then we only need to resequence (not reorder). This operation is "mostly streaming" and only sometimes a "pipeline breaker".
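To make the resequencing case concrete, here is a minimal sketch of a buffer that holds batches arriving ahead of their turn and releases them in index order. It is not the code from this PR: `Batch`, `Resequencer`, `Push`, and `buffered` are hypothetical names used only for illustration, and the sketch assumes batch indices start at 0 and are contiguous with no duplicates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an exec batch that carries a sequence number.
struct Batch {
  int64_t index;        // position of this batch in the intended order
  std::string payload;  // placeholder for the actual column data
};

// Buffers batches that arrive early and releases them in index order.
// Assumes indices start at 0 and are contiguous (an assumption of this sketch).
class Resequencer {
 public:
  // Accepts one batch and returns every batch that is now ready, in order.
  std::vector<Batch> Push(Batch batch) {
    const int64_t index = batch.index;
    assert(index >= next_index_ && "batch index was already emitted");
    pending_.emplace(index, std::move(batch));

    std::vector<Batch> ready;
    // Drain the front of the map while it matches the next expected index.
    for (auto it = pending_.begin();
         it != pending_.end() && it->first == next_index_;
         it = pending_.erase(it)) {
      ready.push_back(std::move(it->second));
      ++next_index_;
    }
    return ready;
  }

  // Number of batches currently held back because an earlier one is missing.
  std::size_t buffered() const { return pending_.size(); }

 private:
  int64_t next_index_ = 0;
  std::map<int64_t, Batch> pending_;
};
```

When batches arrive in order, `Push` returns each one immediately and nothing is buffered, which is the "mostly streaming" behavior. The buffer only grows while an earlier index is missing, so its size is bounded by how far the arrival order drifts from the sequence numbers.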
This feature is exactly what we need to adapt Acero. I tried to add ExecBatch ordering and implemented the limit operator in our product. Here is what we saw in the tests.
Anyway, this proposal is critical to our using Acero. We are looking forward to its release.
This is being closed in favor of other PRs as listed in the issue #32991.