
[core] Streaming generator executor waits for item report to complete before continuing #44257

Merged
3 commits merged into ray-project:master on Mar 26, 2024

Conversation

stephanie-wang (Contributor) commented Mar 23, 2024

Why are these changes needed?

#42260 updated streaming generator tasks to asynchronously report generator returns, instead of synchronously reporting each generator return before yielding the next one. However, this has a couple of problems:

  • If the task still has a reference to the yielded value, it may modify the value. The serialized and reported return will then have a different value than expected.
  • As per #44079 ([core] Streaming generator task waits for all object report acks before finishing the task), we need to track the number of in-flight RPCs that report generator returns, so that we can wait for all of them to reply before the task finishes. If we increment the count of in-flight RPCs asynchronously, we can end up returning from the task while there are still in-flight RPCs.

This PR therefore reverts some of the logic in #42260 so that the executor waits for each generator return to be serialized into the protobuf sent back to the caller. Note that we do not wait for the reply (unless under backpressure).

We can later re-introduce asynchronous generator reports, but we will need to evaluate the performance benefit of a new implementation that also addresses both of the above points.
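
For readers less familiar with the executor loop, here is a minimal sketch of the behavior described above. This is illustrative Python rather than Ray's actual implementation; the class name, the `report_rpc` callback, and the `backpressure_threshold` parameter are hypothetical names introduced only for the example.

```python
import pickle
import threading


class StreamingGeneratorExecutor:
    """Simplified model of the executor loop after this PR: each yielded item
    is serialized and handed to the reporting RPC *before* the generator is
    resumed, while the RPC reply itself is not awaited unless the caller
    applies backpressure."""

    def __init__(self, report_rpc, backpressure_threshold=None):
        self._report_rpc = report_rpc            # hypothetical: sends one item report; calls done_cb on ack
        self._backpressure_threshold = backpressure_threshold
        self._num_inflight = 0                   # reports sent but not yet acked
        self._all_acked = threading.Condition()

    def run(self, generator_fn, *args, **kwargs):
        for index, item in enumerate(generator_fn(*args, **kwargs)):
            # Serialize the item now, before resuming the generator, so a later
            # mutation of `item` by the task cannot change what the caller
            # receives (problem 1 above).
            payload = pickle.dumps(item)

            # Increment the in-flight count synchronously so the end-of-task
            # wait below can never miss a report (problem 2 above).
            with self._all_acked:
                self._num_inflight += 1

            # Fire off the report; we do not wait for the reply here...
            self._report_rpc(index, payload, done_cb=self._on_ack)

            # ...unless we are under backpressure.
            if self._backpressure_threshold is not None:
                with self._all_acked:
                    self._all_acked.wait_for(
                        lambda: self._num_inflight <= self._backpressure_threshold
                    )

        # Per #44079: do not finish the task until every report has been acked.
        with self._all_acked:
            self._all_acked.wait_for(lambda: self._num_inflight == 0)

    def _on_ack(self):
        with self._all_acked:
            self._num_inflight -= 1
            self._all_acked.notify_all()
```

Serializing eagerly gives up a little pipelining, but the reply is still handled asynchronously, so the common (non-backpressured) case stays non-blocking.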

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines 1446 to 1447
# we do not wait for the reply here (unless we are under
# backpressure).

Please clarify that back-pressure doesn't apply to RPC Actor calls (only Ray Data).

Signed-off-by: Stephanie Wang <[email protected]>
stephanie-wang added a commit that referenced this pull request Mar 26, 2024
…44197)

When a streaming generator task is cancelled, we should mark the end of the stream at the caller's current pointer into the stream. Otherwise, if we receive out-of-order item reports, we may end up hanging, because the cancelled task will never report the intermediate items. We may end up dropping some values that were already reported, but this is OK since the user already cancelled the task.

Closes #43852.

Note: #44257 is also enough to make the relevant flaky test stable, probably because it makes out-of-order item reports less likely. Meanwhile, this PR addresses the root cause of the flakiness: out-of-order item reports during task cancellation.

---------

Signed-off-by: Stephanie Wang <[email protected]>
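
The cancellation fix is easier to see with a small caller-side model of the stream. This is an illustrative sketch under assumed names (`CallerSideStream`, `report_item`, `mark_cancelled`), not Ray's actual stream implementation:

```python
class CallerSideStream:
    """Toy model of the caller's view of a streaming generator: items are
    reported by index (possibly out of order) and consumed at next_index."""

    def __init__(self):
        self._items = {}          # index -> value; reports may arrive out of order
        self._next_index = 0      # caller's current read pointer into the stream
        self._end_index = None    # index at which the stream ends

    def report_item(self, index, value):
        if self._end_index is not None and index >= self._end_index:
            return  # stream already ended (e.g. the task was cancelled); drop it
        self._items[index] = value

    def mark_cancelled(self):
        # Mark the end of the stream at the caller's *current* pointer rather
        # than at the highest index reported so far. A cancelled task never
        # reports the missing intermediate items, so waiting for them would
        # hang; items already reported past this point are dropped, which is
        # acceptable because the user cancelled the task.
        self._end_index = self._next_index

    def read_next(self):
        if self._end_index is not None and self._next_index >= self._end_index:
            raise StopIteration
        # A real implementation would block or await here until the item at
        # next_index has been reported; the toy version assumes it has arrived.
        value = self._items.pop(self._next_index)
        self._next_index += 1
        return value
```

Placing the end marker at the consumer's pointer, rather than at the highest reported index, is what prevents the hang: everything before the marker has already been consumed, so nothing the cancelled task will never deliver is ever waited on.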
stephanie-wang merged commit 11506ca into ray-project:master on Mar 26, 2024
5 checks passed
stephanie-wang deleted the revert-42260 branch on March 26, 2024 at 02:05
stephanie-wang added a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
…ay-project#44197)
stephanie-wang added a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
… before continuing (ray-project#44257)