[Core][Streaming Generator] Fix a bug where each yield will create a new task name #37713 #37972

Merged

Conversation

@rkooo567 (Contributor) commented on Aug 1, 2023

Why are these changes needed?

This PR fixes #37147 by dispatching the whole generator task into the event loop (instead of dispatching each individual anext call).

The PR could have a slight performance impact because the task output serialization code now runs inside the event loop, unlike before (an approach to avoid this was tried in #37713, but it was too hacky).

  • Put the whole generator task into the event loop instead of dispatching each individual anext call (see the sketch after this list).
  • This means some of the core APIs are now called inside the event loop. I had to remove the usage of worker_context because it does not work well when called from a different thread (the event-loop thread); instead, we pass the necessary arguments explicitly.
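A minimal asyncio sketch of the dispatch change (hypothetical code for illustration only, not Ray's actual internals; streaming_task, drive_per_item, and drive_whole_task are made-up names):

import asyncio

async def streaming_task():
    # Hypothetical user generator task; each yield is one streamed output.
    for i in range(3):
        yield i

# Before: each item was fetched by submitting a separate __anext__ call to
# the event loop, so every yield could appear as if it were a new task.
async def drive_per_item(gen):
    while True:
        try:
            item = await asyncio.ensure_future(gen.__anext__())
        except StopAsyncIteration:
            break
        print("got", item)

# After: the whole generator task is consumed inside a single coroutine on
# the event loop, so it stays one task with one name.
async def drive_whole_task(gen):
    async for item in gen:
        print("got", item)

asyncio.run(drive_whole_task(streaming_task()))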

Related issue number

Closes #37147

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: SangBin Cho <[email protected]>
@rkooo567 rkooo567 assigned jjyao and edoakes and unassigned jjyao Aug 1, 2023
@rkooo567 (Contributor, Author) commented on Aug 1, 2023

Please review the PR when I comment! I need to wait for the test results. It looks promising when I run it locally.

@@ -93,7 +93,8 @@ class TaskExecutor {
     const std::vector<ConcurrencyGroup> &defined_concurrency_groups,
     const std::string name_of_concurrency_group_to_execute,
     bool is_reattempt,
-    bool is_streaming_generator);
+    bool is_streaming_generator,
+    bool retry_exception);
@rkooo567 (Contributor, Author) commented on this diff:

Had to pass these values because accessing worker_context_ from the async event-loop thread triggers a segfault. I think we should find a way to deprecate worker_context, since it doesn't work well with async actors.
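As a rough Python stand-in for the problem (not the actual C++ worker_context_; the names here are hypothetical): state kept in a thread-local on the main worker thread is simply not visible from the event-loop thread, so the needed values have to be passed as arguments instead.

import asyncio
import threading

# Hypothetical stand-in for the worker context: thread-local state that is
# only populated on the main worker thread.
_worker_context = threading.local()
_worker_context.task_name = "main-thread-task"

async def read_from_context():
    # Runs on the event-loop thread, where the thread-local was never set.
    return getattr(_worker_context, "task_name", None)

async def read_from_argument(task_name):
    # The value the coroutine needs is passed in explicitly instead.
    return task_name

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

# Reading from the thread-local inside the loop thread finds nothing (None);
# passing the value as an argument works.
print(asyncio.run_coroutine_threadsafe(read_from_context(), loop).result())
print(asyncio.run_coroutine_threadsafe(read_from_argument("main-thread-task"), loop).result())
loop.call_soon_threadsafe(loop.stop)

(In Python this degrades to a missing attribute rather than a segfault, but the thread-isolation issue is the same.)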

attempt_number)
generator_index += 1

cpdef report_streaming_generator_output(
@rkooo567 (Contributor, Author) commented on this diff:

This is a refactored method (no logic changes).

@rkooo567 (Contributor, Author) commented on Aug 2, 2023

cc @edoakes for the serve code approval

@edoakes (Contributor) left a comment:

serve changes LGTM

python/ray/tests/test_streaming_generator.py (review comments resolved)
python/ray/_raylet.pyx (outdated; review comments resolved)
@rkooo567 rkooo567 merged commit 178ae0f into ray-project:master Aug 4, 2023
2 checks passed
rkooo567 added a commit that referenced this pull request Aug 13, 2023
…le bug fix #38171 (#38280)

Before #37972, we ran the reporting & serialization of outputs (in C++) in the main thread while all the async actor tasks ran in an async thread. After that PR, both run in the async thread.

This caused a regression for decently large (200~2KB) generator workloads (Aviary), because the serialization code runs with the GIL released (nogil): when it ran in the main thread and the async actor code ran in the async thread, we got real multi-threading, which was lost once both ran in the same thread.

This PR fixes the issue by dispatching the C++ code (reporting & serialization) to a separate thread again. I also found that when I used ThreadPoolExecutor, there were circular-reference issues that leaked objects when exceptions happened. I realized this was because a Python exception captures the local references (thus creating circular references). I refactored part of the code to avoid this and added a unit test for it.
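As a generic Python illustration of that exception behavior (not the Ray code itself; Payload and the handler functions are made up): an exception holds its traceback, the traceback holds the frame, and if the frame's locals also hold the exception, the resulting cycle means reference counting alone cannot free anything captured by that frame.

class Payload:
    """Stand-in for a large generator output we want freed promptly."""

def handler_that_leaks():
    obj = Payload()
    try:
        raise ValueError("boom")
    except ValueError as err:
        # err.__traceback__ -> this frame -> locals (saved, obj) -> err
        # forms a reference cycle; once the caller drops the exception,
        # obj is only reclaimed when the cyclic garbage collector runs.
        saved = err
    return saved

def handler_that_does_not_leak():
    obj = Payload()
    try:
        raise ValueError("boom")
    except ValueError as err:
        saved = err
        # Dropping the traceback breaks the cycle, so the frame (and obj)
        # can be freed by reference counting as soon as we return.
        saved.__traceback__ = None
    return saved

leaked = handler_that_leaks()           # Payload still reachable via leaked.__traceback__
clean = handler_that_does_not_leak()    # Payload already freed by refcounting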
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…new task name ray-project#37713 (ray-project#37972)
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…le bug fix ray-project#38171 (ray-project#38280)
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…new task name ray-project#37713 (ray-project#37972)
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…le bug fix ray-project#38171 (ray-project#38280)
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…new task name ray-project#37713 (ray-project#37972)
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…le bug fix ray-project#38171 (ray-project#38280)
Development

Successfully merging this pull request may close these issues.

[streaming] Current task being changed during async iteration
3 participants