[Core][Streaming Generator] Fix a bug where each yield will create a new task name #37713 #37972
Conversation
Signed-off-by: SangBin Cho <[email protected]>
Please review the PR when I comment! I need to wait for the test results; it looks promising when I run it locally.
Signed-off-by: SangBin Cho <[email protected]>
@@ -93,7 +93,8 @@ class TaskExecutor {
       const std::vector<ConcurrencyGroup> &defined_concurrency_groups,
       const std::string name_of_concurrency_group_to_execute,
       bool is_reattempt,
-      bool is_streaming_generator);
+      bool is_streaming_generator,
+      bool retry_exception);
I had to pass these values because accessing worker_context_ from the async loop thread triggers a segfault. I think we should find a way to deprecate worker_context since it doesn't work well with async actors.
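A minimal sketch of the underlying issue, in plain Python with hypothetical names (the real code path is C++/Cython): state stored in per-thread worker context on the main worker thread is simply not visible from the event-loop thread, so the values have to be passed as explicit arguments instead.

import asyncio
import threading

# Hypothetical stand-in for worker_context_: per-thread storage populated
# only on the main worker thread.
_worker_context = threading.local()
_worker_context.retry_exception = True

async def task_via_context():
    # Reading thread-local state from the event-loop thread fails silently
    # here (and crashes in the real worker context): data set on the main
    # thread is not there.
    print("from context:", getattr(_worker_context, "retry_exception", None))

async def task_with_explicit_arg(retry_exception: bool):
    # The approach taken in this PR: hand the needed values to the task
    # instead of touching worker context from the async loop thread.
    print("explicit arg:", retry_exception)

def run_loop(retry_exception: bool):
    asyncio.run(task_via_context())                        # prints None
    asyncio.run(task_with_explicit_arg(retry_exception))   # prints True

# Async actors run their event loop on a dedicated thread; capture the value
# on the main thread and pass it over explicitly.
threading.Thread(target=run_loop, args=(_worker_context.retry_exception,)).start()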
                attempt_number)
            generator_index += 1

cpdef report_streaming_generator_output(
this is a refactored method (no logic changes)
cc @edoakes for the serve code approval
Signed-off-by: SangBin Cho <[email protected]>
serve changes LGTM
Signed-off-by: SangBin Cho <[email protected]>
…le bug fix #38171 (#38280) Before #37972, we ran the reporting & serialization of outputs (in C++) in the main thread while all the async actor tasks ran in an async thread. After that PR, both run in the async thread. This caused a regression for workloads with decently large (200~2KB) generator outputs (Aviary): the serialization code runs with nogil, so previously we got real multi-threading, with serialization in the main thread and async actor code in the async thread. This PR fixes the issue by dispatching the C++ code (reporting & serialization) to a separate thread again. I also found that when I used ThreadPoolExecutor, there were circular-reference issues that leaked objects when exceptions happened. This turned out to be because a Python exception captures local references (creating circular references). I refactored that part of the code to avoid this and added a unit test for it.
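A rough sketch of both parts of this fix, using plain asyncio and hypothetical names rather than Ray's actual internals: the blocking reporting/serialization work is handed to a dedicated thread so the GIL-releasing code can again run in parallel with the event loop, and exceptions are reduced to plain strings so they do not capture the frame's locals.

import asyncio
import concurrent.futures

# Hypothetical names; this sketches the shape of the fix, not Ray's code.
_report_executor = concurrent.futures.ThreadPoolExecutor(
    max_workers=1, thread_name_prefix="generator_report"
)

def serialize_and_report(output) -> bytes:
    # Stand-in for the C++ reporting/serialization path. The real code runs
    # with nogil, so moving it off the event-loop thread restores genuine
    # multi-threading alongside the async actor code.
    return repr(output).encode()

def report_and_summarize_error(output):
    # Illustration of the leak described above: a caught exception holds its
    # traceback, the traceback holds this frame, and the frame's locals hold
    # the exception (and `output`) again, forming a reference cycle. Keeping
    # only a string summary avoids pinning large outputs in memory.
    try:
        serialize_and_report(output)
        return None
    except Exception as e:
        return f"{type(e).__name__}: {e}"

async def stream_outputs(gen):
    loop = asyncio.get_running_loop()
    async for output in gen:
        # Dispatch the blocking work to the dedicated thread instead of
        # running it inline on the event loop, which was the source of the
        # regression for larger generator outputs.
        await loop.run_in_executor(_report_executor, report_and_summarize_error, output)

async def _demo():
    async def gen():
        for i in range(3):
            yield {"chunk": i}
    await stream_outputs(gen())

asyncio.run(_demo())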
Why are these changes needed?
This PR fixes #37147 by dispatching the whole generator task into an event loop (instead of dispatching individual anext).
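For illustration, here is a self-contained asyncio sketch (hypothetical names, not Ray's executor code) of the difference between dispatching each anext individually and dispatching the whole generator task into the event loop:

import asyncio
import threading

# The event loop lives on its own thread, as with async actors.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def generator_task():
    for i in range(3):
        await asyncio.sleep(0)
        yield i

def handle_output(item):
    print("got", item)

async def fetch_next(gen):
    # One round trip into the event loop per output item.
    return await gen.__anext__()

def run_per_anext():
    # Old approach: every item is fetched by scheduling an individual anext
    # onto the event loop, so per-yield bookkeeping (such as the task name)
    # was redone around every output.
    gen = generator_task()
    while True:
        future = asyncio.run_coroutine_threadsafe(fetch_next(gen), loop)
        try:
            handle_output(future.result())
        except StopAsyncIteration:
            break

async def consume_whole_generator():
    # New approach: the entire generator task runs as one coroutine inside
    # the event loop, so task-level state is set up once for the whole stream.
    async for item in generator_task():
        handle_output(item)

def run_whole_task():
    asyncio.run_coroutine_threadsafe(consume_whole_generator(), loop).result()

run_per_anext()
run_whole_task()
loop.call_soon_threadsafe(loop.stop)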
The PR could have a slight performance impact because the task output serialization code now runs inside the event loop, unlike before (an approach to avoid this was tried in #37713, but it was too hacky).
Putting the whole generator task into an event loop instead of dispatching individual anext calls means some of the core APIs are now called inside the event loop. We had to remove the usage of worker_context because it does not work well when called from a different thread (the event loop thread); instead, we pass the necessary arguments explicitly.
Related issue number
Closes #37147
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.