[Core][Streaming Generator] Fix a bug where each yield will create a new task name #37713 #37972

Merged

Conversation

@rkooo567 (Contributor) commented on Aug 1, 2023

Why are these changes needed?

This PR fixes #37147 by dispatching the whole generator task into the event loop (instead of dispatching each individual anext call).

The PR could have a slight performance impact because the task output serialization code now runs inside the event loop, unlike before (an approach to avoid this was tried in #37713, but it was too hacky).

  • Put the whole generator task into the event loop instead of dispatching each individual anext call (see the sketch after this list).
  • This means some of the core APIs are now called inside the event loop. I had to remove the usage of worker_context because it does not work well when called from a different thread (the event-loop thread); instead, we pass the necessary arguments explicitly.
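A minimal asyncio sketch of the dispatch change (hypothetical code for illustration only, not Ray's actual internals; streaming_task, drive_per_item, and drive_whole_task are made-up names):

import asyncio

async def streaming_task():
    # Hypothetical user generator task; each yield is one streamed output.
    for i in range(3):
        yield i

# Before: each item was fetched by submitting a separate __anext__ call to
# the event loop, so every yield could appear as if it were a new task.
async def drive_per_item(gen):
    while True:
        try:
            item = await asyncio.ensure_future(gen.__anext__())
        except StopAsyncIteration:
            break
        print("got", item)

# After: the whole generator task is consumed inside a single coroutine on
# the event loop, so it stays one task with one name.
async def drive_whole_task(gen):
    async for item in gen:
        print("got", item)

asyncio.run(drive_whole_task(streaming_task()))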

Related issue number

Closes #37147

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: SangBin Cho <[email protected]>
@rkooo567 rkooo567 assigned jjyao and edoakes and unassigned jjyao Aug 1, 2023
@rkooo567 (Contributor, Author) commented on Aug 1, 2023

Please review the PR when I comment! I need to wait for the test results. It looks promising when I run it locally.

@@ -93,7 +93,8 @@ class TaskExecutor {
     const std::vector<ConcurrencyGroup> &defined_concurrency_groups,
     const std::string name_of_concurrency_group_to_execute,
     bool is_reattempt,
-    bool is_streaming_generator);
+    bool is_streaming_generator,
+    bool retry_exception);
@rkooo567 (Contributor, Author) commented on this diff:

Had to pass these values because accessing worker_context_ from the async event-loop thread triggers a segfault. I think we should find a way to deprecate worker_context, since it doesn't work well with async actors.
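As a rough Python stand-in for the problem (not the actual C++ worker_context_; the names here are hypothetical): state kept in a thread-local on the main worker thread is simply not visible from the event-loop thread, so the needed values have to be passed as arguments instead.

import asyncio
import threading

# Hypothetical stand-in for the worker context: thread-local state that is
# only populated on the main worker thread.
_worker_context = threading.local()
_worker_context.task_name = "main-thread-task"

async def read_from_context():
    # Runs on the event-loop thread, where the thread-local was never set.
    return getattr(_worker_context, "task_name", None)

async def read_from_argument(task_name):
    # The value the coroutine needs is passed in explicitly instead.
    return task_name

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

# Reading from the thread-local inside the loop thread finds nothing (None);
# passing the value as an argument works.
print(asyncio.run_coroutine_threadsafe(read_from_context(), loop).result())
print(asyncio.run_coroutine_threadsafe(read_from_argument("main-thread-task"), loop).result())
loop.call_soon_threadsafe(loop.stop)

(In Python this degrades to a missing attribute rather than a segfault, but the thread-isolation issue is the same.)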

attempt_number)
generator_index += 1

cpdef report_streaming_generator_output(
@rkooo567 (Contributor, Author) commented on this diff:

This is a refactored method (no logic changes).

@rkooo567 (Contributor, Author) commented on Aug 2, 2023

cc @edoakes for the serve code approval

@edoakes (Contributor) left a comment:

serve changes LGTM

python/ray/tests/test_streaming_generator.py (review comments resolved)
python/ray/_raylet.pyx (outdated; review comments resolved)
@rkooo567 rkooo567 merged commit 178ae0f into ray-project:master Aug 4, 2023
2 checks passed
rkooo567 added a commit that referenced this pull request Aug 13, 2023
…le bug fix #38171 (#38280)

Before #37972, we ran the reporting & serialization of outputs (in C++) in the main thread while all the async actor tasks ran in an async thread. After that PR, both run in the async thread.

This caused a regression for decently large (200~2KB) generator workloads (Aviary), because the serialization code runs with the GIL released (nogil): when it ran in the main thread and the async actor code ran in the async thread, we got real multi-threading, which was lost once both ran in the same thread.

This PR fixes the issue by dispatching the C++ code (reporting & serialization) to a separate thread again. I also found that when I used ThreadPoolExecutor, there were circular-reference issues that leaked objects when exceptions happened. I realized this was because a Python exception captures the local references (thus creating circular references). I refactored part of the code to avoid this and added a unit test for it.
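As a generic Python illustration of that exception behavior (not the Ray code itself; Payload and the handler functions are made up): an exception holds its traceback, the traceback holds the frame, and if the frame's locals also hold the exception, the resulting cycle means reference counting alone cannot free anything captured by that frame.

class Payload:
    """Stand-in for a large generator output we want freed promptly."""

def handler_that_leaks():
    obj = Payload()
    try:
        raise ValueError("boom")
    except ValueError as err:
        # err.__traceback__ -> this frame -> locals (saved, obj) -> err
        # forms a reference cycle; once the caller drops the exception,
        # obj is only reclaimed when the cyclic garbage collector runs.
        saved = err
    return saved

def handler_that_does_not_leak():
    obj = Payload()
    try:
        raise ValueError("boom")
    except ValueError as err:
        saved = err
        # Dropping the traceback breaks the cycle, so the frame (and obj)
        # can be freed by reference counting as soon as we return.
        saved.__traceback__ = None
    return saved

leaked = handler_that_leaks()           # Payload still reachable via leaked.__traceback__
clean = handler_that_does_not_leak()    # Payload already freed by refcounting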
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…new task name ray-project#37713 (ray-project#37972)
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…le bug fix ray-project#38171 (ray-project#38280)
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…new task name ray-project#37713 (ray-project#37972)
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…le bug fix ray-project#38171 (ray-project#38280)
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…new task name ray-project#37713 (ray-project#37972)
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…le bug fix ray-project#38171 (ray-project#38280)
Development

Successfully merging this pull request may close these issues.

[streaming] Current task being changed during async iteration
3 participants