[core] fix exit handling of FiberState threads #45834

Merged
merged 3 commits into from
Jun 11, 2024

Conversation

hongchaodeng
Member

Why are these changes needed?

Currently, fiber_runner_thread_.detach() is called in Fiber.Join(), but Fiber.Join() can be called multiple times. For example, ConcurrencyGroupManager may call Join() once per worker. This can result in a double detach() failure.

detach() should be called right after fiber_runner_thread_ is created. That is the correct pattern for using std::thread.
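A minimal sketch of the failure mode and of the detach-at-creation pattern described above. `SecondDetachThrows` and `DetachedRunner` are hypothetical names used only for illustration and are not Ray's actual classes; the sketch assumes only standard `std::thread` behavior, where calling `detach()` on a thread that is no longer joinable throws `std::system_error`.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <system_error>
#include <thread>

// Returns true if a second detach() on the same std::thread throws.
bool SecondDetachThrows() {
  std::thread t([] {});
  t.detach();             // OK: t was joinable
  assert(!t.joinable());  // after detach(), t no longer owns a thread
  try {
    t.detach();  // not joinable any more, so this throws std::system_error
  } catch (const std::system_error &) {
    return true;
  }
  return false;
}

// Hypothetical stand-in for FiberState: the runner thread is detached once,
// right after it is created, so Join() never has to call detach().
class DetachedRunner {
 public:
  DetachedRunner() : state_(std::make_shared<State>()) {
    // The thread holds its own shared_ptr so the state outlives the runner.
    std::thread runner([state = state_] {
      std::lock_guard<std::mutex> lock(state->mu);
      state->stopped = true;
      state->cv.notify_all();
    });
    runner.detach();  // called exactly once, at creation
  }

  // Only waits for the stopped flag; safe to call any number of times.
  void Join() {
    std::unique_lock<std::mutex> lock(state_->mu);
    state_->cv.wait(lock, [this] { return state_->stopped; });
  }

 private:
  struct State {
    std::mutex mu;
    std::condition_variable cv;
    bool stopped = false;
  };
  std::shared_ptr<State> state_;
};
```

With this shape, calling `Join()` once per worker only repeats a wait on an already-set flag, so no call can trigger a second `detach()`.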

Related issue number

Fix #45656

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: hongchaodeng <[email protected]>
@jjyao
Collaborator

jjyao commented Jun 10, 2024

Can we add some tests?

@rynewang
Contributor

We can add a test to src/ray/core_worker/test/fiber_state_test.cc that calls Stop() and Join() twice and verifies that it does not crash.

@hongchaodeng hongchaodeng added the go add ONLY when ready to merge, run all tests label Jun 10, 2024
Signed-off-by: hongchaodeng <[email protected]>
@hongchaodeng
Member Author

@jjyao @rynewang Fixed!

src/ray/core_worker/fiber.h (outdated)
Signed-off-by: hongchaodeng <[email protected]>
@@ -155,7 +157,6 @@ class FiberState {

void Join() {
fiber_stopped_event_->Wait();
fiber_runner_thread_.detach();
Collaborator

Currently we rely on the behavior that channel_.close(); and fiber_stopped_event_->Wait(); can be called multiple times, which I checked is true. But this seems fragile, since it relies on the underlying behavior of these libraries. Can we just have our own stopped_ and joined_ flags and return early if Stop() or Join() has already been called? Thoughts @hongchaodeng @rynewang
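A rough sketch of what the flag-based guard suggested here could look like. `GuardedState` and its `done_` member are hypothetical and stand in for the real class, channel, and event; the `bool` return values are added only so the early return is observable, and this is not the code that was merged.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical illustration: Stop() and Join() do their work at most once.
class GuardedState {
 public:
  // Returns true only for the call that actually performed the stop.
  bool Stop() {
    if (stopped_.exchange(true)) return false;  // already stopped
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;  // stands in for closing the channel
    }
    cv_.notify_all();
    return true;
  }

  // Returns true only for the call that actually waited.
  // Note: a concurrent second caller returns early without waiting.
  bool Join() {
    if (joined_.exchange(true)) return false;  // already joined
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return done_; });
    assert(done_);
    return true;
  }

 private:
  std::atomic<bool> stopped_{false};
  std::atomic<bool> joined_{false};
  std::mutex mu_;
  std::condition_variable cv_;
  bool done_ = false;
};
```

The trade-off is that repeated calls no longer depend on the underlying libraries tolerating a second close or wait, at the cost of two extra flags to maintain.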

Member Author

Let's merge this and fix the issue first.
We are basically guaranteeing something that the library does not provide: thread cancellation.
That's orthogonal to enhancing the capabilities.

@@ -95,6 +95,14 @@ TEST(FiberStateTest, RespectsConcurrencyLimit) {
fiber_state.Join();
}

TEST(FiberStateTest, DoubleStopJoin) {
Collaborator

Besides the unit test, can we also add an e2e test using the repro script mentioned in the GH issue?

Member Author

I don't think this is worth it.
The original repro script is something of a red herring. This test covers the root of the problem.

Collaborator

I know this unit test covers the root cause of this issue, but it would still be nice to make sure the user's workload works well e2e even if, for example, we remove fiber completely in the future (in which case this unit test would be irrelevant).

@jjyao jjyao merged commit 3f5679b into ray-project:master Jun 11, 2024
6 checks passed
@hongchaodeng hongchaodeng deleted the fix-fiberjoin branch June 11, 2024 17:46
richardsliu pushed a commit to richardsliu/ray that referenced this pull request Jun 12, 2024
Labels
go add ONLY when ready to merge, run all tests
Development

Successfully merging this pull request may close these issues.

[Core] async actors do not terminate cleanly with __ray_terminate__
3 participants