-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] fix exit handling of FiberState threads #45834
Conversation
Currently fiber_runner_thread_.detach() is called in Fiber.Join(). But Fiber.Join() could be called multiple times. For example, ConcurrencyGroupManager could call Join() the same number of times as the number of workers. This could result in double detach() failure. detach() should be called after fiber_runner_thread_ was created. That's the right pattern to use std thread. Signed-off-by: hongchaodeng <[email protected]>
Can we add some tests? |
We can add tests to |
Signed-off-by: hongchaodeng <[email protected]>
Signed-off-by: hongchaodeng <[email protected]>
@@ -155,7 +157,6 @@ class FiberState { | |||
|
|||
void Join() { | |||
fiber_stopped_event_->Wait(); | |||
fiber_runner_thread_.detach(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we are relying on the behavior that channel_.close();
and fiber_stopped_event_->Wait();
can be called multiple times, which I checked is true. But this seems fragile and rely on the underlying behavior of these libraries. Can we just have our own stopped_
and joined_
flags and early return if they are already called? Thoughts @hongchaodeng @rynewang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this and fix the issue first.
We are basically guaranteeing something that the library do not provide -- thread cancellation.
That's orthogonal to enhancing the capabilities.
@@ -95,6 +95,14 @@ TEST(FiberStateTest, RespectsConcurrencyLimit) { | |||
fiber_state.Join(); | |||
} | |||
|
|||
TEST(FiberStateTest, DoubleStopJoin) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the unit test, can we also add an e2e test using the repro script mentioned in the GH issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is worth it.
The original repro script is sort of red herring. This test covers the root of the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this unit test covers the root cause of this issue, but it's still nice to make sure user's workload work well e2e even if, for example, we remove fiber completely in the future(and this unit test will be irrelevant)
Signed-off-by: hongchaodeng <[email protected]> Signed-off-by: Richard Liu <[email protected]>
Why are these changes needed?
Currently
fiber_runner_thread_.detach()
is called inFiber.Join()
. ButFiber.Join()
could be called multiple times. For example,ConcurrencyGroupManager
could callJoin()
the same number of times as the number of workers. This could result in doubledetach()
failure.detach()
should be called afterfiber_runner_thread_
was created. That's the right pattern to use std thread.Related issue number
Fix #45656
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.