Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] fix potential dead lock #17396

Merged
merged 1 commit into from
Jul 30, 2021
Merged

Conversation

scv119
Copy link
Contributor

@scv119 scv119 commented Jul 28, 2021

absl complains there are potential deadlocks when running in debug mode.
after reading the stack it turns out we have following lock acquiring order:

  • in CoreWorker::InternalHeartbeat, we first acquire and hold lock CoreWorker.mutex_, then we acquire and hold CoreWorkerDirectActorTaskSubmitter.mutex_ via CoreWorkerDirectActorTaskSubmitter.SubmitTask call.
  • however, in CoreWorkerDirectActorTaskSubmitter.PushActorTask, we first acquire and hold lock CoreWorkerDirectActorTaskSubmitter.mutex_, and then we hold CoreWorker.mutex_ through TaskManager's retry_task_callback_

This leads to potential dead locks if those two call path happens in different threads. This PR fix it by releasing the lock before calling to SubmitTask in CoreWorker::InternalHeartbeat

Test Plan

  • verified debug test no longer complains potential deadlocks.

@scv119 scv119 linked an issue Jul 28, 2021 that may be closed by this pull request
@ericl
Copy link
Contributor

ericl commented Jul 28, 2021

Hmm the number of failing tests is suspicious, trigger a rerun?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 28, 2021
@scv119 scv119 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 29, 2021
@rkooo567
Copy link
Contributor

@scv119 btw, we probably need to verify this before merging it? (how do we know this fixes the issue now? )

@scv119
Copy link
Contributor Author

scv119 commented Jul 29, 2021 via email

@rkooo567 rkooo567 added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. do-not-merge Do not merge this PR! labels Jul 30, 2021
@scv119
Copy link
Contributor Author

scv119 commented Jul 30, 2021

@rkooo567 maybe we can merge this as we it will take a while to consolidate the debug-asan test? I've done manual tests and verified the potential dead-lock crash fixed on my laptop.

@scv119 scv119 removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. do-not-merge Do not merge this PR! labels Jul 30, 2021
@ericl ericl merged commit 32803b5 into ray-project:master Jul 30, 2021
@rkooo567
Copy link
Contributor

SGTM. Btw Eric already merged it haha

stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Jul 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core] Fix potential deadlocks in ray core
4 participants