
Resolve work stealing deadlock caused by race in move_task_confirm #5379

Merged (9 commits) on Oct 14, 2021

Conversation

fjetter (Member) commented Oct 1, 2021

This refactors move_task_confirm and resolves a few minor bugs:

  • self.scheduler.remove_worker was called improperly after an unsuccessful event (argument mismatch) and would raise either way. Removing the worker is not necessary, since the batched comm takes care of this and triggers all relevant key reschedulings.
  • Similarly, the rescheduling trigger only needed to act if the steal was confirmed, and is now a bit more precise.

The most important bug is a deadlock caused by a race between concurrent stealing requests. I do not prohibit concurrent steals for the same key, but rather make the system robust to concurrency if it happens. The concurrency control I'm applying is an ETag-like transaction confirmation using a stealing stimulus ID, which is passed on through the transitions on the worker side so that it can be traced. Every received response is now logged, unlike before, where certain code branches were ignored, making debugging harder.
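The ETag-like confirmation described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code in distributed/stealing.py; the names (start_steal, move_task_confirm, in_flight) are assumptions made for the sketch.

```python
from itertools import count

# Hypothetical sketch of an ETag-like steal confirmation; the real
# implementation in distributed/stealing.py uses different structures.
_stimulus_counter = count()
in_flight = {}  # key -> stimulus ID of the most recent steal request


def start_steal(key):
    # Issue a fresh stimulus ID for this steal request; a newer request
    # for the same key supersedes any older in-flight one.
    stimulus_id = f"steal-{next(_stimulus_counter)}"
    in_flight[key] = stimulus_id
    return stimulus_id


def move_task_confirm(key, stimulus_id):
    # Only the response matching the most recent request may commit the
    # move; a stale confirmation is ignored rather than acted upon.
    if in_flight.get(key) != stimulus_id:
        return "stale"
    del in_flight[key]
    return "confirmed"
```

With two concurrent requests for the same key, only the latest one can confirm; that is the property that makes concurrent steals safe.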

fjetter (Member, Author) commented Oct 1, 2021

New tests seem to be a bit brittle and there is a cythonization error. I will need to look into it.

fjetter (Member, Author) commented Oct 5, 2021

Most of the test failures are unrelated.

However, there appears to be a genuine failure in test_reschedule_concurrent_requests_deadlock that I will need to investigate (timeout; probably a deadlock).

Review thread on distributed/stealing.py:
"released",
None,
}

Collaborator commented:

I think these constant names generate confusion with Worker.status.
Could you change them e.g. to _TASKSTATE_STATE_CONFIRM etc.?

On a related note, this IMHO highlights that TaskState.state direly needs to become an Enum - it's just too easy to miss a state here, now or in the future.

fjetter (Member, Author) replied:

> Could you change them e.g. to _TASKSTATE_STATE_CONFIRM etc.?

sure

> this IMHO highlights that TaskState.state direly needs to become an Enum

agreed. I would prefer doing this in another PR if you don't mind

Member commented:

Sorry, I missed this outstanding request. @fjetter would you mind pushing up a separate PR with the updated name @crusaderky suggested?

Collaborator replied:

> agreed. I would prefer doing this in another PR

Yes obviously

fjetter (Member, Author) commented Oct 13, 2021

Test failures appear to be unrelated. There is again the cythonization error (#5406), which I will address in a follow-up PR.

jrbourbeau (Member) left a comment:

Thanks @fjetter for fixing and @crusaderky for reviewing

fjetter added the labels deadlock ("The cluster appears to not make any progress") and stealing on Jun 20, 2022