who_has not set for task in state fetch #5751
Comments
Looks like this is a regression introduced by #5653
This is a weird one, again.
The assertion is raised during a validation step. I encountered such a condition already but couldn't make sense of it at the time. This led to us filtering out self.address in distributed/worker.py (lines 3195 to 3202 at commit 30ffa9c),
since otherwise gather_dep would run in an endless loop trying to fetch data from itself (yes, that's possible 🤦), see #4784. I will likely need to dig a bit deeper into what the scheduler thinks it is doing.
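The self-address filtering mentioned above can be sketched as follows. This is a hypothetical, simplified stand-in for the actual distributed/worker.py code; the function name and addresses are made up for illustration:

```python
def select_workers_to_fetch_from(who_has, self_address):
    """Drop this worker's own address from a dependency's who_has set.

    Hypothetical sketch: if the scheduler lists this worker itself as a
    holder of the data, gather_dep would otherwise try to fetch the key
    from itself in an endless loop.
    """
    return {addr for addr in who_has if addr != self_address}


# Example: the scheduler believes worker B (ourselves) already holds the key.
who_has = {"tcp://worker-a:1234", "tcp://worker-b:5678"}
print(select_workers_to_fetch_from(who_has, "tcp://worker-b:5678"))
```

Note that when the worker itself is the *only* listed holder, this filtering leaves an empty set, which is exactly the situation the assertion complains about.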
There are a few things happening. While worker B is trying to fetch data from worker A, worker A is killed such that the tasks are reassigned to worker B.

The AssertionError/deadlock appears iff the compute->release cycles overlap with a task-finished signal of a previous compute due to network delay. This will cause the dependent to be scheduled on worker B even though worker B is still working off the initial task. Trying to work on the dependent will trigger its dependency to be transitioned to fetch, but since the worker itself is the one "supposed to have it", who_has will appear empty.

This stuff is difficult to write down; below is a timeline of events with a few annotations.

Annotated timeline of events
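The failing invariant can be modeled with a minimal sketch. The class and function names below are hypothetical simplifications, not the actual distributed worker code; they only mirror the invariant from the issue title, that a task in state "fetch" must have a non-empty who_has:

```python
class TaskState:
    """Hypothetical, stripped-down model of a worker-side task."""

    def __init__(self, key, state="released", who_has=None):
        self.key = key
        self.state = state
        self.who_has = set(who_has or ())


def validate_task_fetch(ts):
    # The invariant from the issue title: a task in state "fetch"
    # must know at least one worker holding its data.
    assert ts.who_has, f"who_has not set for task in state fetch: {ts.key}"


self_address = "tcp://worker-b:5678"

# The scheduler claims only *this* worker holds the dependency ...
ts = TaskState("x", state="fetch", who_has={self_address})
# ... so filtering out our own address leaves who_has empty,
ts.who_has.discard(self_address)
# and validation trips the assertion.
try:
    validate_task_fetch(ts)
except AssertionError as e:
    print("validation failed:", e)
```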
Apart from this, I noticed two more things behaving weirdly.
See below; this ordering depends on the insertion order of the recommendations dict
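The ordering sensitivity can be illustrated with a toy sketch. This is not the actual transition machinery; it only demonstrates that a plain dict of recommendations is applied in insertion order, so two logically equivalent recommendation sets can drive the state machine through different intermediate steps:

```python
def apply_recommendations(states, recs):
    """Hypothetical sketch: apply state recommendations in dict order."""
    for key, next_state in recs.items():  # Python dicts preserve insertion order
        states[key] = next_state
    return states


# Same recommendations, different insertion order:
a = apply_recommendations({}, {"x": "fetch", "y": "released"})
b = apply_recommendations({}, {"y": "released", "x": "fetch"})
print(list(a), list(b))  # same final mapping, different processing order
```

The final mappings are equal, but the order in which transitions fire differs, which is enough to expose race-dependent bugs in a state machine.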
Closed by #5786
Known tests to be affected
In production environments validation is disabled, which will likely cause other errors and/or deadlocks instead.
Task story
Full story