-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pass on in-flight transfers if they are already released #6199
Conversation
A task that is released while still in-flight will stick around in the in_flight_workers state. This can cause a KeyError when gather_dep goes to handle this now-missing-task. The correct thing to do in this case is to just ignore the key, which is what this commit does. We could also keep in_flight_workers up-to-date on a release transition. I'm not sure how valuable this would be apart from passing. I'm open to either approach. This was the easiest. Fixes dask#6194
ts = self.tasks[d] | ||
try: | ||
ts = self.tasks[d] | ||
except KeyError: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tasks are not allowed to be released while in flight. This is why I introduced the cancelled
state. Releasing at this stage, i.e. forgetting can lead to all sorts of race conditions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you make of the attached story? It seems like they are being forgotten in between being placed in in_flight_workers
and this stage (there is an await in there, during which I expect pretty much anything can happen). It seems like they're getting cancelled, then resumed, then released and forgotten all in the time span created by the await
above.
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'flight', 'released', 'cancelled', {}, 'processing-released-1650891186.0754924', 1650891186.16295)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'compute-task', 'compute-task-1650891186.392905', 1650891186.4060445)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'cancelled', 'waiting', 'cancelled', {"('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)": ('resumed', 'waiting')}, 'compute-task-1650891186.392905', 1650891186.40608)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'cancelled', 'resumed', 'resumed', {}, 'compute-task-1650891186.392905', 1650891186.406102)
('free-keys', ("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)",), 'processing-released-1650891188.465502', 1650891188.5917683)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'release-key', 'processing-released-1650891188.465502', 1650891188.5917764)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'resumed', 'released', 'released', {"('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)": 'forgotten'}, 'processing-released-1650891188.465502', 1650891188.5917976)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'released', 'forgotten', 'forgotten', {}, 'processing-released-1650891188.465502', 1650891188.591806)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that story is very helpful. There are a couple of signals
- Fetch Task (-> flight; not shown)
- Release -> Cancel (i.e. safe guard against forgetting)
- Compute Task (-> resumed task)
- Free keys
This free keys is recommending a release of the task which in turn triggers a resumed -> released transition which has no safe guard implemented to verify that the task isn't indeed still being used. I.e.
distributed/distributed/worker.py
Line 545 in 1adbb7a
("resumed", "released"): self.transition_generic_released, |
should not be using transition_generic_released
but a specialized version.
That should even be reasonably easy to reproduce
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. As a heads-up I'm pausing work here to go play with #6206 . The extent to which you can take this over would be welcome.
The full story is in the linked issue if that provides any more insight.
Closing in favor of #6217 |
A task that is released while still in-flight will stick around in the
in_flight_workers state. This can cause a KeyError when gather_dep goes
to handle this now-missing-task.
The correct thing to do in this case is to just ignore the key,
which is what this commit does.
We could also keep in_flight_workers up-to-date on a release transition.
I'm not sure how valuable this would be apart from passing.
I'm open to either approach. This was the easiest.
Fixes #6194