Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass on in-flight transfers if they are already released #6199

Closed
wants to merge 1 commit into from

Conversation

mrocklin
Copy link
Member

A task that is released while still in-flight will stick around in the
in_flight_workers state. This can cause a KeyError when gather_dep goes
to handle this now-missing-task.

The correct thing to do in this case is to just ignore the key,
which is what this commit does.

We could also keep in_flight_workers up-to-date on a release transition.
I'm not sure how valuable this would be apart from passing.
I'm open to either approach. This was the easiest.

Fixes #6194

A task that is released while still in-flight will stick around in the
in_flight_workers state.  This can cause a KeyError when gather_dep goes
to handle this now-missing-task.

The correct thing to do in this case is to just ignore the key,
which is what this commit does.

We could also keep in_flight_workers up-to-date on a release transition.
I'm not sure how valuable this would be apart from passing.
I'm open to either approach.  This was the easiest.

Fixes dask#6194
@github-actions
Copy link
Contributor

Unit Test Results

       15 files   -        1         15 suites   - 1   7h 14m 38s ⏱️ - 3m 32s
  2 731 tests ±       0    2 648 ✔️  -        1       80 💤  -   2  3 +3 
20 543 runs   - 1 190  19 527 ✔️  - 1 160  1 012 💤  - 34  4 +4 

For more details on these failures, see this check.

Results for commit d5eec32. ± Comparison against base commit 198522b.

Comment on lines -3105 to +3112
ts = self.tasks[d]
try:
ts = self.tasks[d]
except KeyError:
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tasks are not allowed to be released while in flight. This is why I introduced the cancelled state. Releasing at this stage, i.e. forgetting can lead to all sorts of race conditions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you make of the attached story? It seems like they are being forgotten in between being placed in in_flight_workers and this stage (there is an await in there, during which I expect pretty much anything can happen). It seems like they're getting cancelled, then resumed, then released and forgotten all in the time span created by the await above.

("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'flight', 'released', 'cancelled', {}, 'processing-released-1650891186.0754924', 1650891186.16295)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'compute-task', 'compute-task-1650891186.392905', 1650891186.4060445)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'cancelled', 'waiting', 'cancelled', {"('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)": ('resumed', 'waiting')}, 'compute-task-1650891186.392905', 1650891186.40608)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'cancelled', 'resumed', 'resumed', {}, 'compute-task-1650891186.392905', 1650891186.406102)
('free-keys', ("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)",), 'processing-released-1650891188.465502', 1650891188.5917683)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'release-key', 'processing-released-1650891188.465502', 1650891188.5917764)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'resumed', 'released', 'released', {"('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)": 'forgotten'}, 'processing-released-1650891188.465502', 1650891188.5917976)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'released', 'forgotten', 'forgotten', {}, 'processing-released-1650891188.465502', 1650891188.591806)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that story is very helpful. There are a couple of signals

  • Fetch Task (-> flight; not shown)
  • Release -> Cancel (i.e. safe guard against forgetting)
  • Compute Task (-> resumed task)
  • Free keys

This free keys is recommending a release of the task which in turn triggers a resumed -> released transition which has no safe guard implemented to verify that the task isn't indeed still being used. I.e.

("resumed", "released"): self.transition_generic_released,

should not be using transition_generic_released but a specialized version.

That should even be reasonably easy to reproduce

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. As a heads-up I'm pausing work here to go play with #6206 . The extent to which you can take this over would be welcome.

The full story is in the linked issue if that provides any more insight.

@fjetter
Copy link
Member

fjetter commented Apr 27, 2022

Closing in favor of #6217

@fjetter fjetter closed this Apr 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KeyError in gather_dep
2 participants