Timeboxed push for simplifying work stealing #6993

Closed
2 of 4 tasks
fjetter opened this issue Sep 2, 2022 · 2 comments
Assignees: fjetter, hendrikmakait
Labels: adaptive, enhancement, performance, scheduler, scheduling, stability, stealing

Comments

@fjetter
Member

fjetter commented Sep 2, 2022

Work stealing is a known source of problems. Its current implementation is overly complex and has several known issues, some of which are almost fixed.

Specifically, I propose to timebox this to ~1-2 weeks and try to wrap up a few known issues while pushing for a drastic simplification of the implementation. Once the dust settles, we can reevaluate how this feature has to evolve.

The short- to mid-term target of this effort should be to reduce the number of steal requests drastically, so that we can afford to spend more time on "good" decisions (e.g. reusing the actual scheduler decide_worker logic or something even better).

Even if we want to get rid of work stealing entirely, there is some need for it to balance inhomogeneous workloads and allow cluster upscaling, see #6600. The most valuable component of the current implementation is the handshake mechanism move_task_request / move_task_confirm, which ensures consistent transitions without recomputing a key. I believe that by tearing down the infrastructure around this handshake piece by piece, we can iterate towards a more stable and maintainable implementation.
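
To make the handshake idea concrete, here is a minimal toy sketch in Python. It is not the code in distributed: the classes `ToyWorker` and `ToyStealing`, the state names, and the `handle_steal_request` method are all hypothetical, and only the names `move_task_request` and `move_task_confirm` are taken from the paragraph above. The sketch assumes the victim gives up a task only if it has not started running, which is what keeps a key from being computed twice.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorker:
    """Hypothetical worker; tasks maps key -> "ready" or "executing"."""
    name: str
    tasks: dict = field(default_factory=dict)

    def handle_steal_request(self, key):
        """Report the task state and release the task only if it has not started."""
        state = self.tasks.get(key)
        if state == "ready":
            del self.tasks[key]
        return state


class ToyStealing:
    """Hypothetical scheduler-side bookkeeping for the two-step handshake."""

    def __init__(self):
        self.in_flight = {}  # key -> (victim, thief)

    def move_task_request(self, key, victim, thief):
        """Step 1: record the intent to move a key; refuse a second request for it."""
        if key in self.in_flight:
            return False
        self.in_flight[key] = (victim, thief)
        return True

    def move_task_confirm(self, key, reported_state):
        """Step 2: assign the key to the thief only if the victim released it."""
        victim, thief = self.in_flight.pop(key)
        if reported_state == "ready":
            thief.tasks[key] = "ready"
            return thief
        return victim  # the victim keeps the task, nothing moves


def steal(stealing, key, victim, thief):
    """Run one full handshake and return the worker that owns the key afterwards."""
    if not stealing.move_task_request(key, victim, thief):
        return None
    state = victim.handle_steal_request(key)
    return stealing.move_task_confirm(key, state)
```

In this toy model a key lives on exactly one worker before and after every handshake, and a task that is already executing simply stays where it is. Everything else around stealing (choosing victims, thieves, and when to ask) is the surrounding infrastructure that could be simplified independently.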

Previously, I approached changes to this logic very carefully due to the lack of repeatable benchmarks. I therefore suggest that this effort use the benchmarks in coiled-runtime to the best of our abilities.

fjetter added the enhancement, performance, stability, scheduling, stealing, scheduler, and adaptive labels on Sep 2, 2022
fjetter self-assigned this on Sep 2, 2022
hendrikmakait self-assigned this on Sep 27, 2022
@hayesgb

hayesgb commented Oct 17, 2022

#7075

@jrbourbeau
Member

@hayesgb how does #7075 relate to this issue? Does it close this issue? Or perhaps it's just related work?

fjetter closed this as completed on Oct 28, 2022