queue-manager: reorganize into strategies #2

Merged
merged 2 commits into from
Jul 29, 2024

@vsoch vsoch commented Jul 28, 2024

I realized that we need flexibility in defining queue strategies, not just in how the worker is designed, but also in how the queue strategy handles the schedule function. This is an overhaul (not quite done yet) that does that. I still need to plug the final query back in to move provisional pods to the worker queue. Also note that it looks like we have priority, pending, and other insert params to play with. And since things get lost in Slack, here is a visual of the design:

image
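
For readers without the image, here is a rough sketch of what "a strategy defines both the schedule function and the worker" could look like. It is written in Python purely for illustration. The `QueueStrategy` interface, its method names, the `PassThrough` strategy, and the registry are all hypothetical and are not taken from this branch.

```python
from typing import Protocol


class QueueStrategy(Protocol):
    """Hypothetical interface: a strategy owns its schedule step and its worker."""

    name: str

    def schedule(self, provisional: list[dict]) -> list[dict]:
        """Pick which provisional groups move to the worker queue."""
        ...

    def work(self, group: dict) -> bool:
        """Process one group from the worker queue; True if it was allocated."""
        ...


class PassThrough:
    """Trivial strategy, only to show the shape: forward every group unchanged."""

    name = "pass-through"

    def schedule(self, provisional):
        return list(provisional)

    def work(self, group):
        return True


# Registry so a queue manager could select a strategy by name (assumption).
STRATEGIES: dict[str, QueueStrategy] = {}


def register(strategy: QueueStrategy) -> None:
    STRATEGIES[strategy.name] = strategy


register(PassThrough())
```

The point of keeping both methods on one object is that a strategy can change how groups are selected and how they are processed without the queue manager knowing the details.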

The queue strategy I'm starting with is FCFS with backfill, which is (sort of) what Kubernetes can do, assuming it would schedule groups without clogging (allowing smaller groups that can be scheduled to fill in). This work is almost done: I need to finish the query that selects the provisional pods whose groups are at quorum, and then add them to the worker queues. I've already tested this step: once a group hits the worker queue, at least for this strategy, that is where we call "AskFlux" to do an allocation. It's FCFS with backfill because that allocation request can be denied if resources aren't ready, in which case the job goes back into the queue and the next group is tried.
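
To make the backfill behavior concrete, here is a minimal sketch in Python. It is not the code in this branch: `ask_flux` is a stand-in that only compares a node count against free nodes, and the group dictionaries are made up for the example.

```python
from collections import deque


def ask_flux(group, free_nodes):
    """Stand-in for the AskFlux allocation call: grant only if enough nodes are free."""
    return group["nodes"] <= free_nodes


def fcfs_with_backfill(groups, free_nodes):
    """Make one pass over the queue in arrival order.

    A denied group goes back on the queue and the next group is tried,
    so smaller groups can fill in behind a large one that does not fit yet.
    Returns (scheduled group names, groups still waiting, free nodes left).
    """
    queue = deque(groups)
    scheduled = []
    requeued = deque()
    while queue:
        group = queue.popleft()
        if ask_flux(group, free_nodes):
            free_nodes -= group["nodes"]
            scheduled.append(group["name"])
        else:
            requeued.append(group)  # denied: retry on a later pass
    return scheduled, requeued, free_nodes


groups = [
    {"name": "big", "nodes": 8},
    {"name": "small-1", "nodes": 2},
    {"name": "small-2", "nodes": 3},
]
scheduled, waiting, free = fcfs_with_backfill(groups, free_nodes=6)
print(scheduled)                     # ['small-1', 'small-2']
print([g["name"] for g in waiting])  # ['big']
```

With 6 free nodes the 8-node group is denied, but it does not block the two smaller groups behind it, which is the "without clogging" property described above.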

The events (subscriptions) are also working. Updating args with the node assignment is how we will send the signal back to the scheduler and then call the binding. I haven't yet removed the original fluence in-tree design, but that is happening slowly, and when the functionality is fully working here, I will remove it entirely in favor of the new design. I will need to think about how to properly handle the current in-tree plugins, because two different scheduling strategies don't make sense. My hope is that I can move the functionality of the current (essential) in-tree plugins to work in our new framework, whatever that might look like. 👀
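
The args-as-signal idea can be sketched as follows, again in Python and with an in-memory stand-in for the real queue. The `JobQueue` class, its `complete` method, and the `on_completed` callback are hypothetical names for illustration and do not reflect the actual queue library's API.

```python
class JobQueue:
    """Tiny in-memory stand-in for a job queue with completion subscribers."""

    def __init__(self):
        self.jobs = {}         # job id -> {"args": {...}, "state": str}
        self.subscribers = []  # callables invoked with (job_id, args)
        self._next_id = 1

    def insert(self, args):
        job_id = self._next_id
        self._next_id += 1
        self.jobs[job_id] = {"args": dict(args), "state": "available"}
        return job_id

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def complete(self, job_id, extra_args):
        """Called from the worker: merge results into args, then notify."""
        job = self.jobs[job_id]
        job["args"].update(extra_args)
        job["state"] = "completed"
        for callback in self.subscribers:
            callback(job_id, job["args"])


bindings = {}


def on_completed(job_id, args):
    """Scheduler side: read the node assignment out of args and record the binding."""
    bindings[args["group"]] = args["nodes"]


queue = JobQueue()
queue.subscribe(on_completed)
job_id = queue.insert({"group": "lammps", "size": 2})
queue.complete(job_id, {"nodes": ["node-a", "node-b"]})
print(bindings)  # {'lammps': ['node-a', 'node-b']}
```

The design choice this illustrates is that the worker never talks to the scheduler directly. It only writes the node list into the job's args, and whoever is subscribed to completion events picks it up from there.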

Note that this branch goes into another branch that doesn't have a PR open yet.

Needs before merge here

  • Strategy to send node list and job id back to subscribers (scheduler)
  • Update queryReady query to select only provisional pods for which groups are fully assembled
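
As a rough idea of what the second item could look like, here is a sketch of a "groups at quorum" query. It uses Python with an in-memory SQLite database only so that it is self-contained. The table name, the column names, and the query text are assumptions and are not the actual queryReady query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pods_provisional ("
    " pod TEXT, group_name TEXT, group_size INTEGER)"
)
conn.executemany(
    "INSERT INTO pods_provisional VALUES (?, ?, ?)",
    [
        ("a-0", "group-a", 2),
        ("a-1", "group-a", 2),  # group-a: 2 of 2 pods present
        ("b-0", "group-b", 3),  # group-b: 1 of 3 pods present
    ],
)

# Keep only groups whose pod count has reached the declared group size.
QUERY_READY = """
SELECT group_name
FROM pods_provisional
GROUP BY group_name, group_size
HAVING COUNT(*) >= group_size
ORDER BY group_name
"""

ready = [row[0] for row in conn.execute(QUERY_READY)]
print(ready)  # ['group-a']
```

Only `group-a` is returned because all of its pods have arrived, while `group-b` stays in the provisional table until the rest of its pods show up.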

I realized that we need flexibility in defining queue strategies,
not just in how the worker is designed, but also how the queue
strategy handles the schedule function. This is an overhaul (not
quite done yet) that does that. I still need to plug the final
query back in to move provisional to the worker queue. Also
note that it looks like we have priority, pending, and other
insert params to play with.

Signed-off-by: vsoch
This changeset includes a query that will update Args (node)
from within a worker job so we can send them back to the
scheduler. The last piece I am working on is the command so that
the initial query will move provisional pods (and groups) from
the provisional table to the worker queue.

Signed-off-by: vsoch
@vsoch vsoch merged commit 7add491 into add-queue-and-gut-out Jul 29, 2024
@vsoch vsoch deleted the reorganize-queue-manager branch July 29, 2024 07:06
@vsoch vsoch restored the reorganize-queue-manager branch July 29, 2024 07:06
@vsoch vsoch mentioned this pull request Jul 29, 2024