
Add queue and gut out #5

Merged
merged 8 commits into main on Jul 29, 2024

Conversation

@vsoch (Member) commented Jul 29, 2024

This changeset will refactor the current setup to replace the kube-scheduler (schedulingCycle and bindingCycle) with fluxion. This includes a new database and queue manager strategy (described below; content taken from #2). There is still detail work to do before merging: specifically, we need to better coordinate the required in-tree plugins (those that, for example, advise on volume binding) with fluxion, and we need to use reservations more intelligently.

I realized that we need flexibility in defining queue strategies, not just in how the worker is designed, but also how the queue strategy handles the schedule function. The diagram below shows the current design, without some detail (but enough to understand I hope):

[image: diagram of the current queue and worker design]

The queue strategy I'm starting with is FCFS with backfill, which is (sort of) what Kubernetes can do, assuming it schedules groups without clogging the queue (smaller groups that can be scheduled are allowed to fill in). This work was completed in #2, and we can now move on to tweaking the details. See the README for a bullet list of things to think about and do.

The signal from schedule to binding works by way of subscriptions: we update worker args in the database with the node assignment, which can then trigger the binding. This currently assumes each group has homogeneous pods, but in the future it does not have to.

vsoch and others added 8 commits July 26, 2024 07:02
This changeset adds a new queue to the fluxnetes in-tree plugin,
which currently knows how to accept a pod for work, and then
just sleep (basically reschedule for 5 seconds into the future).
This is not currently hooked into Kubernetes scheduling because
I want to develop the functionality I need first, in parallel,
before splicing it in. I should still be able to schedule to
Fluxion and trigger cleanup when the actual job is done. I
think we might do better to remove the group CRD too - it would
hugely simplify things (the in-tree plugin would barely need
anything aside from the fluxion interactions and queue) and
instead we can keep track of group names and counts (that are
still growing) in a separate table, since we already have postgres.
There are two things I am not sure about. First, the extent to which
in-tree plugins support scheduling: I can either keep them (and
then would need to integrate them) or move their functionality
into what fluxion can offer. I suspect they add supplementary
features, since we were able to disable most of them. Second
(and I will figure this out), given that we customize the plugin
framework, where is the right place to put sort? If we are adding
pods to a table, we will need to store the same metadata (priority,
timestamp, etc.) to allow for an equivalent sort.

Signed-off-by: vsoch <[email protected]>
This changeset creates separate worker and podgroup fluxnetes
package files, and they handle worker definition and pod group
parsing functions, respectively. Up to this point we can now
1. retrieve a new pod and see if it is in a group.
2. if no (size 1), add it to the worker queue immediately;
   if yes (size N), add it to the pods table to be inspected later.
3. retrieve the podspec in the work function.
4. parse it back into a podspec and ask flux for the allocation.
I next need to do two things. First, figure out how to pass
the node assignment back to the scheduler - I am hoping
the job object "JobRow" can be modified to add metadata.
Then we need to write the function to run at the end of
a schedule cycle that moves groups from the provisional
table to the worker queue.

Signed-off-by: vsoch <[email protected]>
I realized that we need flexibility in defining queue strategies,
not just in how the worker is designed, but also how the queue
strategy handles the schedule function. This is an overhaul (not
quite done yet) that does that. I still need to plug the final
query back in to move provisional to the worker queue. Also
note that it looks like we have priority, pending, and other
insert params to play with.

Signed-off-by: vsoch <[email protected]>
This changeset includes a query that will update Args (node)
from within a worker job so we can send them back to the
scheduler. I am lastly working on the command so that the
initial query will move provisional pods (and groups) from
the provisional table to the worker queue

Signed-off-by: vsoch <[email protected]>
We do not need most of the fluence code-base with this new design.
I am keeping fluxnetes as an in-tree plugin because it seems like
an optimization to still be able to sort the initial queue of
pods coming in. I am also thinking it would be unwise to completely
remove in-tree plugins - they serve important purposes! But we do
need to integrate them more intelligently with this new design.

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch merged commit 223854d into main Jul 29, 2024