Add queue and gut out #5
Commits on Jul 26, 2024
-
feat: new queue to handle groups
This changeset adds a new queue to the fluxnetes in-tree plugin. The queue currently knows how to accept a pod for work and then just sleep (in practice, reschedule the job for 5 seconds into the future). It is not yet hooked into Kubernetes scheduling because I want to develop the functionality I need first, in parallel, before splicing it in. I should still be able to schedule to Fluxion and trigger cleanup when the actual job is done. I think we might do better to remove the group CRD too: it would hugely simplify things (the in-tree plugin would need little beyond the Fluxion interactions and the queue), and we can instead keep track of group names and counts (which are still growing) in a separate table, since we already have Postgres.
Two things I am not sure about:
1. The extent to which in-tree plugins support scheduling. I can either keep them (and would then need to integrate them) or move their functionality into what Fluxion can offer. I suspect they add supplementary features, since we were able to disable most of them.
2. Given that we customize the plugin framework, where the right place to put the sort is (I will figure this out). If we are adding pods to a table, we will need to store the same metadata (priority, timestamp, etc.) to allow for an equivalent sort.
Signed-off-by: vsoch <[email protected]>
Commit: 8047000
Commits on Jul 28, 2024
-
worker: retrieval of podspec and AskFlux
This changeset creates separate worker and podgroup files in the fluxnetes package, which handle the worker definition and the pod group parsing functions, respectively. Up to this point we can:
1. Retrieve a new pod and see if it is in a group.
2. If not (size 1), add it to the worker queue immediately. If so (size N), add it to the pods table to be inspected later.
3. Retrieve the podspec in the work function.
4. Parse it back into a podspec and ask Flux for the allocation.
I next need to do two things. First, figure out how to pass the node assignment back to the scheduler; I am hoping the job object "JobRow" can be modified to add metadata. Second, write the function that runs at the end of a schedule cycle and moves groups from the provisional table to the worker queue.
Signed-off-by: vsoch <[email protected]>
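The routing described in this commit message (single pods go straight to the worker queue, grouped pods wait in a table) can be sketched as follows. Everything here is an assumption for illustration: the `Pod`, `Queue`, `Enqueue`, and `Promote` names are hypothetical, the tables are in-memory maps rather than Postgres, and `Promote` is one guess at the not-yet-written end-of-cycle function, releasing a group once all of its members have arrived.

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down stand-in for a Kubernetes pod.
type Pod struct {
	Name      string
	Group     string // empty when the pod is not part of a group
	GroupSize int    // declared size of the group, ignored for single pods
}

// Queue holds an in-memory worker queue and provisional table.
type Queue struct {
	Worker      []string         // names of pods or groups ready for a worker
	Provisional map[string][]Pod // group name -> members seen so far
}

// NewQueue returns an empty queue.
func NewQueue() *Queue {
	return &Queue{Provisional: map[string][]Pod{}}
}

// Enqueue sends single pods straight to the worker queue and parks
// grouped pods in the provisional table for later inspection.
func (q *Queue) Enqueue(p Pod) {
	if p.Group == "" || p.GroupSize <= 1 {
		q.Worker = append(q.Worker, p.Name)
		return
	}
	q.Provisional[p.Group] = append(q.Provisional[p.Group], p)
}

// Promote moves every group whose member count has reached its declared
// size from the provisional table to the worker queue.
func (q *Queue) Promote() {
	for name, pods := range q.Provisional {
		if len(pods) >= pods[0].GroupSize {
			q.Worker = append(q.Worker, name)
			delete(q.Provisional, name)
		}
	}
}

func main() {
	q := NewQueue()
	q.Enqueue(Pod{Name: "solo"})
	q.Enqueue(Pod{Name: "g-0", Group: "g", GroupSize: 2})
	q.Promote() // group "g" is incomplete, so it stays provisional
	q.Enqueue(Pod{Name: "g-1", Group: "g", GroupSize: 2})
	q.Promote() // group "g" is now complete and moves to the worker queue
	fmt.Println(q.Worker, len(q.Provisional))
}
```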
Commit: 633bf36
-
queue-manager: reorganize into strategies
I realized that we need flexibility in defining queue strategies, not just in how the worker is designed, but also in how the queue strategy handles the schedule function. This overhaul (not quite done yet) adds that flexibility. I still need to plug the final query back in to move provisional pods to the worker queue. Also note that it looks like we have priority, pending, and other insert params to play with.
Signed-off-by: vsoch <[email protected]>
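One way to picture a pluggable queue strategy is an interface that owns the schedule step, with the manager delegating to whichever strategy is configured. This is only a sketch under assumptions: `QueueStrategy`, `FCFS`, `SmallestFirst`, and `Manager` are hypothetical names and do not reflect the actual types in the changeset.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a hypothetical unit of work waiting in the provisional table.
type Job struct {
	Name string
	Size int
}

// QueueStrategy is a hypothetical interface: each strategy decides how
// provisional jobs are ordered when they are handed to workers.
type QueueStrategy interface {
	Name() string
	Schedule(provisional []Job) []Job
}

// FCFS releases jobs in arrival order.
type FCFS struct{}

func (FCFS) Name() string { return "fcfs" }

func (FCFS) Schedule(provisional []Job) []Job {
	out := make([]Job, len(provisional))
	copy(out, provisional)
	return out
}

// SmallestFirst releases the smallest jobs first, leaving the input untouched.
type SmallestFirst struct{}

func (SmallestFirst) Name() string { return "smallest-first" }

func (SmallestFirst) Schedule(provisional []Job) []Job {
	out := make([]Job, len(provisional))
	copy(out, provisional)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Size < out[j].Size })
	return out
}

// Manager delegates each schedule cycle to its configured strategy.
type Manager struct {
	Strategy QueueStrategy
}

// Cycle runs one schedule cycle over the provisional jobs.
func (m Manager) Cycle(provisional []Job) []Job {
	return m.Strategy.Schedule(provisional)
}

func main() {
	jobs := []Job{{Name: "big", Size: 4}, {Name: "small", Size: 1}}
	for _, s := range []QueueStrategy{FCFS{}, SmallestFirst{}} {
		m := Manager{Strategy: s}
		fmt.Println(s.Name(), m.Cycle(jobs))
	}
}
```

Keeping the schedule step behind an interface means a new strategy can change both the ordering and the worker behavior without touching the manager.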
Commit: 69e7624
Commits on Jul 29, 2024
-
notify: add working events to send nodes
This changeset includes a query that updates Args (the node assignment) from within a worker job so we can send the nodes back to the scheduler. The last piece I am working on is the command so that the initial query moves provisional pods (and groups) from the provisional table to the worker queue.
Signed-off-by: vsoch <[email protected]>
Commit: 73d587d
-
Merge pull request #2 from converged-computing/reorganize-queue-manager
queue-manager: reorganize into strategies
Commit: 7add491
-
We do not need most of the fluence codebase with this new design. I am keeping fluxnetes as an in-tree plugin because it seems like an optimization to still be able to sort the initial queue of incoming pods. I also think it would be unwise to completely remove in-tree plugins, since they serve important purposes! But we do need to integrate them more intelligently with this new design.
Signed-off-by: vsoch <[email protected]>
Commit: 3361782
-
Merge pull request #6 from converged-computing/remove-fluence
refactor: remove fluence
Commit: c97121b
-
docs: add new design image and notes
Signed-off-by: vsoch <[email protected]>
Commit: 532fc0b