Add queue and gut out #5

Merged
merged 8 commits into from
Jul 29, 2024

Commits on Jul 26, 2024

  1. feat: new queue to handle groups

    This changeset adds a new queue to the fluxnetes in-tree plugin,
    which currently knows how to accept a pod for work, and then
    just sleep (basically reschedule for 5 seconds into the future).
    This is not currently hooked into Kubernetes scheduling because
    I want to develop the functionality I need first, in parallel,
    before splicing it in. I should still be able to schedule to
    Fluxion and trigger cleanup when the actual job is done. I
    think we might do better to remove the group CRD too - it would
    hugely simplify things (the in-tree plugin would barely need
    anything aside from the fluxion interactions and queue) and
    instead we can keep track of group names and counts (that are
    still growing) in a separate table, since we already have postgres.
    I am unsure about two things. The first is the extent to which
    in-tree plugins support scheduling. I can either keep them (and
    then would need to integrate) or move their functionality
    into what fluxion can offer. I suspect they add supplementary
    features since we were able to disable most of them. The second
    (which I will figure out) is where the right place to put sort
    is, given that we customize the plugin framework. If we are
    adding pods to a table we will need to store the same metadata
    (priority, timestamp, etc.) to allow for an equivalent sort.
    
    Signed-off-by: vsoch <[email protected]>
    vsoch committed Jul 26, 2024
    8047000

Commits on Jul 28, 2024

  1. worker: retrieval of podspec and AskFlux

    This changeset creates separate worker and podgroup fluxnetes
    package files, which handle worker definition and pod group
    parsing functions, respectively. Up to this point we can now:
    1. retrieve a new pod and see if it is in a group.
    2. if no (size 1), add it to the worker queue immediately;
       if yes (size N), add it to the pods table to be inspected later.
    3. retrieve the podspec in the work function.
    4. parse it back into a podspec and ask flux for the allocation.
    I next need to do two things. First, figure out how to pass
    the node assignment back to the scheduler - I am hoping
    the job object "JobRow" can be modified to add metadata.
    Then we need to write the function to run at the end of
    a schedule cycle that moves groups from the provisional
    table to the worker queue.
    
    Signed-off-by: vsoch <[email protected]>
    vsoch committed Jul 28, 2024
    633bf36
  2. queue-manager: reorganize into strategies

    I realized that we need flexibility in defining queue strategies,
    not just in how the worker is designed, but also how the queue
    strategy handles the schedule function. This is an overhaul (not
    quite done yet) that does that. I still need to plug the final
    query back in to move provisional to the worker queue. Also
    note that it looks like we have priority, pending, and other
    insert params to play with.
    
    Signed-off-by: vsoch <[email protected]>
    vsoch committed Jul 28, 2024
    69e7624
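
The two commits above describe routing single pods straight to the worker queue, holding grouped pods in a provisional table, and letting a queue strategy decide how the schedule function moves groups across. The sketch below illustrates that flow with in-memory slices and maps standing in for the postgres tables. The `Strategy` interface, `GroupStrategy` type, and the rule that a group moves only once all of its pods have arrived are assumptions made for illustration, not the actual fluxnetes design.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical, pared-down pod: a name plus group metadata.
type Pod struct {
	Name      string
	Group     string // empty when the pod is not in a group
	GroupSize int    // expected number of pods in the group
}

// Strategy is a hypothetical interface for a queue strategy.
type Strategy interface {
	// Enqueue accepts a newly seen pod.
	Enqueue(p Pod)
	// Schedule runs at the end of a schedule cycle and returns how
	// many pods moved from the provisional table to the worker queue.
	Schedule() int
}

// GroupStrategy keeps both "tables" in memory (an assumption; the
// commits describe postgres tables).
type GroupStrategy struct {
	Worker      []Pod
	Provisional map[string][]Pod
}

func NewGroupStrategy() *GroupStrategy {
	return &GroupStrategy{Provisional: map[string][]Pod{}}
}

// Enqueue sends single pods (size 1) to the worker queue right away
// and parks grouped pods (size N) in the provisional table.
func (s *GroupStrategy) Enqueue(p Pod) {
	if p.Group == "" || p.GroupSize <= 1 {
		s.Worker = append(s.Worker, p)
		return
	}
	s.Provisional[p.Group] = append(s.Provisional[p.Group], p)
}

// Schedule moves every group whose pods have all arrived.
func (s *GroupStrategy) Schedule() int {
	names := make([]string, 0, len(s.Provisional))
	for name := range s.Provisional {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order for the example
	moved := 0
	for _, name := range names {
		pods := s.Provisional[name]
		if len(pods) < pods[0].GroupSize {
			continue // group is still growing
		}
		s.Worker = append(s.Worker, pods...)
		delete(s.Provisional, name)
		moved += len(pods)
	}
	return moved
}

func main() {
	s := NewGroupStrategy()
	var _ Strategy = s // compile-time check that the interface is met

	s.Enqueue(Pod{Name: "solo"})
	s.Enqueue(Pod{Name: "a-0", Group: "a", GroupSize: 2})
	fmt.Println("moved:", s.Schedule(), "worker:", len(s.Worker))

	s.Enqueue(Pod{Name: "a-1", Group: "a", GroupSize: 2})
	fmt.Println("moved:", s.Schedule(), "worker:", len(s.Worker))
}
```

Keeping `Enqueue` and `Schedule` behind one interface is one way to get the flexibility the second commit asks for: a different strategy could change how pods are admitted and how groups are released without touching the worker.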

Commits on Jul 29, 2024

  1. notify: add working events to send nodes

    This changeset includes a query that will update Args (node)
    from within a worker job so we can send them back to the
    scheduler. The last piece I am working on is the command so
    that the initial query will move provisional pods (and groups)
    from the provisional table to the worker queue.
    
    Signed-off-by: vsoch <[email protected]>
    vsoch committed Jul 29, 2024
    73d587d
  2. Merge pull request #2 from converged-computing/reorganize-queue-manager

    queue-manager: reorganize into strategies
    vsoch authored Jul 29, 2024
    7add491
  3. refactor: remove fluence

    We do not need most of the fluence code-base with this new design.
    I am keeping fluxnetes as an in-tree plugin because it seems like
    an optimization to still be able to sort the initial queue of
    pods coming in. I am also thinking it would be unwise to completely
    remove in-tree plugins - they serve important purposes! But we do
    need to more intelligently integrate them with this new design.
    
    Signed-off-by: vsoch <[email protected]>
    vsoch committed Jul 29, 2024
    Configuration menu
    Copy the full SHA
    3361782 View commit details
    Browse the repository at this point in the history
  4. Merge pull request #6 from converged-computing/remove-fluence

    refactor: remove fluence
    vsoch authored Jul 29, 2024
    c97121b
  5. docs: add new design image and notes

    Signed-off-by: vsoch <[email protected]>
    vsoch committed Jul 29, 2024
    532fc0b
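
The "notify" commit above has a worker job record its node assignment so it can be sent back to the scheduler. The commit does this with a database query that updates the job's Args. The sketch below swaps in a buffered Go channel to show only the shape of that hand-off. The `Assignment` type, the `work` function, and the `ask` callback (a placeholder for the request to Fluxion) are all hypothetical.

```go
package main

import "fmt"

// Assignment is a hypothetical event carrying the nodes chosen for a pod.
type Assignment struct {
	Pod   string
	Nodes []string
}

// work stands in for a worker job. ask is a placeholder for the
// allocation request; it returns the nodes and whether the request
// succeeded. On success the assignment is sent to the scheduler side.
func work(pod string, ask func(string) ([]string, bool), events chan<- Assignment) bool {
	nodes, ok := ask(pod)
	if !ok {
		return false // no allocation yet, nothing to report
	}
	events <- Assignment{Pod: pod, Nodes: nodes}
	return true
}

func main() {
	// Buffered so the worker does not block in this single-goroutine demo.
	events := make(chan Assignment, 1)

	// A fake allocator that always grants one node.
	fakeAsk := func(pod string) ([]string, bool) {
		return []string{"node-1"}, true
	}

	if work("pod-a", fakeAsk, events) {
		a := <-events
		fmt.Println(a.Pod, a.Nodes)
	}
}
```

A channel only works within one process; persisting the assignment on the job row, as the commit describes, lets the scheduler read it back even when the worker runs elsewhere.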