docs: add new design image and notes

Signed-off-by: vsoch <[email protected]>
converged-computing · Jul 29, 2024 · 532fc0b
1 parent c97121b
commit 532fc0b

Showing 7 changed files with 37 additions and 14 deletions.
diff --git a/README.md b/README.md
@@ -109,6 +109,7 @@ SELECT group_name, group_size from pods_provisional;
 - [ ] The queue should inherit (and return) the start time (when the pod was first seen) "start" in scheduler.go
 - [ ] The provisional -> scheduled should do a sort for the timestamp (I mostly just forgot this)!
 - [ ] when in basic working state, add back build and test workflows
+- [ ] There should be a label (or existing value in the pod) to indicate an expected completion time (this is for Fluxion). We can have a worker task that explicitly cleans up the pods when the job should be completed.
 - [x] remove fluence previous code
 
 ## License

diff --git a/docs/README.md b/docs/README.md
@@ -2,13 +2,39 @@
 
 ## Design Notes
 
+> July 29, 2024
+
+Today we are merging in the "gut out and refactor" branch that does the following:
+
+ - Add Queue Manager (and queues design, shown and described below) 
+ - Remove what remains of Fluence (now Fluxnetes is just a shell to provide sort)
+ - Replace the default scheduler (schedulingCycle) with this approach (we still use bindingCycle)
+
+The new queue design is based on a producer-consumer model: there are workers (as many as we choose), each associated with a different queue. What a worker does depends on both the queue and the queuing strategy (I say this because two different strategies can share a common worker design). Before we hit a worker queue, we have a provisional queue step. This means that:
+
+1. Incoming pods are added to a provisional table with their name, group, timestamp, and expected size.
+2. Pods are moved from the provisional table to the worker queue when they reach quorum (the minimum size).
+3. At this point, they go into the hands of a Queue Manager, ordered by their group timestamp.
+
+For the first strategy that I've added, which I'm calling FCFS with backfill, the worker task makes a call to Fluxion, specifically a `MatchAllocate`. I am planning to change this to a `MatchAllocateElseReserve` so I can "snooze" the job to trigger again in the future when it cannot be scheduled then and there. When the work is allocated, the metadata for the job (specifically args for "Nodes") is updated to carry the nodes forward to events that are listening for them. A subscription event is sent back to the main scheduler, which receives the nodes and then performs binding. The pods are received as a group, meaning the binding of the group happens at the same time (in a loop, still one by one, but guaranteed to be in that order, I think) and the work is run. Some (high level) work that still needs to be done:
+
+- The provisional queue should be hardened up and provided (and exposed) as explicit interfaces (it is part of the main fluxnetes queue module now).
+- A pod label for an expected time (and a default time) could be used so every job has an expected end time (for Fluxion). A cancel queue would handle this.
+- The in-tree plugin outputs (needs for volumes, and what nodes can provide) need to be exposed to Fluxion. Fluxion can be told one of:
+  - "These nodes aren't possible for this work"
+  - "These are the only nodes you can consider for this work"
+  - "Here is a resource requirement you know about in your graph"
+
+There are more features that still need to be worked on and added (see the README.md of this repository) but this is a good start! One thing I am tickled by is that this does not need to be Kubernetes specific. It happens to be implemented within it, but the only detail that is relevant to Kubernetes is having a pod derive the underlying unit of work. All of the logic could be moved outside of it, with some other unit of work.
+
+![images/fluxnetes.png](images/fluxnetes.png)
+
 > July 10th, 2024
 
 Fluxnetes is functioning, on equal par with what fluence does to schedule and cancel pods. The difference is that I removed the webhook and controller to create PodGroup, and (for the time being) am relying on the user to create them. The reason is because I don't want to add the complexity of a new controller and webhook to Kubernetes. And instead of doing a custom CR (custom resource) for our PodGroup, I am using the one from coscheduling. This allows install of the module without breaking smaller level dependencies. I'm not sure why that works, but it does!
 
 So the current state is that Fluxnetes is scheduling! My next step is to slowly add components for the new design, ensuring I don't break anything as I go, and going as far with that approach as I can until I need to swap it in. Then I'll likely need to be a bit more destructive and careful.
 
-
 > This was a group update on July 8th, 2024
 
 An update on design thinking for what I'm calling "fluxnetes" - a next step experiment for Kubernetes and Fluxion integration. Apologies in advance this is long - I do a lot of thinking and have desire to express it, because I don't think the design process (our thinking!) is always shared transparently. To start, there are two strategies to take:
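
The provisional step described in the July 29 notes above (pods recorded with name, group, timestamp, and expected size, then released at quorum in group-timestamp order) can be sketched in memory as follows. This is only an illustration under stated assumptions: the README queries a `pods_provisional` table, so the real implementation is database-backed, and the names `provisionalPod`, `readyGroup`, and `readyGroups` are hypothetical and do not come from this commit.

```go
package main

import (
	"fmt"
	"sort"
)

// provisionalPod is a hypothetical stand-in for one row of the provisional
// table: pod name, group, first-seen timestamp, and expected group size.
type provisionalPod struct {
	Name      string
	Group     string
	Seen      int64 // first-seen time, e.g. Unix microseconds (assumed unit)
	GroupSize int
}

// readyGroup is a group that has reached quorum and can move to a worker queue.
type readyGroup struct {
	Group string
	First int64 // earliest Seen value among the group's pods
	Pods  []string
}

// readyGroups returns every group whose pod count has reached its expected
// size, ordered by group timestamp (ties broken by group name).
func readyGroups(pods []provisionalPod) []readyGroup {
	byGroup := map[string]*readyGroup{}
	sizes := map[string]int{}
	for _, p := range pods {
		g, ok := byGroup[p.Group]
		if !ok {
			g = &readyGroup{Group: p.Group, First: p.Seen}
			byGroup[p.Group] = g
			sizes[p.Group] = p.GroupSize
		}
		if p.Seen < g.First {
			g.First = p.Seen
		}
		g.Pods = append(g.Pods, p.Name)
	}

	ready := []readyGroup{}
	for name, g := range byGroup {
		if len(g.Pods) >= sizes[name] {
			ready = append(ready, *g)
		}
	}
	sort.Slice(ready, func(i, j int) bool {
		if ready[i].First != ready[j].First {
			return ready[i].First < ready[j].First
		}
		return ready[i].Group < ready[j].Group
	})
	return ready
}

func main() {
	pods := []provisionalPod{
		{Name: "b-0", Group: "default/b", Seen: 20, GroupSize: 1},
		{Name: "a-0", Group: "default/a", Seen: 10, GroupSize: 2},
		{Name: "a-1", Group: "default/a", Seen: 12, GroupSize: 2},
		{Name: "c-0", Group: "default/c", Seen: 5, GroupSize: 2}, // not at quorum yet
	}
	for _, g := range readyGroups(pods) {
		fmt.Println(g.Group, g.Pods)
	}
}
```

Groups that have not reached quorum simply stay in the provisional set until more of their pods arrive, which is why `default/c` is skipped here even though it was seen first.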

diff --git a/docs/images/fluxnetes-v1.png b/docs/images/fluxnetes-v1.png
diff --git a/docs/images/fluxnetes.png b/docs/images/fluxnetes.png
diff --git a/kubernetes/pkg/fluxnetes/fluxnetes.go b/kubernetes/pkg/fluxnetes/fluxnetes.go
@@ -79,8 +79,8 @@ func (fluxnetes *Fluxnetes) Less(podInfo1, podInfo2 *framework.QueuedPodInfo) bo
 	// which is what fluxnetes needs to distinguish between namespaces. Just the
 	// name could be replicated between different namespaces
 	// TODO add some representation of PodGroup back
-	name1 := groups.GetPodGroupName(podInfo1.Pod)
-	name2 := groups.GetPodGroupName(podInfo2.Pod)
+	name1 := groups.GetPodGroupFullName(podInfo1.Pod)
+	name2 := groups.GetPodGroupFullName(podInfo2.Pod)
 
 	// Try for creation time first, and fall back to naming
 	creationTime1 := groups.GetPodCreationTimestamp(podInfo1.Pod)

diff --git a/kubernetes/pkg/fluxnetes/group/group.go b/kubernetes/pkg/fluxnetes/group/group.go
@@ -32,6 +32,13 @@ func GetPodGroupName(pod *corev1.Pod) string {
 	return groupName
 }
 
+// GetPodGroupFullName gets the namespaced group name from pod labels.
+// This is primarily for sorting, so we consider the namespace too.
+func GetPodGroupFullName(pod *corev1.Pod) string {
+	groupName := GetPodGroupName(pod)
+	return fmt.Sprintf("%v/%v", pod.Namespace, groupName)
+}
+
 // getPodGroupSize gets the group size, first from label then default of 1
 func GetPodGroupSize(pod *corev1.Pod) (int32, error) {
 

diff --git a/kubernetes/pkg/fluxnetes/labels/labels.go b/kubernetes/pkg/fluxnetes/labels/labels.go
@@ -1,8 +1,6 @@
 package labels
 
 import (
-	"fmt"
-
 	v1 "k8s.io/api/core/v1"
 )
 
@@ -22,12 +20,3 @@ const (
 func GetPodGroupLabel(pod *v1.Pod) string {
 	return pod.Labels[PodGroupLabel]
 }
-
-// GetPodGroupFullName get namespaced group name from pod labels
-func GetPodGroupFullName(pod *v1.Pod) string {
-	groupName := GetPodGroupLabel(pod)
-	if len(groupName) == 0 {
-		return ""
-	}
-	return fmt.Sprintf("%v/%v", pod.Namespace, groupName)
-}
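
To illustrate why `Less` now compares namespaced group names, here is a minimal sketch of the ordering suggested by the comment "Try for creation time first, and fall back to naming". The full body of `Less` is not shown in this diff, so the comparison logic below is an assumption, and `pod`, `fullName`, and `less` are simplified hypothetical stand-ins for the Kubernetes types and the fluxnetes helpers.

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for corev1.Pod.
type pod struct {
	Namespace string
	Group     string
	Created   int64 // creation time, Unix seconds (assumed unit)
}

// fullName mirrors the "%v/%v" format used by GetPodGroupFullName.
func fullName(p pod) string {
	return fmt.Sprintf("%v/%v", p.Namespace, p.Group)
}

// less orders by creation time first and falls back to the namespaced
// group name, so same-named groups in different namespaces stay distinct.
func less(a, b pod) bool {
	if a.Created != b.Created {
		return a.Created < b.Created
	}
	return fullName(a) < fullName(b)
}

func main() {
	a := pod{Namespace: "team-a", Group: "job", Created: 100}
	b := pod{Namespace: "team-b", Group: "job", Created: 100}
	fmt.Println(fullName(a), fullName(b), less(a, b))
}
```

With the bare group name, `a` and `b` above would compare as equal on the fallback; including the namespace gives the sort a deterministic tie-break.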