Merge pull request #5 from converged-computing/add-queue-and-gut-out
Add queue and gut out
vsoch authored Jul 29, 2024
2 parents 2c1a903 + 532fc0b commit 223854d
Showing 37 changed files with 1,557 additions and 1,612 deletions.
6 changes: 4 additions & 2 deletions Dockerfile
@@ -10,10 +10,10 @@ ENV ARCH=${ARCH}
# but since we are adding custom kube-scheduler, and we don't need the controller
# I moved the build logic up here instead of using hack/build-images.sh

RUN apt-get update && apt-get install -y wget git vim build-essential iputils-ping
RUN apt-get update && apt-get install -y wget git vim build-essential iputils-ping postgresql-client curl

# Install Go
ENV GO_VERSION=1.22.2
ENV GO_VERSION=1.22.5
RUN wget https://go.dev/dl/go${GO_VERSION}.linux-amd64.tar.gz && tar -xvf go${GO_VERSION}.linux-amd64.tar.gz && \
mv go /usr/local && rm go${GO_VERSION}.linux-amd64.tar.gz

@@ -28,6 +28,8 @@ COPY ${K8S_UPSTREAM} .
RUN go get github.com/patrickmn/go-cache && \
go get sigs.k8s.io/controller-runtime/pkg/client && \
go get sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1 && \
go get github.com/riverqueue/river && \
go get github.com/riverqueue/river/riverdriver/riverpgxv5 && \
go work vendor && \
make WHAT=cmd/kube-scheduler && \
cp /go/src/k8s.io/kubernetes/_output/local/go/bin/kube-scheduler /bin/kube-scheduler
8 changes: 6 additions & 2 deletions Makefile
@@ -15,11 +15,12 @@ ARCH ?= amd64
# These are passed to build the sidecar
REGISTRY ?= ghcr.io/flux-framework
SIDECAR_IMAGE ?= fluxnetes-sidecar:latest
POSTGRES_IMAGE ?= fluxnetes-postgres:latest
SCHEDULER_IMAGE ?= fluxnetes

.PHONY: all build build-sidecar clone update push push-sidecar push-fluxnetes
.PHONY: all build build-sidecar clone update push push-sidecar push-fluxnetes build-postgres

all: prepare build-sidecar build
all: prepare build-sidecar build build-postgres

upstreams:
mkdir -p $(UPSTREAMS)
@@ -48,4 +49,7 @@ push-fluxnetes:
build-sidecar:
make -C ./src LOCAL_REGISTRY=${REGISTRY} LOCAL_IMAGE=${SIDECAR_IMAGE}

build-postgres:
docker build -f src/build/postgres/Dockerfile -t ${REGISTRY}/${POSTGRES_IMAGE} .

push: push-sidecar push-fluxnetes
44 changes: 40 additions & 4 deletions README.md
@@ -2,7 +2,7 @@

![docs/images/fluxnetes.png](docs/images/fluxnetes.png)

Fluxnetes is a combination of Kubernetes and [Fluence](https://github.com/flux-framework/flux-k8s), both of which use the HPC-grade pod scheduling [Fluxion scheduler](https://github.com/flux-framework/flux-sched) to schedule pod groups to nodes.
Fluxnetes is a combination of Kubernetes and [Fluence](https://github.com/flux-framework/flux-k8s), both of which use the HPC-grade pod scheduling [Fluxion scheduler](https://github.com/flux-framework/flux-sched) to schedule pod groups to nodes. For our queue, we use [river](https://riverqueue.com/docs) backed by a Postgres database. The database is deployed alongside fluxnetes and could be customized to use an operator instead.
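
As a rough sketch (not code from this repository; the wiring below follows river's documented Go API and the `DATABASE_URL` value set on the scheduler deployment), the queue client is created against that Postgres database roughly like so:

```go
// Hedged sketch: connect river to the Postgres service deployed by the chart.
// Function and variable names here are illustrative, not this repository's code.
package queue

import (
	"context"
	"os"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

func newQueueClient(ctx context.Context, workers *river.Workers) (*river.Client[pgx.Tx], error) {
	// DATABASE_URL is set on the scheduler deployment (see chart/templates/deployment.yaml).
	pool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		return nil, err
	}
	return river.NewClient(riverpgxv5.New(pool), &river.Config{
		Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 10}},
		Workers: workers,
	})
}
```

Workers are registered on `workers` with `river.AddWorker` before the client is created; the number of workers per queue is a choice we can tune.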

**Important** This is an experiment, and is under development. I will change this design a million times - it's how I tend to learn and work. I'll share updates when there is something to share. It deploys but does not work yet!

@@ -37,14 +37,15 @@ Then you can deploy as follows:
```bash
./hack/quick-build-kind.sh
```
You'll then have the fluxnetes service running, along with the scheduler-plugins controller, which we
You'll then have the fluxnetes service running, a postgres database (for the job queue), along with the scheduler-plugins controller, which we
currently need in order to use PodGroup.

```bash
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
fluxnetes-66575b59d8-ghx8h 2/2 Running 0 8m53s
scheduler-plugins-controller-8676df7769-ss9kz 1/1 Running 0 8m53s
fluxnetes-6954cdcf64-gv7s7 2/2 Running 0 87s
postgres-c8d55999c-t6dtt 1/1 Running 0 87s
scheduler-plugins-controller-8676df7769-jvtwp 1/1 Running 0 87s
```

You can then create a job:
@@ -75,6 +76,41 @@ scheduler-plugins-controller-8676df7769-ss9kz 1/1 Running 0 10
And that's it! This is fully working, but that only means we can move on to the new design next.
See [docs](docs) for notes on that.

## Development

### Debugging Postgres

It is often helpful to shell into the postgres container to see the database directly:

```bash
kubectl exec -it postgres-597db46977-9lb25 -- bash
psql -U postgres

# Connect to database
\c

# list databases
\l

# show tables
\dt

# test a query
SELECT group_name, group_size from pods_provisional;
```

### TODO

- [ ] I'd like a more efficient query (or strategy) to move pods from provisional into the worker queue. Right now I have three queries, and that's too many.
- [ ] Discussion about how to respond to a "failed" allocation request (meaning we just can't give nodes now, likely to happen a lot). Maybe we need to do a reservation instead?
- [ ] I think maybe we should do a match allocate else reserve instead (see issue [here](https://github.com/converged-computing/fluxnetes/issues/4))
- [ ] Restarting with postgres shouldn't hit CrashLoopBackOff while the database isn't ready yet
- [ ] In-tree registry plugins (that are related to resources) should be run first to inform fluxion what nodes not to bind, where there are volumes, etc.
- [ ] The queue should inherit (and return) the start time (when the pod was first seen) as "start" in scheduler.go
- [ ] The provisional -> scheduled move should sort by timestamp (I mostly just forgot this)!
- [ ] When in a basic working state, add back the build and test workflows
- [ ] There should be a label (or existing value in the pod) to indicate an expected completion time (this is for Fluxion). We can have a worker task that explicitly cleans up the pods when the job should be completed.
- [x] remove fluence previous code

## License

15 changes: 6 additions & 9 deletions chart/templates/configmap.yaml
@@ -1,4 +1,3 @@
{{- if .Values.plugins.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
@@ -14,6 +13,11 @@ data:
# Compose all plugins in one profile
- schedulerName: {{ .Values.scheduler.name }}
plugins:
queueSort:
enabled:
{{- range $.Values.plugins.enabled }}
- name: {{ title . }}
{{- end }}
preBind:
disabled:
- name: {{ .Values.scheduler.name }}
@@ -48,17 +52,10 @@ data:
- name: {{ title . }}
{{- end }}
multiPoint:
enabled:
{{- range $.Values.plugins.enabled }}
- name: {{ title . }}
{{- end }}
disabled:
{{- range $.Values.plugins.disabled }}
- name: {{ title . }}
{{- end }}
{{- if $.Values.pluginConfig }}
pluginConfig: {{ toYaml $.Values.pluginConfig | nindent 6 }}
{{- end }}
{{- /* TODO: wire CRD installation with enabled plugins. */}}
{{- end }}
{{- end }}
22 changes: 18 additions & 4 deletions chart/templates/deployment.yaml
@@ -66,6 +66,17 @@ spec:
- command:
- /bin/kube-scheduler
- --config=/etc/kubernetes/scheduler-config.yaml
env:
- name: DATABASE_URL
value: postgres://postgres:postgres@postgres:5432/postgres
- name: PGHOST
value: postgres
- name: PGDATABASE
value: postgres
- name: PGPORT
value: "5432"
- name: PGPASSWORD
value: postgres
image: {{ .Values.scheduler.image }}
imagePullPolicy: {{ .Values.scheduler.pullPolicy }}
livenessProbe:
@@ -76,10 +87,13 @@
initialDelaySeconds: 15
name: scheduler
readinessProbe:
httpGet:
path: /healthz
port: 10259
scheme: HTTPS
exec:
command:
- "sh"
- "-c"
- >
status=$(curl -ks https://localhost:10259/healthz); if [ "$status" != "ok" ]; then exit 1; fi;
pg_isready -d postgres -h postgres -p 5432 -U postgres;
resources:
requests:
cpu: '0.1'
63 changes: 63 additions & 0 deletions chart/templates/postgres.yaml
@@ -0,0 +1,63 @@
# Note: This is intended for development/test deployments
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres
namespace: {{ .Release.Namespace }}
spec:
replicas: 1
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: {{ .Values.postgres.image }}
imagePullPolicy: {{ .Values.postgres.pullPolicy }}
ports:
- name: postgres-port
containerPort: 5432
env:
- name: POSTGRES_USER
value: postgres
- name: POSTGRES_PASSWORD
value: postgres
- name: POSTGRES_DB
value: postgres
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "2"
readinessProbe:
exec:
command:
- "sh"
- "-c"
- >
pg_isready -q -d postgres -U postgres;
runuser -l postgres -c '/usr/local/bin/river migrate-up --database-url postgres://localhost:5432/postgres > /tmp/post-start.log'
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
name: postgres
namespace: {{ .Release.Namespace }}
spec:
selector:
app: postgres
ports:
- protocol: TCP
port: 5432
targetPort: 5432
5 changes: 5 additions & 0 deletions chart/values.yaml
@@ -11,6 +11,10 @@ scheduler:
pullPolicy: Always
leaderElect: false

postgres:
image: ghcr.io/flux-framework/fluxnetes-postgres:latest
pullPolicy: Always

# The sidecar is explicitly the fluxion service. I'd like to
# simplify this to use fluxion as a service
sidecar:
@@ -38,6 +42,7 @@ controller:
# as they need extra RBAC privileges on metrics.k8s.io.

plugins:
# We keep this enabled for the custom queue sort
enabled: ["Fluxnetes"]
disabled: ["CapacityScheduling","NodeResourceTopologyMatch","NodeResourcesAllocatable","PrioritySort","Coscheduling"] # only in-tree plugins need to be defined here
# Disable EVERYTHING except for fluxnetes
28 changes: 27 additions & 1 deletion docs/README.md
@@ -2,13 +2,39 @@

## Design Notes

> July 29, 2024
Today we are merging in the "gut out and refactor" branch that does the following:

- Add Queue Manager (and queues design, shown and described below)
- Remove what remains of Fluence (now Fluxnetes is just a shell to provide sort)
- Replace the default scheduler (schedulingCycle) with this approach (we still use bindingCycle)

The new queue design is based on a producer-consumer model: there are workers (a number of our choosing), each associated with different queues. The workers themselves can do different things, depending on both the queue and the queuing strategy (I say this because two different strategies can share a common worker design). Before we hit a worker queue, we have a provisional queue step. This means that (a rough sketch of step 2 follows the list):

1. Incoming pods are added to a provisional table with their name, group, timestamp, and expected size.
2. Pods are moved from the provisional table to the worker queue when they reach quorum (the minimum size)
3. At this point, they go into the hands of a Queue Manager, ordered by their group timestamp.
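
To make step 2 concrete, here is a minimal sketch, assuming a pgx pool and a river client wired up as in the top-level README. The table and column names follow the `pods_provisional` query shown there; `GroupArgs` is a hypothetical job-args type, not this repository's code:

```go
// Hedged sketch: find groups that have reached quorum and enqueue one river
// job per group. GroupArgs is an assumed river JobArgs type.
type GroupArgs struct {
	GroupName string `json:"group_name"`
	GroupSize int32  `json:"group_size"`
}

func (GroupArgs) Kind() string { return "group" }

func enqueueReadyGroups(ctx context.Context, pool *pgxpool.Pool, queue *river.Client[pgx.Tx]) error {
	// Groups whose member count has reached the declared group size.
	rows, err := pool.Query(ctx, `
		SELECT group_name, group_size
		FROM pods_provisional
		GROUP BY group_name, group_size
		HAVING COUNT(*) >= group_size`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var name string
		var size int32
		if err := rows.Scan(&name, &size); err != nil {
			return err
		}
		if _, err := queue.Insert(ctx, GroupArgs{GroupName: name, GroupSize: size}, nil); err != nil {
			return err
		}
	}
	return rows.Err()
}
```

A real version would also remove (or mark) the moved rows and order groups by their timestamp, per the TODO list in the top-level README.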

For the first strategy that I've added, which I'm calling FCFS with backfill, the worker task does a call to fluxion, specifically a `MatchAllocate`. I am planning to change this to a `MatchAllocateElseReserve` so I can "snooze" the job to trigger again in the future, given that it cannot be scheduled then and there. When the work is allocated, the metadata for the job (specifically args for "Nodes") is updated to carry the nodes forward to events that are listening for them. A subscription event is sent back to the main scheduler, which receives the nodes and then performs binding. The pods are received as a group, meaning the binding of the group happens at the same time (in a loop, still one by one, but guaranteed to be in that order, I think) and the work is run (a rough sketch of such a worker follows the list below). Some (high-level) work still needs to be done:

- The provisional queue should be hardened up and provided (and exposed) as explicit interfaces (it is part of the main fluxnetes queue module now)
- A pod label for an expected time (and a default time) could be used so every job has an expected end time (for Fluxion). A cancel queue would handle this.
- The in-tree plugin outputs (needs for volumes, and what nodes can provide) need to be exposed to Fluxion. Fluxion could be told any of:
- "These nodes aren't possible for this work"
- "These are the only nodes you can consider for this work"
- "Here is a resource requirement you know about in your graph"

There are more features that still need to be worked on and added (see the README.md of this repository), but this is a good start! One thing I am tickled by is that this does not need to be Kubernetes-specific. It happens to be implemented within it, but the only detail that is relevant to Kubernetes is having a pod derive the underlying unit of work. All of the logic could be moved outside of it, with some other unit of work.

![images/fluxnetes.png](images/fluxnetes.png)

> July 10th, 2024
Fluxnetes is functioning, on par with what fluence does to schedule and cancel pods. The difference is that I removed the webhook and controller to create the PodGroup, and (for the time being) am relying on the user to create them. The reason is that I don't want to add the complexity of a new controller and webhook to Kubernetes. And instead of doing a custom CR (custom resource) for our PodGroup, I am using the one from coscheduling. This allows installing the module without breaking lower-level dependencies. I'm not sure why that works, but it does!
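
For reference, the PodGroup in question is the coscheduling one; the Dockerfile already pulls in `sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1`, so looking a group up from the plugin might look roughly like this sketch (illustrative only, not code from this commit):

```go
// Hedged sketch: fetch the user-created PodGroup by name with a
// controller-runtime client. Only the imported API types come from the
// modules added in the Dockerfile; the function itself is an assumption.
package fluxnetes

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	sched "sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

func getPodGroup(ctx context.Context, c client.Client, namespace, name string) (*sched.PodGroup, error) {
	pg := &sched.PodGroup{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, pg); err != nil {
		return nil, err
	}
	// pg.Spec.MinMember is the quorum used by the provisional queue step.
	return pg, nil
}
```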

So the current state is that Fluxnetes is scheduling! My next step is to slowly add components for the new design, ensuring I don't break anything as I go, and going as far with that approach as I can until I need to swap it in. Then I'll likely need to be a bit more destructive and careful.


> This was a group update on July 8th, 2024
An update on design thinking for what I'm calling "fluxnetes" - a next-step experiment for Kubernetes and Fluxion integration. Apologies in advance that this is long - I do a lot of thinking and have a desire to express it, because I don't think the design process (our thinking!) is always shared transparently. To start, there are two strategies to take:
Binary file added docs/images/fluxnetes-v1.png
Binary file modified docs/images/fluxnetes.png
9 changes: 0 additions & 9 deletions examples/job.yaml
@@ -1,12 +1,3 @@
# PodGroup CRD spec
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
name: job
spec:
scheduleTimeoutSeconds: 10
minMember: 1
---
apiVersion: batch/v1
kind: Job
metadata:
5 changes: 4 additions & 1 deletion hack/quick-build-kind.sh
@@ -16,12 +16,15 @@ make REGISTRY=${REGISTRY} SCHEDULER_IMAGE=fluxnetes SIDECAR_IMAGE=fluxnetes-side
# We load into kind so we don't need to push/pull and use up internet data ;)
kind load docker-image ${REGISTRY}/fluxnetes-sidecar:latest
kind load docker-image ${REGISTRY}/fluxnetes:latest
kind load docker-image ${REGISTRY}/fluxnetes-postgres:latest

# And then install using the charts. The pull policy ensures we use the loaded ones
helm uninstall fluxnetes || true
helm install \
--set postgres.image=${REGISTRY}/fluxnetes-postgres:latest \
--set scheduler.image=${REGISTRY}/fluxnetes:latest \
--set sidecar.image=${REGISTRY}/fluxnetes-sidecar:latest \
--set postgres.pullPolicy=Never \
--set scheduler.pullPolicy=Never \
--set sidecar.pullPolicy=Never \
--set sidecar.image=${REGISTRY}/fluxnetes-sidecar:latest \
fluxnetes chart/
