Merge pull request #5 from converged-computing/add-queue-and-gut-out
Add queue and gut out
vsoch authored Jul 29, 2024
2 parents 2c1a903 + 532fc0b commit 223854d
Showing 37 changed files with 1,557 additions and 1,612 deletions.
6 changes: 4 additions & 2 deletions Dockerfile
@@ -10,10 +10,10 @@ ENV ARCH=${ARCH}
# but since we are adding custom kube-scheduler, and we don't need the controller
# I moved the build logic up here instead of using hack/build-images.sh

RUN apt-get update && apt-get install -y wget git vim build-essential iputils-ping
RUN apt-get update && apt-get install -y wget git vim build-essential iputils-ping postgresql-client curl

# Install Go
ENV GO_VERSION=1.22.2
ENV GO_VERSION=1.22.5
RUN wget https://go.dev/dl/go${GO_VERSION}.linux-amd64.tar.gz && tar -xvf go${GO_VERSION}.linux-amd64.tar.gz && \
mv go /usr/local && rm go${GO_VERSION}.linux-amd64.tar.gz

@@ -28,6 +28,8 @@ COPY ${K8S_UPSTREAM} .
RUN go get github.com/patrickmn/go-cache && \
go get sigs.k8s.io/controller-runtime/pkg/client && \
go get sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1 && \
go get github.com/riverqueue/river && \
go get github.com/riverqueue/river/riverdriver/riverpgxv5 && \
go work vendor && \
make WHAT=cmd/kube-scheduler && \
cp /go/src/k8s.io/kubernetes/_output/local/go/bin/kube-scheduler /bin/kube-scheduler
8 changes: 6 additions & 2 deletions Makefile
@@ -15,11 +15,12 @@ ARCH ?= amd64
# These are passed to build the sidecar
REGISTRY ?= ghcr.io/flux-framework
SIDECAR_IMAGE ?= fluxnetes-sidecar:latest
POSTGRES_IMAGE ?= fluxnetes-postgres:latest
SCHEDULER_IMAGE ?= fluxnetes

.PHONY: all build build-sidecar clone update push push-sidecar push-fluxnetes
.PHONY: all build build-sidecar clone update push push-sidecar push-fluxnetes build-postgres

all: prepare build-sidecar build
all: prepare build-sidecar build build-postgres

upstreams:
mkdir -p $(UPSTREAMS)
@@ -48,4 +49,7 @@ push-fluxnetes:
build-sidecar:
make -C ./src LOCAL_REGISTRY=${REGISTRY} LOCAL_IMAGE=${SIDECAR_IMAGE}

build-postgres:
docker build -f src/build/postgres/Dockerfile -t ${REGISTRY}/${POSTGRES_IMAGE} .

push: push-sidecar push-fluxnetes
44 changes: 40 additions & 4 deletions README.md
@@ -2,7 +2,7 @@

![docs/images/fluxnetes.png](docs/images/fluxnetes.png)

Fluxnetes is a combination of Kubernetes and [Fluence](https://github.com/flux-framework/flux-k8s), both of which use the HPC-grade pod scheduling [Fluxion scheduler](https://github.com/flux-framework/flux-sched) to schedule pod groups to nodes.
Fluxnetes is a combination of Kubernetes and [Fluence](https://github.com/flux-framework/flux-k8s), both of which use the HPC-grade pod scheduling [Fluxion scheduler](https://github.com/flux-framework/flux-sched) to schedule pod groups to nodes. For our queue, we use [river](https://riverqueue.com/docs) backed by a Postgres database. The database is deployed alongside fluxnetes and could be customized to use an operator instead.
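
As a rough sketch (not code from this repository; the wiring below follows river's documented Go API and the `DATABASE_URL` value set on the scheduler deployment), the queue client is created against that Postgres database roughly like so:

```go
// Hedged sketch: connect river to the Postgres service deployed by the chart.
// Function and variable names here are illustrative, not this repository's code.
package queue

import (
	"context"
	"os"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

func newQueueClient(ctx context.Context, workers *river.Workers) (*river.Client[pgx.Tx], error) {
	// DATABASE_URL is set on the scheduler deployment (see chart/templates/deployment.yaml).
	pool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		return nil, err
	}
	return river.NewClient(riverpgxv5.New(pool), &river.Config{
		Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 10}},
		Workers: workers,
	})
}
```

Workers are registered on `workers` with `river.AddWorker` before the client is created; the number of workers per queue is a choice we can tune.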

**Important** This is an experiment, and is under development. I will change this design a million times - it's how I tend to learn and work. I'll share updates when there is something to share. It deploys but does not work yet!

@@ -37,14 +37,15 @@ Then you can deploy as follows:
```bash
./hack/quick-build-kind.sh
```
You'll then have the fluxnetes service running, along with the scheduler-plugins controller, which we
You'll then have the fluxnetes service running, a postgres database (for the job queue), along with the scheduler-plugins controller, which we
currently need in order to use PodGroup.

```bash
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
fluxnetes-66575b59d8-ghx8h 2/2 Running 0 8m53s
scheduler-plugins-controller-8676df7769-ss9kz 1/1 Running 0 8m53s
fluxnetes-6954cdcf64-gv7s7 2/2 Running 0 87s
postgres-c8d55999c-t6dtt 1/1 Running 0 87s
scheduler-plugins-controller-8676df7769-jvtwp 1/1 Running 0 87s
```

You can then create a job:
@@ -75,6 +76,41 @@ scheduler-plugins-controller-8676df7769-ss9kz 1/1 Running 0 10
And that's it! This is fully working, but that only means we can move on to the new design next.
See [docs](docs) for notes on that.

## Development

### Debugging Postgres

It is often helpful to shell into the postgres container to see the database directly:

```bash
kubectl exec -it postgres-597db46977-9lb25 -- bash
psql -U postgres

# Connect to database
\c

# list databases
\l

# show tables
\dt

# test a query
SELECT group_name, group_size from pods_provisional;
```

### TODO

- [ ] I'd like a more efficient query (or strategy) to move pods from provisional into the worker queue. Right now I have three queries, and that's too many.
- [ ] Discussion about how to respond to a "failed" allocation request (meaning we just can't give nodes now, likely to happen a lot). Maybe we need to do a reservation instead?
- [ ] I think maybe we should do a match allocate else reserve instead (see issue [here](https://github.com/converged-computing/fluxnetes/issues/4))
- [ ] Restarting with postgres shouldn't hit CrashLoopBackOff while the database isn't ready yet
- [ ] In-tree registry plugins (that are related to resources) should be run first to inform fluxion what nodes not to bind, where there are volumes, etc.
- [ ] The queue should inherit (and return) the start time (when the pod was first seen) as "start" in scheduler.go
- [ ] The provisional -> scheduled move should sort by timestamp (I mostly just forgot this)!
- [ ] When in a basic working state, add back the build and test workflows
- [ ] There should be a label (or existing value in the pod) to indicate an expected completion time (this is for Fluxion). We can have a worker task that explicitly cleans up the pods when the job should be completed.
- [x] remove fluence previous code

## License

15 changes: 6 additions & 9 deletions chart/templates/configmap.yaml
@@ -1,4 +1,3 @@
{{- if .Values.plugins.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
@@ -14,6 +13,11 @@ data:
# Compose all plugins in one profile
- schedulerName: {{ .Values.scheduler.name }}
plugins:
queueSort:
enabled:
{{- range $.Values.plugins.enabled }}
- name: {{ title . }}
{{- end }}
preBind:
disabled:
- name: {{ .Values.scheduler.name }}
@@ -48,17 +52,10 @@ data:
- name: {{ title . }}
{{- end }}
multiPoint:
enabled:
{{- range $.Values.plugins.enabled }}
- name: {{ title . }}
{{- end }}
disabled:
{{- range $.Values.plugins.disabled }}
- name: {{ title . }}
{{- end }}
{{- if $.Values.pluginConfig }}
pluginConfig: {{ toYaml $.Values.pluginConfig | nindent 6 }}
{{- end }}
{{- /* TODO: wire CRD installation with enabled plugins. */}}
{{- end }}
{{- end }}
22 changes: 18 additions & 4 deletions chart/templates/deployment.yaml
@@ -66,6 +66,17 @@ spec:
- command:
- /bin/kube-scheduler
- --config=/etc/kubernetes/scheduler-config.yaml
env:
- name: DATABASE_URL
value: postgres://postgres:postgres@postgres:5432/postgres
- name: PGHOST
value: postgres
- name: PGDATABASE
value: postgres
- name: PGPORT
value: "5432"
- name: PGPASSWORD
value: postgres
image: {{ .Values.scheduler.image }}
imagePullPolicy: {{ .Values.scheduler.pullPolicy }}
livenessProbe:
@@ -76,10 +87,13 @@
initialDelaySeconds: 15
name: scheduler
readinessProbe:
httpGet:
path: /healthz
port: 10259
scheme: HTTPS
exec:
command:
- "sh"
- "-c"
- >
status=$(curl -ks https://localhost:10259/healthz); if [ "$status" != "ok" ]; then exit 1; fi;
pg_isready -d postgres -h postgres -p 5432 -U postgres;
resources:
requests:
cpu: '0.1'
63 changes: 63 additions & 0 deletions chart/templates/postgres.yaml
@@ -0,0 +1,63 @@
# Note: This is intended for development/test deployments
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres
namespace: {{ .Release.Namespace }}
spec:
replicas: 1
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: {{ .Values.postgres.image }}
imagePullPolicy: {{ .Values.postgres.pullPolicy }}
ports:
- name: postgres-port
containerPort: 5432
env:
- name: POSTGRES_USER
value: postgres
- name: POSTGRES_PASSWORD
value: postgres
- name: POSTGRES_DB
value: postgres
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "2"
readinessProbe:
exec:
command:
- "sh"
- "-c"
- >
pg_isready -q -d postgres -U postgres;
runuser -l postgres -c '/usr/local/bin/river migrate-up --database-url postgres://localhost:5432/postgres > /tmp/post-start.log'
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
name: postgres
namespace: {{ .Release.Namespace }}
spec:
selector:
app: postgres
ports:
- protocol: TCP
port: 5432
targetPort: 5432
5 changes: 5 additions & 0 deletions chart/values.yaml
@@ -11,6 +11,10 @@ scheduler:
pullPolicy: Always
leaderElect: false

postgres:
image: ghcr.io/flux-framework/fluxnetes-postgres:latest
pullPolicy: Always

# The sidecar is explicitly the fluxion service. I'd like to
# simplify this to use fluxion as a service
sidecar:
@@ -38,6 +42,7 @@ controller:
# as they need extra RBAC privileges on metrics.k8s.io.

plugins:
# We keep this enabled for the custom queue sort
enabled: ["Fluxnetes"]
disabled: ["CapacityScheduling","NodeResourceTopologyMatch","NodeResourcesAllocatable","PrioritySort","Coscheduling"] # only in-tree plugins need to be defined here
# Disable EVERYTHING except for fluxnetes
28 changes: 27 additions & 1 deletion docs/README.md
@@ -2,13 +2,39 @@

## Design Notes

> July 29, 2024
Today we are merging in the "gut out and refactor" branch that does the following:

- Add Queue Manager (and queues design, shown and described below)
- Remove what remains of Fluence (now Fluxnetes is just a shell to provide sort)
- Replace the default scheduler (schedulingCycle) with this approach (we still use bindingCycle)

The new queue design is based on a producer-consumer model: there are workers (a number of our choosing), each associated with different queues. The workers themselves can do different things, depending on both the queue and the queuing strategy (I say this because two different strategies can share a common worker design). Before we hit a worker queue, we have a provisional queue step. This means that (a rough sketch of step 2 follows the list):

1. Incoming pods are added to a provisional table with their name, group, timestamp, and expected size.
2. Pods are moved from the provisional table to the worker queue when they reach quorum (the minimum size)
3. At this point, they go into the hands of a Queue Manager, ordered by their group timestamp.
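
To make step 2 concrete, here is a minimal sketch, assuming a pgx pool and a river client wired up as in the top-level README. The table and column names follow the `pods_provisional` query shown there; `GroupArgs` is a hypothetical job-args type, not this repository's code:

```go
// Hedged sketch: find groups that have reached quorum and enqueue one river
// job per group. GroupArgs is an assumed river JobArgs type.
type GroupArgs struct {
	GroupName string `json:"group_name"`
	GroupSize int32  `json:"group_size"`
}

func (GroupArgs) Kind() string { return "group" }

func enqueueReadyGroups(ctx context.Context, pool *pgxpool.Pool, queue *river.Client[pgx.Tx]) error {
	// Groups whose member count has reached the declared group size.
	rows, err := pool.Query(ctx, `
		SELECT group_name, group_size
		FROM pods_provisional
		GROUP BY group_name, group_size
		HAVING COUNT(*) >= group_size`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var name string
		var size int32
		if err := rows.Scan(&name, &size); err != nil {
			return err
		}
		if _, err := queue.Insert(ctx, GroupArgs{GroupName: name, GroupSize: size}, nil); err != nil {
			return err
		}
	}
	return rows.Err()
}
```

A real version would also remove (or mark) the moved rows and order groups by their timestamp, per the TODO list in the top-level README.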

For the first strategy that I've added, which I'm calling FCFS with backfill, the worker task does a call to fluxion, specifically a `MatchAllocate`. I am planning to change this to a `MatchAllocateElseReserve` so I can "snooze" the job to trigger again in the future, given that it cannot be scheduled then and there. When the work is allocated, the metadata for the job (specifically args for "Nodes") is updated to carry the nodes forward to events that are listening for them. A subscription event is sent back to the main scheduler, which receives the nodes and then performs binding. The pods are received as a group, meaning the binding of the group happens at the same time (in a loop, still one by one, but guaranteed to be in that order, I think) and the work is run (a rough sketch of such a worker follows the list below). Some (high-level) work still needs to be done:

- The provisional queue should be hardened up and provided (and exposed) as explicit interfaces (it is part of the main fluxnetes queue module now)
- A pod label for an expected time (and a default time) could be used so every job has an expected end time (for Fluxion). A cancel queue would handle this.
- The in-tree plugin outputs (needs for volumes, and what nodes can provide) need to be exposed to Fluxion. Fluxion could be told any of:
- "These nodes aren't possible for this work"
- "These are the only nodes you can consider for this work"
- "Here is a resource requirement you know about in your graph"

There are more features that still need to be worked on and added (see the README.md of this repository), but this is a good start! One thing I am tickled by is that this does not need to be Kubernetes-specific. It happens to be implemented within it, but the only detail that is relevant to Kubernetes is having a pod derive the underlying unit of work. All of the logic could be moved outside of it, with some other unit of work.

![images/fluxnetes.png](images/fluxnetes.png)

> July 10th, 2024
Fluxnetes is functioning, on par with what fluence does to schedule and cancel pods. The difference is that I removed the webhook and controller to create the PodGroup, and (for the time being) am relying on the user to create them. The reason is that I don't want to add the complexity of a new controller and webhook to Kubernetes. And instead of doing a custom CR (custom resource) for our PodGroup, I am using the one from coscheduling. This allows installing the module without breaking lower-level dependencies. I'm not sure why that works, but it does!
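
For reference, the PodGroup in question is the coscheduling one; the Dockerfile already pulls in `sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1`, so looking a group up from the plugin might look roughly like this sketch (illustrative only, not code from this commit):

```go
// Hedged sketch: fetch the user-created PodGroup by name with a
// controller-runtime client. Only the imported API types come from the
// modules added in the Dockerfile; the function itself is an assumption.
package fluxnetes

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	sched "sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

func getPodGroup(ctx context.Context, c client.Client, namespace, name string) (*sched.PodGroup, error) {
	pg := &sched.PodGroup{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, pg); err != nil {
		return nil, err
	}
	// pg.Spec.MinMember is the quorum used by the provisional queue step.
	return pg, nil
}
```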

So the current state is that Fluxnetes is scheduling! My next step is to slowly add components for the new design, ensuring I don't break anything as I go, and going as far with that approach as I can until I need to swap it in. Then I'll likely need to be a bit more destructive and careful.


> This was a group update on July 8th, 2024
An update on design thinking for what I'm calling "fluxnetes" - a next-step experiment for Kubernetes and Fluxion integration. Apologies in advance that this is long - I do a lot of thinking and have a desire to express it, because I don't think the design process (our thinking!) is always shared transparently. To start, there are two strategies to take:
Binary file added docs/images/fluxnetes-v1.png
Binary file modified docs/images/fluxnetes.png
9 changes: 0 additions & 9 deletions examples/job.yaml
@@ -1,12 +1,3 @@
# PodGroup CRD spec
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
name: job
spec:
scheduleTimeoutSeconds: 10
minMember: 1
---
apiVersion: batch/v1
kind: Job
metadata:
5 changes: 4 additions & 1 deletion hack/quick-build-kind.sh
@@ -16,12 +16,15 @@ make REGISTRY=${REGISTRY} SCHEDULER_IMAGE=fluxnetes SIDECAR_IMAGE=fluxnetes-side
# We load into kind so we don't need to push/pull and use up internet data ;)
kind load docker-image ${REGISTRY}/fluxnetes-sidecar:latest
kind load docker-image ${REGISTRY}/fluxnetes:latest
kind load docker-image ${REGISTRY}/fluxnetes-postgres:latest

# And then install using the charts. The pull policy ensures we use the loaded ones
helm uninstall fluxnetes || true
helm install \
--set postgres.image=${REGISTRY}/fluxnetes-postgres:latest \
--set scheduler.image=${REGISTRY}/fluxnetes:latest \
--set sidecar.image=${REGISTRY}/fluxnetes-sidecar:latest \
--set postgres.pullPolicy=Never \
--set scheduler.pullPolicy=Never \
--set sidecar.pullPolicy=Never \
--set sidecar.image=${REGISTRY}/fluxnetes-sidecar:latest \
fluxnetes chart/
