new blog post: Custom K8s Scheduler for Highly Available Apps #24136
Conversation
Welcome @jess-edwards!
✔️ Deploy preview for kubernetes-io-master-staging ready! 🔨 Explore the source changes: 3b6584c 🔍 Inspect the deploy logs: https://app.netlify.com/sites/kubernetes-io-master-staging/deploys/5fdbb1441813570007a6b417 😎 Browse the preview: https://deploy-preview-24136--kubernetes-io-master-staging.netlify.app
Hi @jess-edwards and @chrisseto
Thanks for this interesting and comprehensive article!
At the moment there are a few things that look different from how we usually add content to this website, and I'm pretty sure that the blog team will need you to fix at least some of them. The most obvious one is that links within the site should use URIs relative to https://kubernetes.io/ (and not absolute URIs).
We'll also need confirmation from you as co-author @chrisseto that you have signed the Kubernetes contributor license agreement.
Many other details are less key; my hope is that you'll find all of the suggestions I've made useful.
If you have any questions about the feedback here (there's quite a lot of it, and I hope GitHub isn't about to show me a pink unicorn) then please feel free to reply. You can also find me as @sftim on Kubernetes' Slack, in #sig-docs.
# A Custom Kubernetes Scheduler to Orchestrate Highly Available Applications
I would expect to see an author byline here. As the article makes several references to CockroachDB, it will help readers if they see an attribution to CockroachDB - that puts the mentions in context.
Most stateless systems, web servers for example, are created without the need to be aware of peers. Stateful systems, which include databases like CockroachDB, have to coordinate with their peer instances and shuffle around data. As luck would have it, CockroachDB handles data redistribution and replication. The tricky part is being able to tolerate failures during these operations by ensuring that data and instances are distributed across many failure domains (availability zones).
One of Kubernetes' responsibilities is to place "resources" (e.g, a disk or container) into the cluster and satisfy the constraints they request (e.g, "I must be in availability A" [docs](https://kubernetes.io/docs/setup/best-practices/multiple-zones/#nodes-are-labeled)} or "I can't be on the same machine as this container" [docs](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-isolation-restriction)). As an addition to these constraints Kubernetes' offers [Statefulsets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) which provide identity to pods and persistent disks that "follow" these identified pods. Identity is handled by an increasing integer at the end of a pod's name. It's important to note that this integer must always be contiguous, if pods 1 and 3 exist then pod 2 must also exist.
To follow style guide recommendations, including the strong suggestion to avoid absolute URLs for same-site hyperlinks, I suggest:
One of Kubernetes' responsibilities is to place "resources" (e.g, a disk or container) into the cluster and satisfy the constraints they request. For example: "I must be in availability zone _A_" (see [Running in multiple zones](/docs/setup/best-practices/multiple-zones/#nodes-are-labeled)), or "I can't be placed onto the same node as this other Pod" (see [Affinity and anti-affinity](/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity)).
As an addition to those constraints, Kubernetes offers [Statefulsets](/docs/concepts/workloads/controllers/statefulset/) that provide identity to Pods as well as persistent storage that "follows" these identified pods. Identity in a StatefulSet is handled by an increasing integer at the end of a pod's name. It's important to note that this integer must always be contiguous: in a StatefulSet, if pods 1 and 3 exist then pod 2 must also exist.
I've tweaked the wording too, aiming to help with readability and to afford easier localization.
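To make the contiguous-ordinal point concrete, here is a minimal Go sketch (purely illustrative; the helper name and the example StatefulSet name are hypothetical, not from the post) of the pod names a StatefulSet is expected to keep:

```go
package main

import "fmt"

// expectedPodNames returns the contiguous pod names a StatefulSet named `name`
// with `replicas` replicas is expected to have: name-0 .. name-(replicas-1).
// Hypothetical helper, for illustration only.
func expectedPodNames(name string, replicas int) []string {
	pods := make([]string, 0, replicas)
	for ordinal := 0; ordinal < replicas; ordinal++ {
		pods = append(pods, fmt.Sprintf("%s-%d", name, ordinal))
	}
	return pods
}

func main() {
	// A 5-replica StatefulSet called "cockroachdb" must have exactly these pods;
	// there can be no gap such as cockroachdb-1 and cockroachdb-3 existing without cockroachdb-2.
	fmt.Println(expectedPodNames("cockroachdb", 5))
	// Output: [cockroachdb-0 cockroachdb-1 cockroachdb-2 cockroachdb-3 cockroachdb-4]
}
```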
Under the hood, CockroachCloud deploys each region of CockroachDB as a Statefulset in its own Kubernetes cluster [docs](https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes.html). We'll be looking at an individual region, one Statefulset and one Kubernetes cluster which is distributed across at least three availability zones. A three-node CockroachCloud cluster would look something like this: |
Suggested change:
Under the hood, CockroachCloud deploys each region of CockroachDB as a StatefulSet in its own Kubernetes cluster - see [Orchestrate CockroachDB in a Single Kubernetes Cluster](https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes.html). In this article, I'll be looking at an individual region, one StatefulSet and one Kubernetes cluster which is distributed across at least three availability zones.
A three-node CockroachCloud cluster would look something like this:
BTW, the main documentation addresses the reader as “you” and avoids using “we”.
![illustration of phases: adding Kubernetes nodes to the multi-zone cockroachdb cluster](image02.png)
Note that anti-affinites are satisfied no matter the order in which pods are assigned to Kubernetes nodes. In the example, pods 0, 1 and 2 were assigned to zones A, B, and C respectively, but pods 3 and 4 were assigned in a different order, to zones B and A respectively. The anti-affinity is still satisfied because the pods are still placed in different zones. |
Suggested change:
Note that anti-affinities are satisfied no matter the order in which pods are assigned to Kubernetes nodes. In the example, pods 0, 1 and 2 were assigned to zones A, B, and C respectively, but pods 3 and 4 were assigned in a different order, to zones B and A respectively. The anti-affinity is still satisfied because the pods are still placed in different zones.
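One way to read the example above is to simply count pods per zone; this small Go sketch (hypothetical, using the pod-to-zone assignment from the example) shows the spread stays even regardless of assignment order:

```go
package main

import "fmt"

func main() {
	// Pod ordinal -> zone assignment from the example:
	// pods 0, 1, 2 -> A, B, C; pods 3, 4 -> B, A.
	assignment := map[int]string{0: "A", 1: "B", 2: "C", 3: "B", 4: "A"}

	// Count pods per zone. No zone differs from another by more than one pod,
	// so the zone-level anti-affinity is satisfied despite the out-of-order assignment.
	perZone := map[string]int{}
	for _, zone := range assignment {
		perZone[zone]++
	}
	fmt.Println(perZone) // map[A:2 B:2 C:1]
}
```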
## A session of brainstorming left us with 3 options:
### 1. Upgrade to kubernetes 1.18 and make use of [Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/). |
Hyperlinks in headings interact poorly with automatic links to headings, so I suggest:
### 1. Upgrade to kubernetes 1.18 and make use of Pod Topology Spread Constraints
As long as you're willing to follow the rules, deploying on Kubernetes and air travel can be quite pleasant. More often than not, things will "just work". However, if one is interested in travelling with an alligator that must remain alive or scaling a database that must remain available, the situation is likely to become a bit more complicated. It may even be easier to build one's own plane or database for that matter. Travelling with reptiles aside, scaling a highly available stateful system is no trivial task.
Scaling any system has two main components:
1. Adding or removing infrastructure that the system will run on and |
Suggested change:
1. Adding or removing infrastructure that the system will run on, and
![illustration of phases: scaling down pods in a multi-zone cockroachdb cluster in Kubernetes](image03.png)
Now, remember that pods in a StatefulSet of size N must have ids in the range [0,N). When scaling down a StatefulSet by M, Kubernetes removes the M pods with the M highest with ordinals from highest to lowest, [the reverse in which they were added](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#deployment-and-scaling-guarantees). Consider the cluster topology below: |
I'd write:
Now, remember that pods in a StatefulSet of size _n_ must have ids in the range `[0,n)`. When scaling down a StatefulSet by _m_, Kubernetes removes _m_ pods, starting from the highest ordinals and moving towards the lowest, [the reverse of the order in which they were added](/docs/concepts/workloads/controllers/statefulset/#deployment-and-scaling-guarantees).
Consider the cluster topology below:
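As a rough illustration of that guarantee, here is a minimal Go sketch (the function name is hypothetical, not from the post) of which ordinals get removed on a scale-down:

```go
package main

import "fmt"

// ordinalsRemovedOnScaleDown returns the pod ordinals Kubernetes deletes when a
// StatefulSet is scaled from n replicas down to n-m, highest ordinal first.
// Hypothetical helper, for illustration only.
func ordinalsRemovedOnScaleDown(n, m int) []int {
	removed := make([]int, 0, m)
	for ordinal := n - 1; ordinal >= n-m; ordinal-- {
		removed = append(removed, ordinal)
	}
	return removed
}

func main() {
	// Scaling a 5-replica StatefulSet down by 2 removes pods 4 and 3,
	// the reverse of the order in which they were added.
	fmt.Println(ordinalsRemovedOnScaleDown(5, 2)) // [4 3]
}
```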
However, Kubernetes' scheduler doesn't guarantee the placement above as we expected at first.
> Pods in a replication controller or service are automatically spread across zones. [docs](https://kubernetes.io/docs/setup/best-practices/multiple-zones/#pods-are-spread-across-zones) |
Suggested change:
Kubernetes can [automatically spread Pods across zones](/docs/setup/best-practices/multiple-zones/#pods-are-spread-across-zones).
(I wouldn't mention ReplicationController; it's deprecated - I also have an open PR to drop mentioning it in the multiple zones page).
> For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}. [docs](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#deployment-and-scaling-guarantees) |
Suggested change:
For a StatefulSet with _n_ replicas, when Pods are being deployed, they are created sequentially, in order from `{0..n-1}`. See [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#deployment-and-scaling-guarantees) for more details.
Worse yet, our automation, at the time, would remove Nodes A-2, B-2, and C-2, leaving CRDB-1 in an unscheduled state as persistent volumes are only available in the zone they are initially created in.
To correct the latter issue, we now employ a "hunt and peck" approach to removing machines from a cluster. Rather than blindly removing kubernetes nodes from the cluster, only nodes without a CockroachDB pod would be removed. The much more daunting task was to wrangle the kubernetes scheduler. |
Suggested change:
To correct the latter issue, we now employ a "hunt and peck" approach to removing machines from a cluster. Rather than blindly removing Kubernetes nodes from the cluster, only nodes without a CockroachDB pod would be removed. The much more daunting task was to wrangle the Kubernetes scheduler.
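A minimal sketch of that "hunt and peck" selection, assuming the automation can see which pods run on each node (the types and names here are hypothetical, not the actual CockroachCloud automation):

```go
package main

import (
	"fmt"
	"strings"
)

// removableNodes returns only the Kubernetes nodes that are not running a
// CockroachDB pod, so the automation never deletes a node out from under the
// database. Hypothetical sketch for illustration only.
func removableNodes(podsByNode map[string][]string) []string {
	var removable []string
	for node, pods := range podsByNode {
		hostsCRDB := false
		for _, pod := range pods {
			if strings.HasPrefix(pod, "cockroachdb-") {
				hostsCRDB = true
				break
			}
		}
		if !hostsCRDB {
			removable = append(removable, node)
		}
	}
	return removable
}

func main() {
	podsByNode := map[string][]string{
		"a-1": {"cockroachdb-0"},
		"a-2": {"some-sidecar"},
		"b-1": {"cockroachdb-1"},
	}
	fmt.Println(removableNodes(podsByNode)) // [a-2]
}
```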
Two more things:
I signed the CLA a year or so ago, just need to update my affiliations. I'll work with @jess-edwards to fix the directory structure. Thanks for all the feedback @sftim!
@sftim I've incorporated all your feedback and squashed everything down to a single commit. Thank you again for the review! Everything seems to build fine locally, would you mind giving it a second pass? I imagine we'll need to update the date as well?
Assigning @sftim since he is heavily engaged on this. /cc @kbarnard10
This is a blog article - I'd rather leave this for others to take forward (including, but not only, blog team people).
Hey team - any news on this post? Anything else needed on our end?
Reaching out to the blog team,
@kbarnard10 Any word on when this might go out?
@sftim Could you help us get eyes on the blog post? You were so quick with the review (thank you!) and we've been stuck for over a month now.
Hello. I'll reread through the changes. The blog team needs to approve this addition to the Kubernetes blog.
(https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/)
At its core, this entire issue was a misunderstanding with the guarantees that kube-scheduler provided us with. Why not provide ourselves with the original guarantee that we were looking for? |
nit: Line 85: could you remove these two sentences, or reword to state that you decided to implement a custom scheduler for this unique case?
I removed the sentences and the dangling link above them! I tried to reword it but I think the following sentence is all that was needed.
Thanks to an example from [Kelsey Hightower](https://github.com/kelseyhightower/scheduler) and a blog post from [Banzai Cloud](https://banzaicloud.com/blog/k8s-custom-scheduler/), we decided to dive in head first and write our own [custom Kubernetes scheduler](/docs/tasks/extend-kubernetes/configure-multiple-schedulers/). Once our proof-of-concept was deployed and running, we quickly discovered that the Kubernetes scheduler is also responsible for mapping persistent volumes to the Pods that it schedules, which wasn't clear to us from the [documentation](/docs/concepts/scheduling-eviction/) about scheduling, nor evident from running [`kubectl get events`](/docs/tasks/extend-kubernetes/configure-multiple-schedulers/#verifying-that-the-pods-were-scheduled-using-the-desired-schedulers). |
nit: I am not sure if this phrase adds much. Could you reword or update the section:
which wasn't clear to us from the [documentation](/docs/concepts/scheduling-eviction/) about scheduling, nor evident from running [`kubectl get events`](/docs/tasks/extend-kubernetes/configure-multiple-schedulers/#verifying-that-the-pods-were-scheduled-using-the-desired-schedulers)
Agreed! I've reworded it to showcase that our confusion is what caused us to stumble upon the plugin system.
In our journey to find the component responsible for storage claim mapping, we discovered the [kube-scheduler plugin system](/docs/concepts/scheduling-eviction/scheduling-framework/). Our next POC was a `Filter` plugin that determined the appropriate availability zone by pod ordinal, and it worked flawlessly!
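The heart of that `Filter` idea can be sketched in a few lines of Go: derive the pod's ordinal from its name and map it onto the zones round-robin (a hypothetical, simplified helper; the real plugin, linked below, also has to line up with how the persistent volumes were provisioned):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// zoneForPod maps a StatefulSet pod to an availability zone by its ordinal,
// round-robin across the zones. Hypothetical sketch of the Filter idea only.
func zoneForPod(podName string, zones []string) (string, error) {
	idx := strings.LastIndex(podName, "-")
	if idx < 0 {
		return "", fmt.Errorf("pod %q has no ordinal suffix", podName)
	}
	ordinal, err := strconv.Atoi(podName[idx+1:])
	if err != nil {
		return "", fmt.Errorf("pod %q has no ordinal suffix: %v", podName, err)
	}
	return zones[ordinal%len(zones)], nil
}

func main() {
	zones := []string{"us-east1-b", "us-east1-c", "us-east1-d"}
	for _, pod := range []string{"cockroachdb-0", "cockroachdb-1", "cockroachdb-2", "cockroachdb-3"} {
		zone, _ := zoneForPod(pod, zones)
		// A Filter plugin would reject any node whose zone label doesn't match this zone.
		fmt.Println(pod, "->", zone)
	}
}
```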
We open sourced our [custom scheduler plugin](https://github.com/cockroachlabs/crl-scheduler), and it is now running in all of our CockroachCloud clusters! |
nit: Could reword:
Our custom scheduler plugin is open source and runs in all of our CockroachCloud clusters.
Done, thanks for the suggestion!
Having control over how our StatefulSet pods are being scheduled has let us scale out with confidence.
We may look into retiring our plugin once pod topology spread constraints are available in GKE and EKS, but the maintenance overhead has been surprisingly low.
Better still: integration into our existing deployments and codebase was minor modification to the StatefulSet definition. |
nit: reword, not sure about this sentence
... integration into our existing deployments and codebase required minor modification to the StatefulSet definition?
Updated to:
Better still: the plugin's implementation is orthogonal to our business logic. Deploying it, or retiring it for that matter, is as simple as changing the `schedulerName` field in our StatefulSet definitions.
### 1. Upgrade to kubernetes 1.18 and make use of Pod Topology Spread Constraints
While this seems like it could have been the perfect solution, at the time of writing Kubernetes 1.18 was unavailable on the two most common managed Kubernetes services in public cloud, EKS and GKE. Furthermore, [pod topology spread constraints](/docs/concepts/workloads/pods/pod-topology-spread-constraints/) were still a beta feature which meant that it wasn't guaranteed to be available in managed clusters even when v1.18 became available. |
nit: Is this still true?
nit: What is the state of pod topology spread constraints in v1.19? The v1.20 release should occur before the end of the year.
The post is largely about our decision making at that point in time; is it worth mentioning the state of pod topology spread constraints in v1.19+?
I've added a link to the documentation for 1.18 which may make this a bit more clear?
Hi @kbarnard10 @onlydole @mrbobbytables @parispittman -- Thanks for all your help with getting @chrisseto's blog post live. Could we get an estimate of when it will be published? If there's no room for it in your calendar, please let us know, and we can find another home for it. Thank you again,
I'm going to follow up on some actions to see if we can get more people reviewing and approving blog posts, including this one. Sorry about the hold up.
@sftim thank you thank you!
Hi @sftim -- cc @chrisseto
We need to run through the k8s release blogs, but we should be able to after that.
@mrbobbytables Thank you! Can you get me an approximate date of publish? I've got it slated to run on our blog on Jan. 7 but I can hold off depending on when the k8s release blogs are planned. My biggest concern is that the longer @chrisseto's post sits in purgatory (I opened this PR in September), the more likely K8s will release something that makes it irrelevant or requires him to rewrite significant chunks of it.
@jess-edwards I'm going to get this scheduled for Monday, December 21st. Can you confirm that works for you? /lgtm
LGTM label has been added. Git tree hash: 4b79591834e31617f4abc873453c6e571560be78
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kbarnard10. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@kbarnard10 Dec. 21 would be great. Thank you so much!
This didn't have its hold released yesterday, going to go ahead and do that now 👍
Hi K8s team,
This PR is to add a new blog post: "A Custom Kubernetes Scheduler for Highly Available Apps" by Chris Seto (@chrisseto) at Cockroach Labs.
The PR includes 7 png images, which 🙏 have been included correctly. All have alt titles for accessibility.
I goofed on titling the subdirectory, so it truncated the folder name to 2020-09-25-custom-k. If you can help to rename the folder, or direct me on how to rename the folder, I'd be grateful. Everything I've found on Stack is out of date.
Cheers,
Jessica
cc @charlotte-dillon to follow progress