From 9b6913ce1fe67f8f2ec22aedf9d38177549aba83 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Johannes=20W=C3=BCrbach?= Date: Sat, 26 Sep 2020 21:33:11 +0200 Subject: [PATCH] Initial KEP-2021 draft --- keps/prod-readiness/sig-autoscaling/2021.yaml | 3 + .../2021-scale-from-zero/README.md | 851 ++++++++++++++++++ .../2021-scale-from-zero/kep.yaml | 41 + 3 files changed, 895 insertions(+) create mode 100644 keps/prod-readiness/sig-autoscaling/2021.yaml create mode 100644 keps/sig-autoscaling/2021-scale-from-zero/README.md create mode 100644 keps/sig-autoscaling/2021-scale-from-zero/kep.yaml diff --git a/keps/prod-readiness/sig-autoscaling/2021.yaml b/keps/prod-readiness/sig-autoscaling/2021.yaml new file mode 100644 index 00000000000..beff2f0ae9c --- /dev/null +++ b/keps/prod-readiness/sig-autoscaling/2021.yaml @@ -0,0 +1,3 @@ +kep-number: 2021 +beta: + approver: "@johnbelamaric" diff --git a/keps/sig-autoscaling/2021-scale-from-zero/README.md b/keps/sig-autoscaling/2021-scale-from-zero/README.md new file mode 100644 index 00000000000..070d55281d4 --- /dev/null +++ b/keps/sig-autoscaling/2021-scale-from-zero/README.md @@ -0,0 +1,851 @@ + +# KEP-2021: HPA supports scaling to/from zero pods for object/external metrics + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1: Scale a heavy queue consumer on-demand](#story-1-scale-a-heavy-queue-consumer-on-demand) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Graduation Criteria](#graduation-criteria) + - [Alpha -> Beta Graduation](#alpha---beta-graduation) + - [Beta -> GA Graduation](#beta---ga-graduation) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - 
[Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to
[kubernetes.io] +- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +[Horizontal Pod Autoscaler][] (HPA) automatically scales the number of pods in any resource which supports the `scale` subresource based on observed CPU utilization +(or, with custom metrics support, on some other application-provided metrics) from one to many replicas. This proposal adds support for scaling from zero to many replicas and back to zero for object and external metrics. + +[Horizontal Pod Autoscaler]: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ + + + +## Motivation + + + +With the addition of scaling based on object and external metrics, it became possible to automatically adjust the number of running replicas based on an application +provided metric. A typical use case for this is scaling the number of queue consumers based on the length of the consumed queue. + +In the case of a frequently idle queue or a less latency-sensitive workload, there is no need to run one replica at all times; instead you want to dynamically scale +to zero replicas, especially if those replicas have high resource requests. If replicas are scaled to 0, HPA also needs the ability to scale up once messages are available. + +### Goals + + + +* Provide scaling to zero replicas for object and external metrics +* Provide scaling from zero replicas for object and external metrics + +### Non-Goals + + + +* Provide scaling to/from zero replicas for resource metrics +* Provide request buffering at the Kubernetes Service level + +## Proposal + + + +Allow the HPA to scale from and to zero using `minReplicas: 0` when explicitly enabled with a flag.
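To make the proposal concrete, the following is an illustrative sketch of what an HPA object could look like with this feature enabled. It is a sketch under this KEP's assumptions: `behavior.enableScaleToZero` is the new field proposed here and is not part of the released `autoscaling/v2beta2` API, and the workload name and metric are hypothetical.

```yaml
# Illustrative sketch only: behavior.enableScaleToZero is the field proposed
# by this KEP and does not exist in the released autoscaling/v2beta2 API.
# The Deployment name and external metric are hypothetical examples.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: video-processor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: video-processor
  minReplicas: 0          # valid only while enableScaleToZero is true
  maxReplicas: 20
  behavior:
    enableScaleToZero: true
  metrics:
    - type: External
      external:
        metric:
          name: queue_messages_ready
        target:
          type: AverageValue
          averageValue: "10"
```

With zero replicas running, the HPA would scale the Deployment back up as soon as the external metric reports pending work.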
+ +### User Stories (Optional) + + + +#### Story 1: Scale a heavy queue consumer on-demand + +As the operator of a video processing pipeline, I would like to reduce costs. While video processing is CPU intensive, it is not a latency-sensitive workload. Therefore I want +my video processing workers to only be created if there is actually a video to be processed and to be terminated afterwards. + + +### Notes/Constraints/Caveats (Optional) + + + +Currently, disabling HPA is possible by manually setting the scaled resource to `replicas: 0`. This works because the HPA could never reach this state itself. +As `replicas: 0` is now a possible state when using `minReplicas: 0`, it can no longer be used to differentiate between manually disabled and automatically scaled to zero. + +Additionally, the `replicas: 0` state is problematic as updating an HPA object's `minReplicas` from `0` to `1` has different behavior. If `replicas` was `0` during the update, HPA +will be disabled for the resource; if it was `> 0`, HPA will continue with the new `minReplicas` value. + +To resolve this issue, the KEP introduces an explicit `enableScaleToZero` property to enable/disable scaling from/to zero. + +### Risks and Mitigations + + + +From a UX perspective, the two-stage opt-out / opt-in for scale to zero might feel a bit tedious, but the only other available option seems to be deprecating the implicit HPA pause on `replicas: 0`. While this might provide an improved +UX, it would require a full deprecation cycle (12 months) before graduating this feature from alpha to beta. + +## Design Details + + + +We would add `EnableScaleToZero *bool` to the HPA `spec.behavior`.
+ +```golang +type HorizontalPodAutoscalerBehavior struct { + ScaleUp *HPAScalingRules + ScaleDown *HPAScalingRules + EnableScaleToZero *bool +} + +type HorizontalPodAutoscalerSpec struct { + ScaleTargetRef CrossVersionObjectReference + MinReplicas *int32 + MaxReplicas int32 + Metrics []MetricSpec + Behavior *HorizontalPodAutoscalerBehavior +} +``` + +`EnableScaleToZero` controls whether `MinReplicas` can be set to `>=0` (`true`, new behavior) or `>=1` (`false`, current behavior). The default will be `false` to preserve the current behavior. + +If `EnableScaleToZero` has been enabled, it can only be disabled when the scaled resource has at least one replica +running and `MinReplicas` is `>=1`. + +### Test Plan + + + +Most logic related to this KEP is contained in the HPA controller, so testing +the various `minReplicas`, `replicas` and `enableScaleToZero` combinations should be achievable with unit tests. + +Additionally, integration tests should be added for scale to zero: one test setting + `enableScaleToZero: true` and `minReplicas: 0`, then waiting for `replicas` to become `0`; and another test increasing `minReplicas` to `1`, observing that `replicas` becomes `1` again, and setting `enableScaleToZero: false` afterwards. + +### Graduation Criteria + + + +#### Alpha -> Beta Graduation + +- Implement the `enableScaleToZero` property +- Ensure that all `minReplicas` state transitions from `0` to `1` are working as expected + +#### Beta -> GA Graduation + +- Allowing time for feedback +- E2E tests are passing without flakiness + +### Upgrade / Downgrade Strategy + + + +As this KEP changes the allowed values for `minReplicas`, special care is required in the downgrade case to not prevent any kind of updates for HPA objects using `minReplicas: 0`. As the alpha code has already accepted `minReplicas: 0` with the flag enabled or disabled since Kubernetes 1.16, downgrades to any version >= 1.16 aren't an issue.
+ +The new flag `enableScaleToZero` defaults to `false`, which has been the previous behavior. The flag should be disabled before downgrading, as otherwise the +HPA for deployments with zero replicas will be disabled until replicas have been +raised explicitly to at least `1`. + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + + - [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: `HPAScaleToZero` + - Components depending on the feature gate: `kube-apiserver` + - [x] Other + - Describe the mechanism: + + When the HPAScaleToZero feature gate is enabled, HPA supports scaling to zero pods based on object or external metrics. HPA remains active as long as at least one metric value is available. + + - Will enabling / disabling the feature require downtime of the control + plane? + + No + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). + + No + +###### Does enabling the feature change any default behavior? + + + + Any change of default behavior may be surprising to users or break existing + automations, so be extremely careful here. + + HPA creation/update with `minReplicas: 0` is no longer rejected if the `enableScaleToZero` field is set to `true`. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + + Also set `disable-supported` to `true` or `false` in `kep.yaml`. + Describe the consequences on existing workloads (e.g., if this is a runtime + feature, can it break the existing applications?). + + Yes. To downgrade the cluster to a version that does not support the scale-to-zero feature: + + 1. Make sure there are no HPA objects with minReplicas=0 and maxReplicas=0.
Here is a one-liner to update them to 1: + + `$ kubectl get hpa --all-namespaces --no-headers=true | awk '{if($6==0) printf "kubectl patch hpa/%s --namespace=%s -p \"{\\\"spec\\\":{\\\"minReplicas\\\":1,\\\"maxReplicas\\\":1}}\"\n", $2, $1 }' | sh` + 2. Disable the `HPAScaleToZero` feature gate + 3. In case step 1 has been omitted, workloads might be stuck with `replicas: 0` and need to be manually scaled up to `replicas: 1` to re-enable autoscaling. + +###### What happens if we reenable the feature if it was previously rolled back? + + Nothing, the feature can be re-enabled without problems. + +###### Are there any tests for feature enablement/disablement? + + + +There are currently unit tests for the alpha cases, and tests are planned to be added for the new functionality. + +### Rollout, Upgrade and Rollback Planning + + + +As this is a new field, every usage is opt-in. In case the Kubernetes version is downgraded, workloads currently scaled to 0 might need to be manually scaled to 1, as the controller would treat them as +paused otherwise. + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +There are no expected side-effects when the rollout fails, as the new `enableScaleToZero` flag should only be enabled once the version upgrade has completed and should be disabled before attempting a rollback. + +In case this is missed, HPA for deployments with zero replicas will be disabled until replicas have been raised explicitly to at least `1`. + +###### What specific metrics should inform a rollback? + + + +If workloads aren't scaled up from 0 despite the scaling condition being met, an operator should roll back this feature and manually scale an affected workload back to `1`. + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +Not yet, as no implementation is available. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+ + +Once this is in beta, the alpha flag can be removed. + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +The feature is used if workloads are scaled to zero by the autoscaling controller. + +###### How can someone using this feature know that it is working for their instance? + + + +In the same way autoscaling is confirmed today. + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +No changes to the autoscaling SLOs. + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +No changes to the autoscaling SLIs. + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +No, not with regard to this KEP. + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +The addition has the same dependencies as the current autoscaling controller. + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +No, the amount of autoscaling-related API calls will remain unchanged. No other components are affected. + +###### Will enabling / using this feature result in introducing new API types? + + + +No, this only modifies the existing API types. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +No, the amount of autoscaling-related cloud provider calls will remain unchanged. No other components are affected. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +Yes, one additional boolean field. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +No, there are no visible latency changes expected for existing autoscaling operations.
+ +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +No, there are no visible changes expected for existing autoscaling operations. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +- (2019/02/25) Original design doc: https://github.com/kubernetes/kubernetes/issues/69687#issuecomment-467082733 +- (2019/07/16) Alpha implementation (https://github.com/kubernetes/kubernetes/pull/74526) merged for Kubernetes 1.16 + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-autoscaling/2021-scale-from-zero/kep.yaml b/keps/sig-autoscaling/2021-scale-from-zero/kep.yaml new file mode 100644 index 00000000000..4ae6f65299c --- /dev/null +++ b/keps/sig-autoscaling/2021-scale-from-zero/kep.yaml @@ -0,0 +1,41 @@ +title: HPA supports scaling to/from zero pods for object/external metrics +kep-number: 2021 +authors: + - johanneswuerbach +owning-sig: sig-autoscaling +participating-sigs: +status: implementable +creation-date: "2020-09-26" +reviewers: + - TBD +approvers: + - TBD + +see-also: +replaces: + +# The target maturity stage in the current dev cycle for this KEP. +stage: beta + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.27" + +# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone: + alpha: "v1.16" + beta: "v1.27" + stable: "x.y" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: HPAScaleToZero + components: + - kube-apiserver +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + - TBD