Add user docs for pod priority and preemption (#5328)

* Add user docs for pod priority and preemption
* Update pod-priority-preemption.md
* More updates
kubernetes · Sep 13, 2017 · a30edd4 · a30edd4
1 parent 2a7a878
commit a30edd4
Showing 2 changed files with 245 additions and 0 deletions.
diff --git a/_data/concepts.yml b/_data/concepts.yml
@@ -61,6 +61,7 @@ toc:
   - docs/concepts/configuration/taint-and-toleration.md
   - docs/concepts/configuration/secret.md
   - docs/concepts/configuration/organize-cluster-access-kubeconfig.md
+  - docs/concepts/configuration/pod-priority-preemption.md
 
 - title: Services, Load Balancing, and Networking
   section:

diff --git a/docs/concepts/configuration/pod-priority-preemption.md b/docs/concepts/configuration/pod-priority-preemption.md
@@ -0,0 +1,244 @@
+---
+approvers:
+- davidopp
+- wojtek-t
+title: Pod Priority and Preemption
+---
+
+{% capture overview %}
+
+{% include feature-state-alpha.md %}
+
+[Pods](/docs/user-guide/pods) in Kubernetes 1.8 and later can have priority. Priority
+indicates the importance of a Pod relative to other Pods. When a Pod cannot be scheduled,
+the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the
+pending Pod possible. In a future Kubernetes release, priority will also affect
+out-of-resource eviction ordering on the Node.
+
+**Note:** Preemption does not respect PodDisruptionBudget; see 
+[the limitations section](#poddisruptionbudget-is-not-supported) for more details.
+{: .note}
+
+{% endcapture %}
+
+{% capture body %}
+
+## How to use priority and preemption
+To use priority and preemption in Kubernetes 1.8, follow these steps:
+
+1. Enable the feature.
+
+1. Add one or more PriorityClasses.
+
+1. Create Pods with `priorityClassName` set to one of the added PriorityClasses.
+You do not need to create the Pods directly; normally you would add
+`priorityClassName` to the Pod template of a collection object such as a Deployment.
+
+The following sections provide more information about these steps.
+
+## Enabling priority and preemption
+
+Pod priority and preemption is disabled by default in Kubernetes 1.8.
+To enable the feature, set this command-line flag for the API server 
+and the scheduler:
+
+```
+--feature-gates=PodPriority=true
+```
+
+Also set this flag for the API server:
+
+
+```
+--runtime-config=scheduling.k8s.io/v1alpha1=true
+```
+
+After the feature is enabled, you can create [PriorityClasses](#priorityclass)
+and create Pods with [`priorityClassName`](#pod-priority) set.
+
+If you try the feature and then decide to disable it, you must remove the PodPriority
+feature gate from the command line or set it to false, and then restart the API server
+and scheduler. After the feature is disabled, existing Pods keep their priority
+fields, but preemption is disabled, priority fields are ignored, and you
+cannot set `priorityClassName` in new Pods.
+
+## PriorityClass
+
+A PriorityClass is a non-namespaced object that defines a mapping from a priority
+class name to the integer value of the priority. The name is specified in the `name`
+field of the PriorityClass object's metadata. The value is specified in the required
+`value` field. The higher the value, the higher the priority. 
+
+A PriorityClass object can have any 32-bit integer value smaller than or equal to
+1 billion. Larger numbers are reserved for critical system Pods that should not
+normally be preempted or evicted. A cluster admin should create one PriorityClass
+object for each such mapping that they want.
+
+PriorityClass also has two optional fields: `globalDefault` and `description`.
+The `globalDefault` field indicates that the value of this PriorityClass should
+be used for Pods without a `priorityClassName`. Only one PriorityClass with
+`globalDefault` set to true can exist in the system. If there is no PriorityClass
+with `globalDefault` set, the priority of Pods with no `priorityClassName` is zero.
+
+The `description` field is an arbitrary string. It is meant to tell users of
+the cluster when they should use this PriorityClass.
+
+**Note 1**: If you upgrade your existing cluster and enable this feature, the priority
+of your existing Pods will be considered to be zero.
+{: .note}
+
+**Note 2**: Addition of a PriorityClass with `globalDefault` set to true does not
+change the priorities of existing Pods. The value of such a PriorityClass is used only
+for Pods created after the PriorityClass is added.
+{: .note}
+
+**Note 3**: If you delete a PriorityClass, existing Pods that use the name of the
+deleted priority class remain unchanged, but you are not able to create more Pods
+that use the name of the deleted PriorityClass.
+{: .note}
+
+### Example PriorityClass
+
+```yaml
+apiVersion: scheduling.k8s.io/v1alpha1
+kind: PriorityClass
+metadata:
+  name: high-priority
+value: 1000000
+globalDefault: false
+description: "This priority class should be used for XYZ service pods only."
+```
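+
+As a further sketch, a cluster-wide default could be declared by setting
+`globalDefault` to true. The name `default-priority` and the value 1000 are
+illustrative assumptions, not recommendations. The `apiVersion` is assumed to match
+the API group enabled with `--runtime-config`. Remember that only one PriorityClass
+with `globalDefault` set to true can exist.
+
+```yaml
+# Hypothetical default class: its value is used for Pods created
+# without a priorityClassName after this object is added.
+apiVersion: scheduling.k8s.io/v1alpha1
+kind: PriorityClass
+metadata:
+  name: default-priority   # illustrative name
+value: 1000                # illustrative value, well below 1 billion
+globalDefault: true
+description: "Default priority for Pods that do not specify a priority class."
+```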
+
+## Pod priority
+
+After you have one or more PriorityClasses, you can create Pods that specify one
+of those PriorityClass names in their specifications. The priority admission
+controller uses the `priorityClassName` field and populates the integer value
+of the priority. If the priority class is not found, the Pod is rejected.
+
+The following YAML is an example of a Pod configuration that uses the PriorityClass
+created in the preceding example. The priority admission controller checks the
+specification and resolves the priority of the Pod to 1000000.
+
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: nginx
+  labels:
+    env: test
+spec:
+  containers:
+  - name: nginx
+    image: nginx
+    imagePullPolicy: IfNotPresent
+  priorityClassName: high-priority
+```
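+
+Because Pods are usually created through a collection object, the same field
+normally goes in the Pod template. The following is a minimal sketch of a Deployment
+that does so. The name `nginx-deployment`, the `app: nginx` label, and the replica
+count are illustrative, and the `apiVersion` is an assumption that depends on your
+cluster version.
+
+```yaml
+apiVersion: apps/v1beta2   # assumption: use the Deployment API version your cluster serves
+kind: Deployment
+metadata:
+  name: nginx-deployment   # illustrative name
+spec:
+  replicas: 3
+  selector:
+    matchLabels:
+      app: nginx
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: nginx
+      # Every Pod created from this template gets the priority of this class.
+      priorityClassName: high-priority
+```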
+
+## Preemption
+
+When Pods are created, they go to a queue and wait to be scheduled. The scheduler
+picks a Pod from the queue and tries to schedule it on a Node. If no Node is found
+that satisfies all the specified requirements of the Pod, preemption logic is triggered 
+for the pending Pod. Let's call the pending Pod P. Preemption logic tries to find a Node
+where removal of one or more Pods with lower priority than P would enable P to be scheduled
+on that Node. If such a Node is found, one or more lower priority Pods get
+deleted from the Node. After the Pods are gone, P can be scheduled on the Node. 
+
+### Limitations of preemption (alpha version)
+
+#### Starvation of preempting Pod
+
+When Pods are preempted, the victims get their
+[graceful termination period](https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).
+They have that much time to finish their work and exit. If they don't, they are
+killed. This graceful termination period creates a time gap between the point
+that the scheduler preempts Pods and the time when the pending Pod (P) can be
+scheduled on the Node (N). In the meantime, the scheduler keeps scheduling other
+pending Pods. As victims exit or get terminated, the scheduler tries to schedule
+Pods in the pending queue, and one or more of them may be considered and
+scheduled to N before the scheduler considers scheduling P on N. In such a case,
+it is likely that when all the victims exit, Pod P won't fit on Node N anymore.
+So, the scheduler will have to preempt other Pods on Node N or another Node so that
+P can be scheduled. This scenario might be repeated again for the second and
+subsequent rounds of preemption, and P might not get scheduled for a while.
+This scenario can cause problems in various clusters, but is particularly
+problematic in clusters with a high Pod creation rate.
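+
+The graceful termination period that a victim receives comes from the
+`terminationGracePeriodSeconds` field of its Pod spec (30 seconds if unset). As a
+hedged sketch, a Pod that is expected to be preempted and can exit quickly might
+declare a shorter period, which would narrow, but not remove, the time gap described
+above. The Pod name and the 5-second value are illustrative assumptions.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: batch-worker               # illustrative name
+spec:
+  # Shorter than the 30-second default, so a preempted Pod frees its Node sooner.
+  terminationGracePeriodSeconds: 5 # illustrative value
+  containers:
+  - name: worker
+    image: nginx
+```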
+
+We will address this problem in the beta version of Pod preemption. The solution
+we plan to implement is
+[provided here](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-preemption.md#preemption-mechanics).
+
+#### PodDisruptionBudget is not supported
+
+A [Pod Disruption Budget (PDB)](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/)
+allows application owners to limit the number of Pods of a replicated application that
+are down simultaneously from voluntary disruptions. However, the alpha version of
+preemption does not respect PDB when choosing preemption victims.
+We plan to add PDB support in beta, but even in beta, respecting PDB will be best
+effort. The scheduler will try to find victims whose PDB won't be violated by preemption,
+but if no such victims are found, preemption will still happen, and lower-priority Pods
+will be removed despite their PDBs being violated.
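+
+For illustration, a PDB like the following sketch limits voluntary disruptions of
+Pods labeled `app: xyz`, yet in the alpha version preemption can still take the
+number of running Pods below `minAvailable`. The name, the label, the value, and the
+`policy/v1beta1` API version are assumptions made for this example.
+
+```yaml
+apiVersion: policy/v1beta1   # assumption: the PDB API version your cluster serves
+kind: PodDisruptionBudget
+metadata:
+  name: xyz-pdb              # illustrative name
+spec:
+  minAvailable: 2            # not honored by alpha preemption
+  selector:
+    matchLabels:
+      app: xyz               # illustrative label
+```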
+
+#### Inter-Pod affinity on lower-priority Pods
+
+In version 1.8, a Node is considered for preemption only when
+the answer to this question is yes: "If all the Pods with lower priority than
+the pending Pod are removed from the Node, can the pending Pod be scheduled on
+the Node?"
+
+**Note:** Preemption does not necessarily remove all lower-priority Pods. If the
+pending Pod can be scheduled by removing fewer than all lower-priority Pods, then
+only a portion of the lower-priority Pods are removed. Even so, the answer to the
+preceding question must be yes. If the answer is no, the Node is not considered
+for preemption.
+{: .note}
+
+If a pending Pod has inter-Pod affinity to one or more of the lower-priority Pods
+on the Node, the inter-Pod affinity rule cannot be satisfied in the absence of those
+lower-priority Pods. In this case, the scheduler does not preempt any Pods on the
+Node. Instead, it looks for another Node. The scheduler might find a suitable Node
+or it might not. There is no guarantee that the pending Pod can be scheduled.
+
+We might address this issue in future versions, but we don't have a clear plan yet.
+We will not consider it a blocker for Beta or GA. Part of the reason is that finding
+the set of lower-priority Pods that satisfy all inter-Pod affinity rules is
+computationally expensive and adds substantial complexity to the preemption logic.
+Besides, even if preemption keeps the lower-priority Pods to satisfy inter-Pod
+affinity, those Pods might be preempted later by other Pods, which removes the
+benefit of the complex logic for respecting inter-Pod affinity.
+
+Our recommended solution for this problem is to create inter-Pod affinity only towards
+Pods with equal or higher priority.
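+
+A minimal sketch of that recommendation follows. The Pod uses the `high-priority`
+class and declares affinity only to Pods labeled `app: cache`, which are assumed to
+run with the same class, so they are never lower-priority candidates for removal on
+this Pod's behalf. The Pod name, the label, and the topology key are assumptions for
+this example.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: web-frontend                 # illustrative name
+spec:
+  priorityClassName: high-priority
+  affinity:
+    podAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+      - labelSelector:
+          matchLabels:
+            app: cache               # assumed: these Pods also use high-priority
+        topologyKey: kubernetes.io/hostname
+  containers:
+  - name: web
+    image: nginx
+```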
+
+#### Cross node preemption
+
+Suppose a Node N is being considered for preemption so that a pending Pod P
+can be scheduled on N. P might become feasible on N only if a Pod on another
+Node is preempted. Here's an example:
+
+* Pod P is being considered for Node N.
+* Pod Q is running on another Node in the same zone as Node N.
+* Pod P has anti-affinity with Pod Q.
+* There are no other cases of anti-affinity between Pod P and other Pods in the zone.
+* To schedule Pod P on Node N, Pod Q would have to be preempted, but the scheduler
+does not perform cross-node preemption. So, Pod P will be deemed unschedulable
+on Node N.
+
+If Pod Q were removed from its Node, the anti-affinity violation would be gone,
+and Pod P could possibly be scheduled on Node N.
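+
+The scenario above could arise from a specification like the following sketch, in
+which Pod P declares zone-wide anti-affinity to Pod Q. The Pod name, the `app: pod-q`
+label assumed to be carried by Pod Q, and the zone label key are assumptions for
+this example; use whatever zone label your Nodes carry.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: pod-p                        # illustrative name for Pod P
+spec:
+  priorityClassName: high-priority
+  affinity:
+    podAntiAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+      - labelSelector:
+          matchLabels:
+            app: pod-q               # assumed label on Pod Q
+        # Zone-wide scope: Pod Q on any Node in the zone blocks Pod P on Node N.
+        topologyKey: failure-domain.beta.kubernetes.io/zone
+  containers:
+  - name: app
+    image: nginx
+```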
+
+We may consider adding cross-node preemption in future versions if we find an
+algorithm with reasonable performance. We cannot promise anything at this point,
+and cross-node preemption will not be considered a blocker for Beta or GA.
+
+{% endcapture %}
+
+{% capture whatsnext %}
+* Learn more about [this](...).
+* See this [related task](...).
+{% endcapture %}
+
+{% include templates/concept.md %}