diff --git a/keps/sig-node/20190129-pid-limiting.md b/keps/sig-node/20190129-pid-limiting.md new file mode 100644 index 00000000000..d45d7ef6916 --- /dev/null +++ b/keps/sig-node/20190129-pid-limiting.md @@ -0,0 +1,186 @@ +--- +kep-number: 34 +title: Pid Limiting +authors: + - "@derekwaynecarr" + - "@dims" +owning-sig: sig-node +participating-sigs: +reviewers: + - "@dashpole" +approvers: + - "@dashpole" + - "@dchen1107" +editor: Derek Carr +creation-date: 2019-01-29 +last-updated: 2019-01-29 +status: implementable +see-also: +replaces: +superseded-by: +--- + +# Pid Limiting + +## Table of Contents + + + * [Pid Limiting](#pid-limiting) + * [Table of Contents](#table-of-contents) + * [Summary](#summary) + * [Motivation](#motivation) + * [Goals](#goals) + * [Non-Goals](#non-goals) + * [Proposal](#proposal) + * [User Stories [optional]](#user-stories-optional) + * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) + * [Pod to Pod Isolation](#pod-to-pod-isolation) + * [Node to Pod Isolation](#node-to-pod-isolation) + * [Cgroup Enforcement](#cgroup-enforcement) + * [Risks and Mitigations](#risks-and-mitigations) + * [Graduation Criteria](#graduation-criteria) + * [Pod to Pod pid isolation](#pod-to-pod-pid-isolation) + * [Node to Pod pid isolation](#node-to-pod-pid-isolation) + * [Implementation History](#implementation-history) + * [Version 1.10](#version-110) + * [Version 1.14](#version-114) + +Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc) + +## Summary + +A proposal to enable isolation of pid resources. It proposes a mechanism to +enable pod-to-pod PID isolation as well as node-to-pod PID isolation. + +## Motivation + +Pids are a fundamental resource on Linux hosts. It is trivial to hit the task +limit without hitting any other resource limits and cause instability to a host +machine. + +Administrators require mechanisms to ensure that user pods cannot induce pid +exhaustion that prevents host daemons (runtime, kubelet, etc) from running. In +addition, it is important to ensure that pids are limited among pods in order to +ensure they have limited impact to other workloads on the node. + +### Goals + +This proposal aims to the following: +- enable administrator control to provide pod-to-pod pid isolation +- enable administrator control to provide node-to-pod pid isolation + +### Non-Goals + +This proposal defers the following: +- ability for a user to request additional number of pid resources per pod + +It is anticipated we will support that via a policy knob that could be +restricted and/or defaulted via PodSecurityPolicy or LimitRange. We anticipate +tracking this work under a separate feature gate `GranularPidLimitsPerPod`. Any +defaulting applied to pods today would only be used if the pod had no local pod +pid limiting policy in future dates. + +## Proposal + +### User Stories [optional] + +1. Administrator can default the number of pids per pod to provide pod-to-pod + isolation. +1. Administrator can reserve a number of allocatable pids to user pods via node + allocatable. + +### Implementation Details/Notes/Constraints [optional] + +#### Pod to Pod Isolation + +To enable pid isolation among pods, the `SupportPodPidsLimit` feature gate is +defined. + +If enabled, the kubelet argument for `pod-max-pids` will write out the +configured pid limit to the pod level cgroup to the value specified on Linux +hosts. If -1, the kubelet will default to the node allocatable pid capacity. + +#### Node to Pod Isolation + +To enable pid isolation from node to pods, the `SupportNodePidsLimit` feature +gate is proposed. If enabled, pid reservations may be supported at the node +allocatable and eviction manager subsystem configurations. + +Node allocatable is a well-established feature concept in the kubelet that +allows isolation of user pod resources from host daemons at the `kubepods` +cgroup level that parents all end-user pods. + +The kubelet will be updated to support reservation of pids so the effective pid +limit is enabled as follows: + +``` +[Allocatable] = [Node Capacity] - + [Kube-Reserved] - + [System-Reserved] - + [Hard-Eviction-Threshold] +``` + +#### Cgroup Enforcement + +To use this feature, the `--cgroups-per-qos` must be enabled. In addition, the +`pids` cgroup must be mounted. + +The `kubepods` cgroup is bounded by the `Allocatable` value. + +The QoS level cgroups are left unbounded across all pid pool sizes. + +The pod level cgroup sandbox is configured as follows: + +1. the pod-max-pids value if positive and is specified on kubelet config +1. the local pod pid limiting policy (future) +1. unbounded (so it is restricted by the `Allocatable` value at `kubepods`) + +### Risks and Mitigations + +None + +## Graduation Criteria + +### Pod to Pod pid isolation + +The following criteria applies to `SupportPodPidsLimit` feature gate: + +Alpha +- basic support integrated in kubelet + +Beta +- ensure proper node e2e test coverage is integrated verifying cgroup settings +- see testing: +https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/pids_test.go + +GA +- assuming no negative user feedback based on production experience, promote + after 2 releases in beta. + +### Node to Pod pid isolation + +Adding support for pid limiting at the Node Allocatable level + +The following criteria applies to `SupportNodePidsLimit`: + +Alpha +- basic support integrated via eviction manager and/or node allocatable level + +Beta +- ensure proper node e2e testing coverage to ensure a pod is unable to fork-bomb + a node even when `pod-max-pids` is unbounded. + +GA +- assuming no negative user feedback, promote after 1 release at beta. + +## Implementation History + +### Version 1.10 + +`SupportPodPidsLimit` implemented at Alpha. + +### Version 1.14 + +- Plan to implement `SupportNodePidsLimit` as Alpha. +- Graduate `SupportPodPidsLimit` to Beta by adding node e2e test coverage for + pid cgroup isolation, ensure PidPressure works as intended. \ No newline at end of file