KEP pid limiting
derekwaynecarr committed Feb 1, 2019
1 parent 46941fc commit 63d361a
Showing 1 changed file with 186 additions and 0 deletions.
186 changes: 186 additions & 0 deletions keps/sig-node/20190129-pid-limiting.md
@@ -0,0 +1,186 @@
---
kep-number: 34
title: Pid Limiting
authors:
- "@derekwaynecarr"
- "@dims"
owning-sig: sig-node
participating-sigs:
reviewers:
- "@dashpole"
approvers:
- "@dashpole"
- "@dchen1107"
editor: Derek Carr
creation-date: 2019-01-29
last-updated: 2019-01-29
status: implementable
see-also:
replaces:
superseded-by:
---

# Pid Limiting

## Table of Contents


* [Pid Limiting](#pid-limiting)
   * [Table of Contents](#table-of-contents)
   * [Summary](#summary)
   * [Motivation](#motivation)
      * [Goals](#goals)
      * [Non-Goals](#non-goals)
   * [Proposal](#proposal)
      * [User Stories [optional]](#user-stories-optional)
      * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional)
         * [Pod to Pod Isolation](#pod-to-pod-isolation)
         * [Node to Pod Isolation](#node-to-pod-isolation)
         * [Cgroup Enforcement](#cgroup-enforcement)
      * [Risks and Mitigations](#risks-and-mitigations)
   * [Graduation Criteria](#graduation-criteria)
      * [Pod to Pod pid isolation](#pod-to-pod-pid-isolation)
      * [Node to Pod pid isolation](#node-to-pod-pid-isolation)
   * [Implementation History](#implementation-history)
      * [Version 1.10](#version-110)
      * [Version 1.14](#version-114)

Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)

## Summary

A proposal to enable isolation of pid resources. It proposes a mechanism to
enable pod-to-pod PID isolation as well as node-to-pod PID isolation.

## Motivation

Pids are a fundamental resource on Linux hosts. It is trivial to hit the task
limit without hitting any other resource limit, which can destabilize the host
machine.

Administrators require mechanisms to ensure that user pods cannot induce pid
exhaustion that prevents host daemons (runtime, kubelet, etc.) from running. In
addition, it is important to limit pids per pod so that one pod has limited
impact on other workloads on the node.

### Goals

This proposal aims to do the following:
- enable administrator control to provide pod-to-pod pid isolation
- enable administrator control to provide node-to-pod pid isolation

### Non-Goals

This proposal defers the following:
- ability for a user to request an additional number of pids per pod

We anticipate supporting that via a policy knob that could be restricted and/or
defaulted via PodSecurityPolicy or LimitRange, and tracking that work under a
separate feature gate, `GranularPidLimitsPerPod`. Once that exists, any default
applied to pods today would be used only if the pod has no local pod pid
limiting policy.

## Proposal

### User Stories [optional]

1. Administrator can default the number of pids per pod to provide pod-to-pod
isolation.
1. Administrator can reserve a number of allocatable pids for user pods via node
allocatable.

### Implementation Details/Notes/Constraints [optional]

#### Pod to Pod Isolation

To enable pid isolation among pods, the `SupportPodPidsLimit` feature gate is
defined.

If enabled, the kubelet writes the pid limit configured via the `pod-max-pids`
argument to the pod-level cgroup on Linux hosts. If the value is -1, the
kubelet defaults to the node allocatable pid capacity.
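
As a rough illustration of the rule above, the sketch below maps a
`pod-max-pids` setting to the string that could be written to the pod-level
cgroup's `pids.max` file. The helper name `pidsMaxValue` is hypothetical and is
not kubelet code. Treating every non-positive value (not only -1) as unbounded
is an assumption of this sketch.

```go
package main

import "strconv"

// pidsMaxValue returns the text that could be written to a pod-level cgroup's
// pids.max file for a given pod-max-pids setting.
// Hypothetical helper for illustration only; not taken from the kubelet.
func pidsMaxValue(podMaxPids int64) string {
	if podMaxPids > 0 {
		// A positive value becomes an explicit limit on the pod cgroup.
		return strconv.FormatInt(podMaxPids, 10)
	}
	// Assumption: -1 (or any non-positive value) leaves the pod cgroup
	// unbounded. "max" is the value the pids controller uses for "no limit",
	// so the pod is then constrained only by its parent cgroups.
	return "max"
}
```

With these assumptions, `pidsMaxValue(1024)` yields `"1024"` and
`pidsMaxValue(-1)` yields `"max"`.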

#### Node to Pod Isolation

To enable pid isolation from node to pods, the `SupportNodePidsLimit` feature
gate is proposed. If enabled, pid reservations may be configured through the
node allocatable and eviction manager subsystems.

Node allocatable is a well-established feature concept in the kubelet that
allows isolation of user pod resources from host daemons at the `kubepods`
cgroup level that parents all end-user pods.

The kubelet will be updated to support reservation of pids, so that the
effective pid limit for user pods is computed as follows:

```
[Allocatable] = [Node Capacity] -
[Kube-Reserved] -
[System-Reserved] -
[Hard-Eviction-Threshold]
```
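
The formula above can be sketched as a small calculation. The type and function
names (`PidReservations`, `allocatablePids`) are hypothetical, and clamping a
negative result to zero is an assumption of this sketch, not something the
proposal specifies.

```go
package main

// PidReservations holds the pid amounts subtracted from node capacity.
// Hypothetical type for illustration; field names mirror the formula terms.
type PidReservations struct {
	KubeReserved          int64
	SystemReserved        int64
	HardEvictionThreshold int64
}

// allocatablePids applies
//   Allocatable = Capacity - KubeReserved - SystemReserved - HardEvictionThreshold
// and clamps the result at zero (an assumption of this sketch).
func allocatablePids(capacity int64, r PidReservations) int64 {
	a := capacity - r.KubeReserved - r.SystemReserved - r.HardEvictionThreshold
	if a < 0 {
		return 0
	}
	return a
}
```

With illustrative numbers, a capacity of 32768 pids with 1000 kube-reserved,
2000 system-reserved, and a hard eviction threshold of 500 leaves 29268 pids
allocatable to user pods.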

#### Cgroup Enforcement

To use this feature, the `--cgroups-per-qos` flag must be enabled. In addition,
the `pids` cgroup must be mounted.
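
One way to check the second prerequisite is to look for a cgroup mount that
carries the `pids` controller. The sketch below scans text in `/proc/mounts`
format. The function name `pidsCgroupMounted` is hypothetical, and the check
assumes a cgroup v1 layout, where each controller appears as a mount option. It
does not describe how the kubelet itself performs this validation.

```go
package main

import (
	"bufio"
	"strings"
)

// pidsCgroupMounted reports whether mounts (text in /proc/mounts format)
// contains a cgroup v1 mount that includes the pids controller.
// Hypothetical helper for illustration only.
func pidsCgroupMounted(mounts string) bool {
	sc := bufio.NewScanner(strings.NewReader(mounts))
	for sc.Scan() {
		// Fields: device, mount point, filesystem type, options, dump, pass.
		fields := strings.Fields(sc.Text())
		if len(fields) < 4 || fields[2] != "cgroup" {
			continue
		}
		// On cgroup v1 the attached controllers are listed among the options.
		for _, opt := range strings.Split(fields[3], ",") {
			if opt == "pids" {
				return true
			}
		}
	}
	return false
}
```

A caller would typically pass the contents of `/proc/mounts` read with
`os.ReadFile`. Keeping the parser separate from the file read makes it easy to
exercise with sample input.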

The `kubepods` cgroup is bounded by the `Allocatable` value.

The QoS level cgroups are left unbounded across all pid pool sizes.

The pid limit on the pod-level cgroup sandbox is configured as follows:

1. the `pod-max-pids` value, if it is positive and specified in the kubelet configuration
1. the local pod pid limiting policy (future)
1. unbounded (so the pod is restricted only by the `Allocatable` value at `kubepods`)
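
Because the pids controller enforces limits hierarchically, the tightest limit
a pod's processes can hit is the smallest bounded limit on the path from
`kubepods` through the QoS cgroup to the pod cgroup. The sketch below models
that; the names `unbounded` and `effectiveLimit` are hypothetical, and -1 as
the "no limit" marker is a convention of this sketch.

```go
package main

// unbounded marks a cgroup level with no pid limit (a convention of this sketch).
const unbounded int64 = -1

// effectiveLimit returns the smallest bounded limit along a cgroup path,
// ordered from the outermost cgroup (kubepods) to the innermost (the pod).
// It returns unbounded if no level sets a limit.
// Hypothetical helper for illustration only.
func effectiveLimit(path []int64) int64 {
	eff := unbounded
	for _, limit := range path {
		if limit == unbounded {
			continue // this level imposes no limit
		}
		if eff == unbounded || limit < eff {
			eff = limit
		}
	}
	return eff
}
```

For a path of `{29268, unbounded, 1024}` (kubepods, QoS, pod) the result is
1024. With the pod level unbounded it is 29268. The `kubepods` limit is shared
by all pods, so in that second case the value is an upper bound for the pod, not
a guarantee.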

### Risks and Mitigations

None

## Graduation Criteria

### Pod to Pod pid isolation

The following criteria apply to the `SupportPodPidsLimit` feature gate:

Alpha
- basic support integrated in kubelet

Beta
- ensure proper node e2e test coverage is integrated verifying cgroup settings
- see testing:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/pids_test.go

GA
- assuming no negative user feedback based on production experience, promote
after 2 releases in beta.

### Node to Pod pid isolation

Adding support for pid limiting at the Node Allocatable level.

The following criteria apply to the `SupportNodePidsLimit` feature gate:

Alpha
- basic support integrated via eviction manager and/or node allocatable level

Beta
- ensure proper node e2e testing coverage to ensure a pod is unable to fork-bomb
a node even when `pod-max-pids` is unbounded.

GA
- assuming no negative user feedback, promote after 1 release at beta.

## Implementation History

### Version 1.10

`SupportPodPidsLimit` implemented at Alpha.

### Version 1.14

- Plan to implement `SupportNodePidsLimit` as Alpha.
- Graduate `SupportPodPidsLimit` to Beta by adding node e2e test coverage for
pid cgroup isolation and ensuring PidPressure works as intended.
