
Reasonable defaults with eviction and PodDisruptionBudget #35318

Open
foxish opened this issue Oct 21, 2016 · 50 comments
Labels: area/node-lifecycle, area/stateful-apps, lifecycle/frozen, sig/apps

@foxish (Contributor) commented Oct 21, 2016

We are moving to have kubectl drain use the eviction subresource (#34277) and honor PodDisruptionBudget.
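For context, an eviction request is an Eviction object POSTed to the pod's eviction subresource. A minimal sketch, assuming policy/v1beta1 and a placeholder pod name:

# POST /api/v1/namespaces/default/pods/web-0/eviction
apiVersion: policy/v1beta1
kind: Eviction
metadata:
  name: web-0          # must match the pod being evicted (placeholder)
  namespace: default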
The comment below from @erictune better captures this requirement.

There should be a default behavior for disruption budgets.
The default should optimize for these things:

  1. Users should not need to create, view, or even know about the PodDisruptionBudget resource, and still get "reasonable" default behaviors.
  2. The default behavior should be reasonable for a wide variety of app use cases.
  3. The default behavior could be expressed succinctly to users.
  4. Cluster admins should be able to upgrade nodes quickly in clusters that have typical apps with default budgets, without interacting with app owners.

Some options:

  1. All pods in the cluster that match no user-created disruption budget have a budget of 1 at a time.
  • could be too conservative to allow for parallel node drains.
  2. All pods in a namespace that match no user-created disruption budget have a budget of 1 at a time.
  • less conservative than 1
  • probably not too conservative for systems with many namespaces relative to number of nodes.
  3. Each set of pods with the same controller and that match no user-created disruption budget, have a disruption budget of 1 at a time.
  • assuming spreading of collections by scheduler, parallel node drains likely to be possible
  • the disruption controller already computes this grouping of pods, sort of.
  4. Each set of pods with the same service, that match no user-created disruption budget, have a disruption budget of 1 at a time.
  • matches a default spreading behavior by the scheduler, so unlikely to have two on the same machine from same service, so parallel node drains likely to be possible.

The actual implementation could be to have a controller that makes PDBs for things without them and then cleans them up when not needed anymore, or for the eviction logic to just compute these sets, but not expose them.
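For illustration, the implicit per-controller budget in option 3 would behave like a PDB of roughly this shape, stamped out for each controller (a hypothetical sketch, not a concrete proposal; the name, label, and counts are placeholders):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: default-pdb-frontend   # hypothetical auto-generated name
  namespace: default
spec:
  minAvailable: 2              # replicas - 1 for a 3-replica controller: one disruption at a time
  selector:
    matchLabels:
      app: frontend            # placeholder: the controller's pod selector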

cc @erictune @davidopp @kow3ns @janetkuo @caesarxuchao @ymqytw @kubernetes/sig-apps

@davidopp (Member) commented Oct 21, 2016

I'm on the fence on this one. Would be interested to hear other folks' opinions. Here are my thoughts:

An alternative to your proposal, which would have the same effect, would be to add an admission controller that adds a PDB to every collection that you think should have the rule you want (e.g. max 1 eviction at a time). The downside of this would be a proliferation of PDBs.

If we implement your proposal, clients that want to get the behavior they get today in the absence of a PDB, instead of the new behavior you're introducing, could just use regular delete. (Maybe their logic could be: try eviction subresource first, if it is rejected, then just do regular delete. That way they wouldn't need to look up whether the pod is covered by a PDB first.) So it wouldn't be a big inconvenience.

The one case I am worried about with your proposal is pods that are managed by a controller with replicas=1. Such a pod would block removing the node, even in the absence of a PDB. Maybe that is the right thing to do, but it would be better if we knew the user's intent, and the only way to know that would be to require them to put a PDB on it if they really don't want to allow such a node to be removed.

I guess I am leaning slightly towards endorsing your suggestion but there may be other issues I didn't think of.

cc/ @mml

@smarterclayton (Contributor) commented Oct 21, 2016

What class of things run as rc=1, ps=1?

Important:
  • Databases
  • Leader services
  • The B of an A/B test

Not important:
  • Testing instances of production services
  • Random stuff you start under an rs (so it will survive node death) but which you then forget about.

In large clusters the latter are going to dwarf the former (many OpenShift clusters today are 95% the latter). If this change added the PR above, we'd probably revert it, because it would break all of those very large, dense clusters (eviction would be impossible). On small clusters it's much more likely you have the former, in which case this might be a reasonable default. But if you have a database or leader service and it doesn't have a PDB, you're just not using the system the way you should.

The only downside today is that eviction can be surprising (if you are relying on emptyDir), but you can't prevent that even with a PDB, so I don't think it's unreasonable to ask users to use a PDB to avoid disruption.

@smarterclayton (Contributor):

A global config flag to opt in to this makes sense, but I don't think that rs=1 or ps=1 implies a disruption budget.

@davidopp (Member):

(Maybe their logic could be: try eviction subresource first, if it is rejected, then just do regular delete. That way they wouldn't need to look up whether the pod is covered by a PDB first.) So it wouldn't be a big inconvenience.

Sorry, what I said here was wrong. If we implement your proposal, clients that want to always delete in the absence of a PDB but respect the PDB if it exists (which is the behavior they get today) would have to look up whether the pod is covered by a PDB first, and use that to decide whether to use /eviction or regular delete.

A global config flag to opt in to this makes sense

I guess a config flag to choose the behavior would be one approach, though I always worry about the support implications of making core behaviors configurable...

@erictune (Member):

There should be a default behavior for disruption budgets.

The default should optimize for these things:

  1. Users should not need to create, view, or even know about the PodDisruptionBudget resource, and still get "reasonable" default behaviors.
  2. The default behavior should be reasonable for a wide variety of app use cases.
  3. The default behavior could be expressed succinctly to users.
  4. Cluster admins should be able to upgrade nodes quickly in clusters that have typical apps with default budgets, without interacting with app owners.

Some options:

  1. All pods in the cluster that match no user-created disruption budget have a budget of 1 at a time.
    • could be too conservative to allow for parallel node drains.
  2. All pods in a namespace that match no user-created disruption budget have a budget of 1 at a time.
    • less conservative than 1
    • probably not too conservative for systems with many namespaces relative to number of nodes.
  3. Each set of pods with the same controller and that match no user-created disruption budget, have a disruption budget of 1 at a time.
    • assuming spreading of collections by scheduler, parallel node drains likely to be possible
    • the disruption controller already computes this grouping of pods, sort of.
  4. Each set of pods with the same service, that match no user-created disruption budget, have a disruption budget of 1 at a time.
    • matches a default spreading behavior by the scheduler, so unlikely to have two on the same machine from same service, so parallel node drains likely to be possible.

The actual implementation could be to have a controller that makes PDBs for things without them and then cleans them up when not needed anymore, or for the eviction logic to just compute these sets, but not expose them.

@foxish foxish changed the title Eviction without a specified PodDisruptionBudget should default to maxUnavailable of 1 Reasonable defaults with eviction and PodDisruptionBudget Oct 22, 2016
@davidopp (Member) commented Oct 22, 2016

I would add one more criterion: the intended behavior of the system should be as simple to understand as possible. In this vein, I think saying "absence of a PDB means unlimited evictions" is less surprising than "absence of a PDB is like presence of a PDB tied to [...]". Thus we should make the behavior explicit by requiring a PDB whenever you want to limit evictions, at least from the standpoint of this criterion, which I think is an important one for debuggability and for administrators to understand what's going on.

As for the options, it seems like (3) and (4) are clearly better than (1) and (2), or maybe I am missing something? I think maybe the right way to think about this feature in general is that one of the characteristics of a controller should be that it has a default policy for disruptions. Most controllers probably won't do anything; the default is "eviction is always allowed." Some, like PetSet, might want a policy (e.g. at most one down from the set at a time), and they can use an admission controller to apply it.

The main issue I can see with this idea of requiring the default policies to be materialized is that if the user wants their own policy, they would presumably have to delete the PDB that was created by the admission controller before adding their own, which is annoying. The benefit of building a default policy into the PDB controller itself is that a user-supplied PDB can simply override the built-in default, and the user doesn't have to delete anything to apply their own. But if you build the default policy into the PDB controller, you can't easily customize it on a per-kind basis (the creator of a new kind would have to modify the PDB controller when they build a custom controller... ugh).

So both options (absence of PDB means unlimited evictions, and absence of PDB means some default policy is applied) seem to have downsides.

But because I think the default policy should probably depend on the type of controller, I think (3) is better than (4).

@davidopp (Member) commented Oct 22, 2016

A couple of other reasons why (3) seems better than (4):

  • user may create service after creating controller, so there will be a period where the collection would be unprotected if you tie the PDB to the service (of course if the user wants to do it manually, that's fine, they are involved in the loop)
  • a pod may be in more than one service, but never have more than one controller; today PDB controller can't handle a pod being in more than one PDB-protected collection (and I'm not sure we want to ever allow it; the rules get dicey from a usability standpoint... it's basically an "and" of whether you're allowed by each PDB you're in, but I think this is a recipe for confusion, and race conditions debiting multiple PDBs for one eviction due to no transactions, etc.).

@kow3ns (Member) commented Oct 24, 2016

I think (3) provides a sane, easily comprehensible default behavior:

  1. It is generally safe. I can't think of a use case for controlled replication where the tenant application will not tolerate a single eviction. I'd argue that if your application isn't tolerant of a single eviction when it is otherwise healthy, then it also isn't tolerant of a single node failure. If someone really does have this issue, they could explicitly set the maximum disruption budget to 0 to indicate that no Pod for this application should ever be purposefully evicted (see the sketch after this list).
  2. It should allow for enough concurrent node drains, regardless of the kind of the controller, provided the user has a reasonable spread of the Pods associated with an individual controller.
  3. It can be easily summarized as "The system default behavior will allow at most one concurrent eviction per controller unless an explicit disruption budget is specified".
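For reference, "no Pod should ever be purposefully evicted" can be expressed as a zero budget; a minimal sketch (placeholder name and label; maxUnavailable is assumed here, and minAvailable set to the full replica count is equivalent):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: never-evict            # placeholder name
spec:
  maxUnavailable: 0            # reject every voluntary eviction
  selector:
    matchLabels:
      app: critical-singleton  # placeholder: the application's pod labels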

@mml (Contributor) commented Oct 24, 2016

I can think of cases where this default would be a pain in the neck, so I agree with @smarterclayton that you should be able to toggle with a flag. I think opt-out (i.e. default to on) is fine.

Long run, we had previously discussed a way of generating template PDBs that some controller would use to generate real PDBs for matching controllers. The template PDBs could be thought of as specific service levels. E.g., for batch workloads an admin might want maxUnavailable of 95% or even 100%.
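For instance, a batch-tier service level like the one described above might materialize as a PDB of this shape (a hypothetical sketch; the name, label, and percentage are illustrative):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: batch-service-level    # hypothetical name stamped out from a template PDB
spec:
  maxUnavailable: "95%"        # batch workloads tolerate nearly full disruption
  selector:
    matchLabels:
      tier: batch              # placeholder: label identifying batch workloads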

Is it worth it to implement this system instead of these hardcoded defaults and a flag?

@davidopp (Member):

I can think of cases where this default would be a pain in the neck, so I agree with @smarterclayton that you should be able to toggle with a flag. I think opt-out (i.e. default to on) is fine.

You mean toggle on a per-cluster basis? Isn't that going to cause a support nightmare?

Long run, we had previously discussed a way of generating template PDBs that some controller would use to generate real PDBs for matching controllers. The template PDBs could be thought of as specific service levels. E.g., for batch workloads an admin might want maxUnavailable of 95% or even 100%.

I admit I don't remember this discussion. Is the idea basically the same as (3), where you can set a per-kind default PDB rule?

Is it worth it to implement this system instead of these hardcoded defaults and a flag?

If each kind has an admission controller that creates the default PDB, you could set the default policy there, right?

@kow3ns (Member) commented Oct 24, 2016

@mml What behavior would you expect on opt-out? That eviction with no PDB is equivalent to deletion?
@smarterclayton If the default PDB is not global, but applied to all Pods in the same RS or PS as in (3) proposed by @erictune, will that still be impossible for large, dense clusters? Is your concern here primarily that evictions make progress? For RS=1 or PS=1, using the default behavior described in (3), the single Pod will be immediately evicted and rescheduled.


@smarterclayton (Contributor) commented Oct 25, 2016 via email

Is 3 disruptive to anyone with a large cluster today? Trying to consider how someone may be rotating nodes in and out en masse where this would be surprising / a problem. The solution is fairly easy (to create a PDB).


@smarterclayton (Contributor):

Agree 3 sounds better than 4, and that both sound (at least naively) better than 1 and 2.


@smarterclayton (Contributor):

Yeah, 3 addresses the large, dense clusters case pretty well. I think it's a good line between safety and forward progress, and I will validate it with some operators.

I think 3 is probably good enough that I don't need the default, but I defer to others who may be more concerned.


@erictune (Member):

It doesn't look like we are going to be able to decide on a default behavior in time for 1.5, and after talking about this more, I'm less sure that there is a single default that makes sense for everyone.

Still, I'd like to leave open the possibility of a default behavior in the future. The way to do this is to include a line like the following in the documentation for the /eviction endpoint: "We may add a default behavior in the future. Clients should not be surprised if the /eviction endpoint returns a 429 for a pod without an explicit PodDisruptionBudget in the future."

I mentioned this to @davidopp or @mwielgus already.

@smarterclayton (Contributor):

Sgtm


@erictune (Member):

We will recommend that users in production scenarios create their own PDB as part of the petset config bundle. We will recommend that template authors provide a PDB too. Assuming #34776 is implemented, this can be independent of the scale for scalable petsets.
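For example, such a bundled PDB might look like the following (a sketch; the app name, labels, and counts are illustrative):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: mysql-pdb              # illustrative: shipped alongside the petset manifest
spec:
  minAvailable: 2              # for a 3-replica petset: at most one disruption at a time
  selector:
    matchLabels:
      app: mysql               # placeholder: must match the petset's pod labels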

@erictune erictune added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Nov 4, 2016
@davidopp (Member) commented Nov 6, 2016

Personally I don't think there is a reasonable default PDB behavior that applies across all controllers.

If PDBs are to be auto-generated, I think we need to figure out which controller auto-generates them. (I do like @erictune's idea to have controllers generate them better than my idea to have an admission controller do it.)

For StatefulSet it seems pretty straightforward. And the user could presumably provide a hint in the Spec if necessary.

For stateless, I'm not sure if the ReplicaSet, Deployment, or Service should be generating the PDB.

@smarterclayton (Contributor):

Do we have an easy create command / explanation for PDB yet? Some of this can just be bulk creation of a PDB:

kubectl get svc -o name | xargs -n 1 kubectl create pdb --max=1

Making that work well goes a long way, and it's trivial to create an observer controller to set up a default PDB for all services.


@enisoc (Member) commented Jun 9, 2017

Removed from v1.7 since it doesn't seem to be imminent.

@k8s-github-robot:

[MILESTONENOTIFIER] Milestone Labels Incomplete

@foxish @kow3ns @mwielgus

Action required: This issue requires label changes. If the required changes are not made within 3 days, the issue will be moved out of the v1.8 milestone.

kind: Must specify at most one of ['kind/bug', 'kind/feature', 'kind/cleanup'].
priority: Must specify at most one of ['priority/critical-urgent', 'priority/important-soon', 'priority/important-longterm'].

Additional instructions available here

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2018
@fejta-bot:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 10, 2018
@fejta-bot:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@kierenj commented Jun 20, 2019

I can see how specifying a PDB specifies behaviour, but looking through the docs (and this issue) I can't work out what the default behaviour is in the absence of a PDB.

By default, is there no limitation on disruption?

I spent some time looking for it, but hopefully I just missed it - where could I have found this info in the docs?

@mikkeloscar (Contributor):

There are no default PDBs, so full disruption is allowed. We use this controller to provide sensible defaults for our users: https://github.com/mikkeloscar/pdb-controller

@andig commented Jun 20, 2019

There are no default PDBs so full disruption is allowed.

I'm not a k8s pro, but my feeling is that without a PDB no disruption is allowed. For example, I've noticed the autoscaler not scaling down otherwise empty nodes when the Pods were missing PDBs. But then I might also be entirely wrong...

@mikkeloscar (Contributor):

The autoscaler might handle this in a special way by itself. For instance, when a pod doesn't have a "parent" like a ReplicaSet or StatefulSet, it will not terminate the pod, even though the eviction API would allow it. But this is special logic in the autoscaler.

@0xmichalis (Contributor) commented Jun 20, 2019 via email
