
delete pods API call latencies shot up on large cluster tests #51899

Closed
shyamjvs opened this issue Sep 4, 2017 · 142 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. milestone/needs-approval priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
Milestone

Comments

@shyamjvs
Member

shyamjvs commented Sep 4, 2017

Updated issue description with latest findings:

#50350 introduced a change to kubelet pod deletion that results in delete pod API calls from kubelets being concentrated immediately after container garbage collection.

When performing a deletion of large numbers (thousands) of pods across large numbers (hundreds) of nodes, the resulting concentrated delete calls from the kubelets cause increased latency of delete pods API calls above the target threshold:

Sep 27 14:34:14.484: INFO: WARNING Top latency metric: {Resource:pods Subresource: Verb:DELETE Latency:{Perc50:4.551ms Perc90:10.216ms Perc99:1.403002s Perc100:0s} Count:122966}

Seen on https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/

Not seen on https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/

Hoisted details from #51899 (comment) into the description:
Graphs over a three-minute window (both latency in ms and distribution of delete calls per second):

The delete calls from the GC controller merely set deletionTimestamp on the pod (which the kubelets observe and then start shutting the pods down). The delete calls from the kubelet actually delete the object from etcd.

[screenshot: delete pod call latency in ms over the three-minute window]

[screenshot: distribution of delete calls per second over the same window]

There were multiple spikes throughout the run (corresponding to deletions of 3000-pod replication controllers). Here's a graph of delete calls that took more than one second across the whole run:
[screenshot: delete calls that took more than one second, across the whole run]

Seems like a kubelet-induced thundering herd when deleting massive numbers of pods across massive numbers of nodes. Kubemark's GC is stubbed since there are no real containers, so this isn't observed there.
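For context, the two kinds of DELETE calls differ only in the grace period they request. Here's a minimal client-go sketch of the distinction (modern client-go signatures, placeholder pod name and namespace; not the actual controller or kubelet code):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// GC-controller-style delete: graceful, so the apiserver only sets
	// metadata.deletionTimestamp; the kubelet sees it and starts shutting the pod down.
	_ = client.CoreV1().Pods("default").Delete(ctx, "example-pod", metav1.DeleteOptions{})

	// Kubelet-style final delete: grace period 0 once the containers are gone,
	// which actually removes the object from etcd.
	zero := int64(0)
	_ = client.CoreV1().Pods("default").Delete(ctx, "example-pod", metav1.DeleteOptions{
		GracePeriodSeconds: &zero,
	})
}
```

With thousands of pods deleted at once, it's the second kind of call, issued by every kubelet right after its GC pass, that piles up.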

Original issue description follows:

From the 5k-node density test run (no. 26) last Friday:

I0901 10:11:29.395] Sep  1 10:11:29.395: INFO: WARNING Top latency metric: {Resource:pods Subresource: Verb:LIST Latency:{Perc50:66.805ms Perc90:375.804ms Perc99:8.111217s Perc100:0s} Count:11911}
I0901 10:11:29.396] Sep  1 10:11:29.395: INFO: Top latency metric: {Resource:nodes Subresource: Verb:LIST Latency:{Perc50:417.103ms Perc90:1.090847s Perc99:2.336106s Perc100:0s} Count:1261}
I0901 10:11:29.396] Sep  1 10:11:29.395: INFO: WARNING Top latency metric: {Resource:nodes Subresource:status Verb:PATCH Latency:{Perc50:5.633ms Perc90:279.15ms Perc99:1.04109s Perc100:0s} Count:5543730}
I0901 10:11:29.396] Sep  1 10:11:29.395: INFO: Top latency metric: {Resource:pods Subresource:binding Verb:POST Latency:{Perc50:1.442ms Perc90:15.171ms Perc99:921.523ms Perc100:0s} Count:155096}
I0901 10:11:29.396] Sep  1 10:11:29.395: INFO: Top latency metric: {Resource:replicationcontrollers Subresource: Verb:POST Latency:{Perc50:3.13ms Perc90:218.472ms Perc99:886.065ms Perc100:0s} Count:5049}

And from the last healthy run (no. 23) of the test:

I0826 03:49:44.301] Aug 26 03:49:44.301: INFO: Top latency metric: {Resource:pods Subresource: Verb:LIST Latency:{Perc50:830.411ms Perc90:1.468316s Perc99:2.627309s Perc100:0s} Count:10615}
I0826 03:49:44.302] Aug 26 03:49:44.301: INFO: Top latency metric: {Resource:nodes Subresource: Verb:LIST Latency:{Perc50:337.423ms Perc90:427.443ms Perc99:998.426ms Perc100:0s} Count:1800}
I0826 03:49:44.302] Aug 26 03:49:44.301: INFO: Top latency metric: {Resource:apiservices Subresource: Verb:PUT Latency:{Perc50:954µs Perc90:5.726ms Perc99:138.528ms Perc100:0s} Count:840}
I0826 03:49:44.302] Aug 26 03:49:44.301: INFO: Top latency metric: {Resource:namespaces Subresource: Verb:GET Latency:{Perc50:445µs Perc90:2.641ms Perc99:83.57ms Perc100:0s} Count:1608}
I0826 03:49:44.302] Aug 26 03:49:44.301: INFO: Top latency metric: {Resource:nodes Subresource:status Verb:PATCH Latency:{Perc50:3.22ms Perc90:9.449ms Perc99:55.588ms Perc100:0s} Count:7033941}

These are huge increases we're seeing:
LIST pods : 2.6s -> 8.1s
LIST nodes: 1s -> 2.3s
PATCH node-status: 56ms -> 1s
...

cc @kubernetes/sig-api-machinery-bugs @kubernetes/sig-scalability-misc @smarterclayton @wojtek-t @gmarek

@shyamjvs shyamjvs added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-blocker labels Sep 4, 2017
@shyamjvs shyamjvs added this to the v1.8 milestone Sep 4, 2017
@shyamjvs shyamjvs self-assigned this Sep 4, 2017
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Sep 4, 2017
@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

I'm trying to find the offending PR(s). The diff across those runs is huge.
Luckily this is seen even on relatively smaller clusters, e.g. kubemark-500 - https://k8s-testgrid.appspot.com/sig-scalability#kubemark-500

@wojtek-t
Member

wojtek-t commented Sep 4, 2017

My strong feeling is that it may be related to pagination.

If I'm right, #51876 should hopefully fix the problem.

@smarterclayton

@wojtek-t
Member

wojtek-t commented Sep 4, 2017

But this is mostly a guess and it may also be something different.

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

@kubernetes/test-infra-maintainers We don't have any logs (except the build log from Jenkins) for run 26 of ci-kubernetes-e2e-gce-scale-performance because Jenkins crashed midway through the test:

ERROR: Connection was broken: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2351)
	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2820)

Such failures (especially ones right at the end of a run) hit us really hard when it comes to debugging.

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

@wojtek-t We're also seeing an increase in the PATCH latencies, so there's probably another regression too?

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

Until run 8132, kubemark-500 seemed fine:

    {
      "data": {
        "Perc50": 31.303,
        "Perc90": 39.326,
        "Perc99": 59.415
      },
      "unit": "ms",
      "labels": {
        "Count": "43",
        "Resource": "nodes",
        "Subresource": "",
        "Verb": "LIST"
      }
    },

We had some startup failures in between, and it seems to have shot up starting from run 8141:

    {
      "data": {
        "Perc50": 213.34,
        "Perc90": 608.223,
        "Perc99": 1014.937
      },
      "unit": "ms",
      "labels": {
        "Count": "53",
        "Resource": "nodes",
        "Subresource": "",
        "Verb": "LIST"
      }
    },

Looking at the diff right now.

@wojtek-t
Member

wojtek-t commented Sep 4, 2017

Another possibility (that you mentioned to me offline) is:
#48287

I initially doubted it, but now that I think about it, maybe it's about memory allocations (I didn't look carefully, but maybe they're much higher with json-iter)?

@thockin - FYI (still just my guess)
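One quick way to test that guess would be an allocation benchmark comparing encoding/json with json-iterator on a representative payload. A sketch to drop into a _test.go file (samplePod here is just a stand-in, not what the apiserver actually serializes):

```go
package jsonbench

import (
	"encoding/json"
	"testing"

	jsoniter "github.com/json-iterator/go"
)

// samplePod is a stand-in payload; a real comparison would use serialized API objects.
var samplePod = []byte(`{"metadata":{"name":"test","namespace":"default"},"spec":{"nodeName":"node-1"}}`)

func BenchmarkStdlibUnmarshal(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		var out map[string]interface{}
		if err := json.Unmarshal(samplePod, &out); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkJSONIterUnmarshal(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		var out map[string]interface{}
		if err := jsoniter.ConfigCompatibleWithStandardLibrary.Unmarshal(samplePod, &out); err != nil {
			b.Fatal(err)
		}
	}
}
```

Running it with `go test -bench . -benchmem` and comparing the allocs/op and B/op columns would confirm or rule this out.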

@wojtek-t
Member

wojtek-t commented Sep 4, 2017

Anyway - @shyamjvs we need more data about which other metrics have changed (CPU usage, allocations, number of API calls, ...)

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

After eliminating trivial and unrelated PRs, the following are left:

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

> Anyway - @shyamjvs we need more data about which other metrics have changed (CPU usage, allocations, number of API calls, ...)

Yup.. I'm doing some local experiments to verify this. Also merged #51892 to start getting kubemark master logs.

@gmarek
Contributor

gmarek commented Sep 4, 2017

Can you grab a CPU profile from the master while running the test? It should help you figure out the cause.
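For reference, one way to grab it is through the apiserver's pprof endpoint (assuming --profiling is enabled on the master). A hedged client-go sketch; the DoRaw signature differs across client-go versions:

```go
package main

import (
	"context"
	"io/ioutil"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Collect a 30-second CPU profile from the apiserver's pprof endpoint.
	data, err := client.CoreV1().RESTClient().Get().
		AbsPath("/debug/pprof/profile").
		Param("seconds", "30").
		DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	if err := ioutil.WriteFile("apiserver-cpu.pprof", data, 0644); err != nil {
		panic(err)
	}
	// Inspect with: go tool pprof apiserver-cpu.pprof
}
```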

@gmarek
Contributor

gmarek commented Sep 4, 2017

Adding release team: @jdumars @calebamiles @spiffxp

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

So I fetched the CPU & memory usage for the load test from the last >200 runs of kubemark-500, and I don't see any noticeable increase. It actually appears to have improved:

[graph: CPU usage across kubemark-500 runs]

[graph: memory usage across kubemark-500 runs]

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

There are some consecutive failed runs (due to test-infra bugs / regressions) leading to missing points or temporary spikes... Ignore the noise.

@shyamjvs
Member Author

shyamjvs commented Sep 4, 2017

I ran the tests locally on kubemark-500 against head (commit ffed1d3) and it went through fine, with low latency values like before. Re-running to see if it's flaky. I'll continue digging into it tomorrow.

@smarterclayton
Contributor

smarterclayton commented Sep 5, 2017 via email

@shyamjvs
Member Author

shyamjvs commented Sep 5, 2017

Ok... So the reason for kubemark failing is different. As @wojtek-t pointed out, the kubemark master is not sized correctly.
I recently made some changes in test-infra around auto-calculation of the kubemark master size, and it seems they didn't take effect.

@shyamjvs
Member Author

shyamjvs commented Sep 5, 2017

Btw - that also explains why the failure wasn't reproducing locally for me. Because I bypassed all the test-infra logic by starting the clusters manually, the master size was calculated correctly.

@shyamjvs
Member Author

shyamjvs commented Sep 5, 2017

@smarterclayton Thanks for the explanation. I checked that neither kubemark nor our real cluster tests are using pagination (?limit=1 still returns all the results).
So the real 5k-node test failures seem unrelated to it.
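For the record, a simple way to check is to list with a limit and see whether the server honors it and returns a continue token. A sketch with current client-go signatures:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Ask for a single pod. A server with API chunking enabled returns one item
	// plus a continue token; a server ignoring the limit returns everything.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{Limit: 1})
	if err != nil {
		panic(err)
	}
	fmt.Printf("items=%d continue=%q\n", len(pods.Items), pods.Continue)
}
```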

@liggitt
Member

liggitt commented Sep 28, 2017

cc @dashpole @vishh @dchen1107
from the introducing PR (#50350)

@dashpole
Contributor

I can post a PR to revert the key portion of #50350.

@jdumars
Member

jdumars commented Sep 28, 2017

@dashpole thank you!

@shyamjvs
Member Author

I guess we should also wait until the fix is tested (add ~2 more hours for that).

@dashpole
Contributor

PR posted: #53210

@dchen1107
Member

@dashpole and I had an offline discussion, and we plan to revert a portion of #50350. The original PR was introduced because, without that change, the kubelet disk eviction manager might unnecessarily delete more containers.

We should improve disk management with more intelligence to take proactive action, instead of relying on the periodic GC.

@jdumars
Member

jdumars commented Sep 28, 2017

While this is in the 1.9 milestone, there is every expectation that this will be resolved at the earliest responsible moment in a 1.8 patch release. The full decision-making process around this can be viewed at https://youtu.be/r6D5DNel2l8

@jpbetz
Contributor

jpbetz commented Oct 2, 2017

Any updates on this?

@wojtek-t
Member

wojtek-t commented Oct 2, 2017

We already have a fix in flight - the PR fixing this is #53233

k8s-github-robot pushed a commit that referenced this issue Oct 3, 2017
Automatic merge from submit-queue (batch tested with PRs 51765, 53053, 52771, 52860, 53284). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add audit-logging, feature-gates & few admission plugins to kubemark

To make kubemark match real cluster settings. Also includes a few other settings like request-timeout, etcd-quorum, etc.

Fixes #53021
Related #51899 #44701

cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek @smarterclayton
k8s-github-robot pushed a commit that referenced this issue Oct 4, 2017
Automatic merge from submit-queue (batch tested with PRs 53403, 53233). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Remove containers from deleted pods once containers have exited

Issue #51899 
Since container deletion is currently done through periodic garbage collection every 30 seconds, it takes a long time for pods to be deleted, and causes the kubelet to send all delete pod requests at the same time, which has performance issues.  This PR makes the kubelet actively remove containers of deleted pods rather than wait for them to be removed in periodic garbage collection.

/release-note-none
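A hypothetical sketch of that idea (illustrative names only, not the actual kubelet symbols): remove a deleted pod's containers as soon as they have all exited, so each kubelet issues its final DELETE pod call on its own schedule instead of right after the shared 30-second GC tick:

```go
package main

import "fmt"

// containerRuntime is a stand-in for the kubelet's container runtime interface.
type containerRuntime interface {
	RemoveContainer(id string) error
}

type fakeRuntime struct{}

func (fakeRuntime) RemoveContainer(id string) error {
	fmt.Println("removed container", id)
	return nil
}

// removeExitedContainers would be called once every container of a deleted pod
// has exited. Cleaning up here, rather than on the periodic GC pass, lets the
// pod object be deleted immediately, so the final DELETE calls from different
// kubelets spread out in time instead of piling up right after the GC tick.
func removeExitedContainers(rt containerRuntime, exitedContainerIDs []string) {
	for _, id := range exitedContainerIDs {
		if err := rt.RemoveContainer(id); err != nil {
			// Anything that fails here is still reclaimed by the periodic GC.
			continue
		}
	}
}

func main() {
	removeExitedContainers(fakeRuntime{}, []string{"c1", "c2"})
}
```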
@liggitt
Member

liggitt commented Oct 4, 2017

#53233 is merged to master. The pick for 1.8.1 is open at #53422.

k8s-github-robot pushed a commit that referenced this issue Oct 4, 2017
…3-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #53233

Fixes #51899

Cherry pick of #53233 on release-1.8.

#53233: remove containers of deleted pods once all containers have

```release-note
Fixes a performance issue (#51899) identified in large-scale clusters when deleting thousands of pods simultaneously across hundreds of nodes, by actively removing containers of deleted pods, rather than waiting for periodic garbage collection and batching resulting pod API deletion requests.
```
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Needs Approval

@dashpole @shyamjvs @kubernetes/sig-api-machinery-bugs @kubernetes/sig-node-bugs @kubernetes/sig-scalability-bugs

Action required: This issue must have the status/approved-for-milestone label applied by a SIG maintainer.

Issue Labels
  • sig/api-machinery sig/node sig/scalability: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@dashpole
Contributor

dashpole commented Oct 5, 2017

/close
Closed via #53422. The fix will be in 1.8.1.

sttts pushed a commit to sttts/apiserver that referenced this issue Oct 13, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Modify traces in deletion handler

Ref kubernetes/kubernetes#51899 (comment)

cc @kubernetes/sig-release-members @jdumars @dims Can we get this into 1.8?

Kubernetes-commit: 5952e932e9a1b593876c1504df5d1cb3fd72299d