
etcd compaction stopped. #5936

Closed
zapman449 opened this issue Oct 13, 2018 · 9 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@zapman449

1. What kops version are you running? The command kops version will display
this information.

Version 1.10.0 (git-8b52ea6d1)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T09:56:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
Left a functioning cluster alone for a few days. Came back to errors running kubectl apply -f foo.yaml:

Error from server: error when creating "kube_cluster_metrics_rbac.yaml": etcdserver: mvcc: database space exceeded
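
For reference, this error typically means etcd has hit its backend quota and raised the NOSPACE alarm. A minimal check, assuming ETCDCTL_API=3 and the endpoint below (the port and lack of TLS flags are assumptions; adjust for how kops exposes etcd on the masters):

# Confirm whether the NOSPACE alarm is raised and inspect the current DB size.
$ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:4001 alarm list
$ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:4001 endpoint status --write-out=table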

5. What happened after the commands executed?
apply failed

6. What did you expect to happen?
apply to succeed

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

Relevant etcd bits:

  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: -2a
    - instanceGroup: master-us-west-2c
      name: -2c
    - instanceGroup: master-us-west-2b
      name: -2b
    manager:
      image: kopeio/etcd-manager:latest
    name: main
    version: 3.2.18
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: -2a
    - instanceGroup: master-us-west-2c
      name: -2c
    - instanceGroup: master-us-west-2b
      name: -2b
    manager:
      image: kopeio/etcd-manager:latest
    name: events
    version: 3.2.18

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

Highly relevant ticket #4005
This might be a duplicate.

Logs show compactions every 5 minutes, until they suddenly stopped 7 days ago:

$ grep compact /var/log/etcd.log | tail -30
2018-10-05 13:38:41.404362 I | mvcc: store.index: compact 153466
2018-10-05 13:38:41.405919 I | mvcc: finished scheduled compaction at 153466 (took 1.127869ms)
2018-10-05 13:43:41.410039 I | mvcc: store.index: compact 153984
2018-10-05 13:43:41.411714 I | mvcc: finished scheduled compaction at 153984 (took 1.234292ms)
2018-10-05 13:48:41.415629 I | mvcc: store.index: compact 154503
2018-10-05 13:48:41.417139 I | mvcc: finished scheduled compaction at 154503 (took 1.141945ms)
2018-10-05 13:53:41.422599 I | mvcc: store.index: compact 155024
2018-10-05 13:53:41.424172 I | mvcc: finished scheduled compaction at 155024 (took 1.195964ms)
2018-10-05 13:58:41.429730 I | mvcc: store.index: compact 155542
2018-10-05 13:58:41.431275 I | mvcc: finished scheduled compaction at 155542 (took 1.143702ms)
2018-10-05 14:03:41.436501 I | mvcc: store.index: compact 156062
2018-10-05 14:03:41.438193 I | mvcc: finished scheduled compaction at 156062 (took 1.208913ms)
2018-10-05 14:08:41.442659 I | mvcc: store.index: compact 156580
2018-10-05 14:08:41.444193 I | mvcc: finished scheduled compaction at 156580 (took 1.13716ms)
2018-10-05 20:30:31.268294 I | etcdserver: compacted raft log at 195002
2018-10-06 13:38:10.697073 I | etcdserver: compacted raft log at 295003
2018-10-07 07:13:22.475704 I | etcdserver: compacted raft log at 395004
2018-10-08 01:45:50.962248 I | etcdserver: compacted raft log at 495005
2018-10-08 20:21:48.255609 I | etcdserver: compacted raft log at 595006
2018-10-09 14:56:39.884253 I | etcdserver: compacted raft log at 695007
2018-10-10 09:32:16.886954 I | etcdserver: compacted raft log at 795008
2018-10-10 21:52:54.004659 I | etcdserver: compacted raft log at 895009
2018-10-11 05:58:08.475681 I | etcdserver: compacted raft log at 995010
2018-10-11 14:03:12.917882 I | etcdserver: compacted raft log at 1095011
2018-10-11 22:06:49.750069 I | etcdserver: compacted raft log at 1195012
2018-10-12 06:12:09.806502 I | etcdserver: compacted raft log at 1295013
2018-10-12 14:17:41.242084 I | etcdserver: compacted raft log at 1395014
2018-10-12 22:21:29.007856 I | etcdserver: compacted raft log at 1495015
2018-10-12 23:43:09.039160 I | mvcc: store.index: compact 812548
2018-10-12 23:43:31.607978 I | mvcc: finished scheduled compaction at 812548 (took 22.568123354s)

Following a modified version of the steps in #4005, I was able to get one compaction going (that last log entry), but since then no further compactions have been scheduled, and I'm not seeing the periodic compactions from kube-apiserver that I expect.
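
For reference, the recovery sequence described in the etcd space-quota documentation looks roughly like the sketch below; the endpoint is an assumption and may need TLS flags depending on the deployment:

# Find the current revision, compact up to it, defragment, then clear the alarm.
$ rev=$(ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:4001 endpoint status --write-out=json \
    | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]+')
$ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:4001 compact "$rev"
$ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:4001 defrag
$ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:4001 alarm disarm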

@zapman449
Author

So, I've hit this again, this time with a cluster less than 24 hours old. The fix from #4005 did not work this time; it turns out the DNS entries in Route 53 were pointing to the placeholder IP of 203.0.113.123.

I manually updated the 6 DNS entries in Route 53, and the etcd cluster (and Kubernetes cluster) resumed functioning.
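
A quick way to spot those placeholder records, assuming the AWS CLI is configured and ZONE_ID holds the Route 53 hosted zone ID (both are placeholders for illustration):

# List A records in the hosted zone and flag anything still on the placeholder IP.
$ aws route53 list-resource-record-sets --hosted-zone-id "$ZONE_ID" \
    --query "ResourceRecordSets[?Type=='A']" --output json | grep -B 3 '203.0.113.123'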

@zapman449
Author

zapman449 commented Oct 23, 2018

Extra wrinkle: kube-apiserver is restarting every minute, presumably due to this etcd issue.

Update: the kube-apiserver restarts turned out to be self-inflicted and are not related to this issue.

@zapman449
Author

and this certainly isn't helping:

$ sudo grep -v " got ping " /var/log/etcd.log | grep -v " AWS API Request:" | grep -v "updating hosts: map" | grep -v "we are not leader" | grep -v "starting controller iteration" | grep -v -- "peers.*peer.*etcd--2a" | grep -- "10-23 14:"
2018-10-23 14:03:46.757352 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 44.786295ms)
2018-10-23 14:03:46.757559 W | etcdserver: server is likely overloaded
2018-10-23 14:03:46.757593 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 45.037746ms)
2018-10-23 14:03:46.757616 W | etcdserver: server is likely overloaded
2018-10-23 14:12:47.670713 W | etcdserver: apply entries took too long [119.490314ms for 1 entries]
2018-10-23 14:12:47.670762 W | etcdserver: avoid queries with large range/delete range!
2018-10-23 14:12:54.029650 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.582398ms)
2018-10-23 14:12:54.030660 W | etcdserver: server is likely overloaded
2018-10-23 14:12:54.030747 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 44.687879ms)
2018-10-23 14:12:54.030904 W | etcdserver: server is likely overloaded
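
Those heartbeat/overload warnings usually point at slow disk or an overloaded member. A quick sanity check, again assuming etcdctl v3 on the master and the same endpoint assumption as above:

# Per-member health, plus status showing DB size, leader, and raft term.
$ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:4001 endpoint health
$ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:4001 endpoint status --write-out=table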

@zapman449
Author

Provisional finding: Kubernetes 1.10 / kops 1.10 does not play nicely with etcd 3.2.18 (or etcd 3.2.24); etcd 3.1.12 is needed instead. This was derived from https://github.com/kopeio/etcd-manager/blob/f4db9a739e90833cb8f7152f58535e30c84d33a0/images/BUILD#L13-L18 .
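
If pinning the etcd version, the usual kops workflow is sketched below; CLUSTER_NAME and the state store are placeholders, and whether an in-place downgrade of an existing etcd data directory is safe is a separate question:

# Assumes KOPS_STATE_STORE and CLUSTER_NAME are set (placeholders).
# Change etcdClusters[*].version (e.g. 3.2.18 -> 3.1.12) in the editor, then apply and roll the masters.
$ kops edit cluster $CLUSTER_NAME
$ kops update cluster $CLUSTER_NAME --yes
$ kops rolling-update cluster $CLUSTER_NAME --yes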

@zapman449
Author

and... I'm in a compaction error again.

@zapman449 zapman449 reopened this Oct 24, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 21, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
