Struggle with etcd timeout, help #9159

Closed
ximenzaoshi opened this issue Jan 17, 2018 · 25 comments


@ximenzaoshi

ximenzaoshi commented Jan 17, 2018

We use k8s in production, with about 300 nodes. Several days ago, the etcd cluster became abnormal: no leader could be elected and all client requests timed out.
Here is the node config:

/data/bin/etcd-3.2
    --name betcd05
    --initial-cluster betcd05=http://tzk05:3380,betcd03=http://tzk03:3380,betcd01=http://tzk01:3380,betcd04=http://tzk04:3380,betcd02=http://tzk02:3380
    --initial-advertise-peer-urls http://10.3.8.18:3380
    --data-dir /data/data/etcd-bay4
    --wal-dir /data/log/etcd-bay4
    --listen-peer-urls http://10.3.8.18:3380
    --listen-client-urls http://10.3.8.18:3379,http://127.0.0.1:3379
    --advertise-client-urls http://10.3.8.18:3379
    --initial-cluster-state existing
    --quota-backend-bytes=8589934592
    --heartbeat-interval '1000'
    --election-timeout '10000'

etcd version 3.2.14
Here is the master node endpoint status:

ETCDCTL_API=3 etcdctl --endpoints=tzk01:3379 endpoint status
tzk01:3379, 6c31ba6970ba5e26, 3.2.14, 3.4 GB, true, 346, 123773332
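
For reference, the unlabeled columns above are endpoint, member ID, version, DB size, leader flag, raft term, and raft index; the same command with table output labels them (just an illustrative variant of the command above):

ETCDCTL_API=3 etcdctl --endpoints=tzk01:3379 endpoint status --write-out=table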

The master node log shows that queries take too long:

2018-01-17 18:51:46.388490 W | etcdserver: apply entries took too long [31.038059099s for 1 entries]
2018-01-17 18:51:46.388505 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:51:59.666084 W | etcdserver: apply entries took too long [12.772745279s for 1 entries]
2018-01-17 18:51:59.666101 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:52:13.236024 W | etcdserver: apply entries took too long [13.016438597s for 1 entries]
2018-01-17 18:52:13.236046 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:52:26.511690 W | etcdserver: apply entries took too long [12.756021442s for 1 entries]
2018-01-17 18:52:26.511711 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:52:39.861618 W | etcdserver: apply entries took too long [12.822592077s for 1 entries]
2018-01-17 18:52:39.861636 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:52:54.475787 W | etcdserver: apply entries took too long [14.083658231s for 1 entries]
2018-01-17 18:52:54.475806 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:53:07.687890 W | etcdserver: apply entries took too long [12.71080088s for 1 entries]
2018-01-17 18:53:07.687908 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:53:20.919569 W | etcdserver: apply entries took too long [12.708351937s for 1 entries]
2018-01-17 18:53:20.919585 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:53:21.169769 W | etcdserver: apply entries took too long [103.492288ms for 1 entries]
2018-01-17 18:53:21.169795 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:53:40.588217 W | etcdserver: apply entries took too long [19.100170034s for 1 entries]
2018-01-17 18:53:40.588232 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:53:55.389231 W | etcdserver: apply entries took too long [14.270018598s for 1 entries]
2018-01-17 18:53:55.389248 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:53:55.408950 I | mvcc: store.index: compact 2476681628
2018-01-17 18:53:55.836416 W | etcdserver: apply entries took too long [427.854965ms for 1 entries]
2018-01-17 18:53:55.836435 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:54:02.478435 I | mvcc: finished scheduled compaction at 2476681628 (took 6.641971303s)
2018-01-17 18:54:09.624533 W | etcdserver: apply entries took too long [13.735092749s for 1 entries]
2018-01-17 18:54:09.624553 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:54:24.352842 W | etcdserver: apply entries took too long [14.197159572s for 1 entries]
2018-01-17 18:54:24.352859 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:54:38.258577 W | etcdserver: apply entries took too long [13.372680952s for 1 entries]
2018-01-17 18:54:38.258593 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:54:52.069577 W | etcdserver: apply entries took too long [13.249941442s for 1 entries]
2018-01-17 18:54:52.069595 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:55:06.132372 W | etcdserver: apply entries took too long [13.491814966s for 1 entries]
2018-01-17 18:55:06.132391 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:55:21.022463 I | wal: segmented wal file /data/log/etcd-bay4/0000000000000405-000000000760e9ba.wal is created
2018-01-17 18:55:25.017074 W | etcdserver: apply entries took too long [18.367109561s for 1 entries]
2018-01-17 18:55:25.017427 W | etcdserver: avoid queries with large range/delete range!
2018-01-17 18:55:38.776670 W | etcdserver: apply entries took too long [13.190163366s for 1 entries]
2018-01-17 18:55:38.776686 W | etcdserver: avoid queries with large range/delete range!

Here are the master node log and metrics files:
master-node.log
master-node-metrics.log

Disk, memory, and network all look OK. The problem is strange and has been a disaster for us.
If you need more info, please tell me.

@hexfusion
Contributor

One thing I notice in your master-node logs is this:

2018-01-16 00:21:31.092323 W | etcdserver: backend quota 167772160000 exceeds maximum recommended quota 8589934592

/cc @gyuho @xiang90

@gyuho
Contributor

gyuho commented Jan 17, 2018

Disk, memory, and network all look OK.

How do you know? Is that an HDD?

@xiang90
Contributor

xiang90 commented Jan 17, 2018

no leader could be elected and all client requests timed out.

From what I can see in the logs you provided, a leader is elected. The client requests time out because each request takes a long time to execute (more than 10 seconds).

You probably need to figure out why expensive requests are being sent to etcd.
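
One quick way to look for evidence of slow applies is the Prometheus metrics etcd exposes on the client URL (a sketch; the endpoint is taken from the config above, and the grep only matches these standard etcd metrics if they are present in this version):

curl -s http://10.3.8.18:3379/metrics | grep -E 'etcd_server_slow_apply_total|etcd_disk_backend_commit_duration_seconds'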

@ximenzaoshi
Author

That was a config mistake; I have changed it to 8589934592. @hexfusion
We have host monitoring. @gyuho
After changing the election settings (heartbeat 1000ms, election timeout 10000ms, as you can see above), a leader can be elected; it could not be elected before the change. But the most important thing now is the timed-out requests. Thank you for your advice; I tried, but with no result. Are there any tools for this, or can etcd tell which key took so much time? @xiang90

@xiang90
Contributor

xiang90 commented Jan 17, 2018

Are there any tools for this, or can etcd tell which key took so much time?

Check out #9111.

@ximenzaoshi
Author

ETCDCTL_API=3 etcdctl --endpoints=localhost:3379 put foo bar --command-timeout=100s
This update took about 12s. As far as I can see, any update operation takes about the same time now...
@xiang90

@xiang90
Contributor

xiang90 commented Jan 17, 2018

@ximenzaoshi

etcd serializes requests. You might have an expensive request waiting in the queue.
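
One way to see whether proposals are piling up in that queue is the etcd_server_proposals_pending gauge (a sketch; assumes the /metrics endpoint is reachable on the local client URL from the config above):

curl -s http://127.0.0.1:3379/metrics | grep etcd_server_proposals_pending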

@ximenzaoshi
Author

What kind of requests? Do you mean the body is huge, or perhaps a transaction? @xiang90

@ximenzaoshi
Author

I compiled the master branch and ran a single-node cluster from the snapshot, with no writes or reads at all. I found that updates still time out. There are lots of logs like 2018-01-19 10:07:38.538741 W | etcdserver: failed to revoke 754f60fad2005caf ("lease not found"). All the pending proposals are related to lease revoking.
So why can the leases being revoked not be found, and is this the reason for the timeouts?
@xiang90
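
For anyone who wants to set up a similar single-member cluster from a snapshot, a minimal sketch (the snapshot file name snap.db, member name m1, and ports are hypothetical; assumes the snapshot was taken with etcdctl snapshot save):

ETCDCTL_API=3 etcdctl snapshot restore snap.db \
    --name m1 \
    --initial-cluster m1=http://127.0.0.1:2380 \
    --initial-advertise-peer-urls http://127.0.0.1:2380 \
    --data-dir ./m1.etcd

etcd --name m1 \
    --data-dir ./m1.etcd \
    --listen-peer-urls http://127.0.0.1:2380 \
    --initial-advertise-peer-urls http://127.0.0.1:2380 \
    --listen-client-urls http://127.0.0.1:2379 \
    --advertise-client-urls http://127.0.0.1:2379 \
    --initial-cluster m1=http://127.0.0.1:2380 \
    --initial-cluster-state new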

@xiang90
Contributor

xiang90 commented Jan 19, 2018

@ximenzaoshi

As the error suggests, you are trying to revoke a lease that does not exist. I have no idea why your code tries to do that so aggressively. I am not sure whether this is the reason for the timeouts either.

@ximenzaoshi
Author

ximenzaoshi commented Jan 19, 2018

But no program is accessing the single-node etcd cluster created from the snapshot. Or do you mean the etcd snapshot stores earlier user requests? That seems a little unreasonable. So I guess it is etcd itself doing the revoking because the leases have expired. But why the expired leases cannot be found is still a question. And after a revoke request fails, I think it should not be retried forever, or the queue fills up with revoke requests and that leads to timeouts.
Anyway, why etcd would try to revoke a lease that is not found is the original question. @xiang90

@xiang90
Contributor

xiang90 commented Jan 19, 2018

But no program is accessing the single-node etcd cluster created from the snapshot.

You did not mention this before. I thought there were clients accessing the new etcd cluster.

I am not sure why etcd would still be processing requests when there is no client at all. Can you confirm that is the case? We have never heard of this before.

If you believe this is the case, we would like to take a look, provided you can share your snapshot file and tell us how to reproduce the issue you hit.

@ximenzaoshi
Author

Yes, I am sure there is no client. Can heavy lease expiration affect client requests? I'm sure there are no client requests; the revoke requests come from etcd itself.
Maybe it is a version problem? The original snapshot was created by version 3.1.9, and I used the latest master branch to test. Are there any important recent changes around leases?
The snapshot will be provided later.

@xiang90
Contributor

xiang90 commented Jan 19, 2018

Can heavy lease expiration affect client requests?

It can, but we fixed it by jittering the revoke time. There is also a bug report in the k8s repo.

What I do not understand is why etcd reports "lease not found" if the revoking comes from etcd itself.

@xiang90
Contributor

xiang90 commented Jan 19, 2018

Lease revoking without affecting client requests was fixed in 3.2.2: https://github.com/coreos/etcd/blob/master/CHANGELOG.md#improved-2.

However, I am not sure whether you are hitting the same issue, given the "lease not found" errors mentioned above.

@ximenzaoshi
Author

The snapshot is a little large, about 1.3G. Is there a good way to share it? @xiang90

@xiang90
Contributor

xiang90 commented Jan 19, 2018

@ximenzaoshi Could you upload it somewhere secure and send the URL to me by email at [email protected]? Thanks.

@ximenzaoshi
Author

@xiang90
Problem solved. Thanks for your help.
Kubernetes created far too many leases in etcd, about 20,000K (roughly 20 million). Creation is faster than revocation, so after a long time the huge number of leases becomes a big problem: etcd spends too much time ranging over the lease map, which holds a lock that heavily affects updates.
Creating a huge number of leases in etcd reproduces our problem.
I made a PR (#9226) for this problem.
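
A rough sketch of the reproduction described above, against a disposable test cluster (the endpoint, lease count, and TTL are illustrative; several such loops can be run in parallel so that granting outpaces revocation):

for i in $(seq 1 200000); do
    ETCDCTL_API=3 etcdctl --endpoints=localhost:2379 lease grant 3600 >/dev/null
done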

@xiang90
Contributor

xiang90 commented Jan 25, 2018

What I do not understand is why etcd reports "lease not found" if the revoking comes from etcd itself.

I still do not understand this. Also, as I mentioned, k8s should not create this many leases; there is an issue in the k8s repo.

@xiang90
Contributor

xiang90 commented Jan 25, 2018

/cc @jpbetz

@ximenzaoshi
Author

Yes, k8s should not create so many leases; we will fix that later. The "not found" problem may also be caused by the huge number of leases: there can be two revoke requests for the same lease in the queue. Maybe that is the reason? @xiang90

@gyuho
Contributor

gyuho commented Feb 27, 2018

Kubernetes created far too many leases in etcd, about 20,000K (roughly 20 million). Creation is faster than revocation, so after a long time the huge number of leases becomes a big problem: etcd spends too much time ranging over the lease map, which holds a lock that heavily affects updates. Creating a huge number of leases in etcd reproduces our problem.

#9229 won't solve this problem.
It only makes lease lookup calls non-blocking when no grant or revoke routine is running.
Lease Grant and Revoke will still contend on the same write lock.

Related to #9360.

@ximenzaoshi
Author

Yes, you are right: lease Grant and Revoke will still contend on the same write lock. Might splitting leases into different timespan buckets help? And having this many leases cannot be a very common case.

@gyuho
Contributor

gyuho commented Mar 1, 2018

@ximenzaoshi I am planning to investigate this code path further. Let's move this discussion to #9360.

@kannanvr

How did you resolve this issue? I am facing a similar issue.
