Struggle with etcd timeout, help #9159
How do you know? Is that an HDD?
From what I can see in the logs you provided, the leader is elected. The client requests timed out because each request takes a long time to execute (more than 10 seconds). You probably need to figure out why expensive requests are being sent to etcd.
This was a config mistake; I changed it to 8589934592. @hexfusion
Check out #9111.
etcd serializes requests. You might have expensive requests waiting in the queue.
What kind of requests? Do you mean the body is huge, or maybe it is a transaction? @xiang90
I compiled the master branch and ran a single-node cluster from the snapshot, with no reads or writes at all. I found that updates still time out. There are lots of
As the error suggests, you are trying to revoke a lease that does not exist. I have no idea why your code does that so aggressively. I am not sure whether this is the reason for the timeout either.
But there is no program accessing the single etcd cluster that was created from the snapshot. Or do you mean the etcd snapshot stores the earlier user requests? That seems a little unreasonable. So I guess it is etcd itself that does the revoking, because the leases are out of date. But why an outdated lease is not found is still a question. And after the revoke requests fail, I think they should not be retried forever; otherwise the queue would be filled with revoke requests and lead to timeouts.
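For context, here is a minimal sketch of how the "lease not found" error surfaces through the Go clientv3 API when revoking a lease that was never granted or has already expired. The endpoint address, the lease ID, and the coreos/etcd import paths of the 3.2 era are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/etcdserver/api/v3rpc/rpctypes"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Revoking a lease ID that was never granted (or has already expired)
	// is rejected by the server with a "lease not found" error.
	_, err = cli.Revoke(ctx, clientv3.LeaseID(0xdeadbeef)) // hypothetical lease ID
	if err == rpctypes.ErrLeaseNotFound {
		fmt.Println("lease not found:", err)
	} else if err != nil {
		fmt.Println("revoke failed:", err)
	}
}
```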
You did not mention this before; I thought there were clients accessing the new etcd cluster. I am not sure why etcd would still be processing requests when there is no client at all. Can you confirm that is the case? We have never heard of this before. If you believe this is the case, we would like to take a look if you can share your snapshot file and tell us how to reproduce the issue you hit.
Yes, I am sure there is no client. Can heavy lease expiration affect client requests? I'm sure there are no client requests; the revoke requests come from etcd itself.
It can, but we fixed it by jittering the revoke time. There is also a bug report in the k8s repo. What I do not understand is why etcd reports
Lease revocation without affecting client requests was fixed in 3.2.2: https://github.com/coreos/etcd/blob/master/CHANGELOG.md#improved-2. However, I am not sure whether you hit the same issue I mentioned above around
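As a standalone sketch of the jitter idea (not etcd's internal implementation): spreading lease expirations over a small random window keeps thousands of leases granted at the same moment from all coming due, and being revoked, in the same tick. The function name and the 10% jitter fraction below are illustrative only.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredExpiry stretches a nominal TTL by a random fraction so that
// expirations (and the revokes they trigger) are spread out over time
// instead of arriving in one burst.
func jitteredExpiry(ttl time.Duration, jitterFraction float64) time.Duration {
	jitter := time.Duration(rand.Float64() * jitterFraction * float64(ttl))
	return ttl + jitter
}

func main() {
	// Five leases granted at the same instant now expire at slightly
	// different times instead of all at exactly 10 seconds.
	for i := 0; i < 5; i++ {
		fmt.Println(jitteredExpiry(10*time.Second, 0.1))
	}
}
```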
The snapshot is a little large, about 1.3 GB. Is there a good way to share it? @xiang90
@ximenzaoshi Could you upload it somewhere secure and send the URL over email to me at [email protected]? Thanks.
@xiang90
I still do not understand this. Also, as I mentioned, k8s should not create this many leases; there is an issue in the k8s repo.
/cc @jpbetz
Yes, k8s should not create so many leases; we will fix that problem later. The "not found" problem may also be caused by the huge number of leases: there could be two revoke requests for the same lease in the queue. Maybe that is the reason? @xiang90
#9229 won't solve this problem. Related to #9360.
Yes, you are right: lease Grant and Revoke will still contend on the same write lock. Splitting leases into different timespans may help? In any case, this many leases cannot be a very common case.
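On the client side, one way to keep the lease count (and revoke churn) down is to attach many keys to a single shared lease instead of granting one lease per key. A hedged clientv3 sketch, with a placeholder endpoint, key prefix, and TTL; this is only an illustration of the idea, not a description of what k8s does.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant one lease and attach many keys to it, rather than one lease
	// per key. Expiry then revokes a single lease instead of thousands.
	lease, err := cli.Grant(ctx, 60) // 60s TTL, illustrative
	if err != nil {
		panic(err)
	}
	for i := 0; i < 1000; i++ {
		key := fmt.Sprintf("/registry/example/%d", i) // hypothetical key prefix
		if _, err := cli.Put(ctx, key, "value", clientv3.WithLease(lease.ID)); err != nil {
			panic(err)
		}
	}

	// A single keep-alive refreshes every key bound to the shared lease.
	if _, err := cli.KeepAliveOnce(ctx, lease.ID); err != nil {
		panic(err)
	}
}
```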
@ximenzaoshi I am planning to investigate this code path further. Let's move this discussion to #9360.
How did you resolve this issue? I am facing a similar issue.
We use k8s in production with about 300 nodes. Several days ago, the etcd cluster became abnormal: no leader could be elected and all client requests timed out.
Here is the node config:
etcd version 3.2.14
Here is the master node endpoint status:
The master node log shows that queries take too long...
Here are the master node log and metrics files:
master-node.log
master-node-metrics.log
Disk, memory, and network are all OK. The problem is strange and caused a disaster for us.
If you need more info, please tell me.
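For anyone hitting the same symptoms, a hedged sketch of pulling per-member status through clientv3: Leader and RaftTerm show whether the members agree on a leader, and DbSize hints at an oversized backend. The member addresses below are placeholders and the import path assumes the coreos/etcd 3.2 line.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Hypothetical member client URLs.
	endpoints := []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Query each member directly to compare leader, term, and DB size.
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			fmt.Printf("%s: %v\n", ep, err)
			continue
		}
		fmt.Printf("%s: leader=%x term=%d dbSize=%d\n", ep, st.Leader, st.RaftTerm, st.DbSize)
	}
}
```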