Removed etcd member failed to stop on stuck disk #14338
Comments
From reading the code, it seems the etcd server was stuck at lines 1020 to 1025 in 72d3e38.
It was stuck there because of what the v2 health check handler returned; it looks like fsync was stuck in the middle. Looked into it further, and I expect that despite disk failures/latency, etcd should still be able to stop. Much appreciate any insights, thanks!!
Okay, I got a repro working by injecting a sleep at the raftAfterSave failpoint followed by member removal. The FIFO scheduler stop is the culprit; it gets stuck at lines 119 to 125 in 72d3e38.
Will dig in a little deeper to understand why.
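This is not etcd's actual pkg/schedule code, just a minimal Go sketch of the failure shape: a FIFO scheduler whose Stop waits unconditionally for the in-flight job hangs forever once that job is blocked on a stuck disk write.

```go
package main

import (
	"fmt"
	"time"
)

// fifo is a toy FIFO scheduler: jobs run one at a time, and Stop waits
// for the worker to finish every scheduled job before returning.
type fifo struct {
	jobs     chan func()
	finished chan struct{}
}

func newFIFO() *fifo {
	f := &fifo{jobs: make(chan func(), 16), finished: make(chan struct{})}
	go func() {
		defer close(f.finished)
		for job := range f.jobs {
			job() // if a job blocks (e.g. on a stuck fsync), Stop blocks too
		}
	}()
	return f
}

func (f *fifo) Schedule(job func()) { f.jobs <- job }

// Stop closes the queue and waits for the worker; there is no timeout
// or context here, mirroring the hang described above.
func (f *fifo) Stop() {
	close(f.jobs)
	<-f.finished
}

func main() {
	f := newFIFO()
	// Stand-in for an apply entry whose disk write never completes.
	f.Schedule(func() { select {} })

	stopped := make(chan struct{})
	go func() {
		f.Stop()
		close(stopped)
	}()

	select {
	case <-stopped:
		fmt.Println("scheduler stopped")
	case <-time.After(2 * time.Second):
		fmt.Println("Stop() is stuck waiting for the blocked job")
	}
}
```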
The raft loop was stuck in the middle, and the apply routine is waiting for it (lines 1119 to 1122 in 72d3e38).
I think the proper fix should be to cancel this apply cleanly in the shutdown scenario, even if the disk write is stuck. Right now, the context is ignored (lines 1059 to 1060 in 72d3e38).
Any thoughts? @ahrtr @serathius @spzala
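For illustration only, here is a hedged sketch of the pattern proposed above; the function and channel names are hypothetical, and this is not a patch against etcd's actual code paths:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitApplied sketches the suggested fix: instead of blocking
// unconditionally on the apply-done notification, also select on the
// stopping context so shutdown can proceed even if the raft loop is
// stuck on disk. waitApplied and its arguments are hypothetical names.
func waitApplied(ctx context.Context, applied <-chan struct{}) error {
	select {
	case <-applied:
		return nil // apply finished normally
	case <-ctx.Done():
		// Server is shutting down (e.g. after member removal); stop
		// waiting instead of hanging forever.
		return ctx.Err()
	}
}

func main() {
	// Simulate shutdown being requested while the apply never completes.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	neverApplied := make(chan struct{}) // never closed: disk write is stuck
	if err := waitApplied(ctx, neverApplied); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("gave up waiting for the stuck apply:", err)
	}
}
```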
Just a quick question before I do a deep dive: can you reproduce this on release-3.5 or main?
I think I can, but I haven't tried yet. Will report in a few minutes on release-3.5 or main with the reproduce script shared. Here is one for v3.4.18: reproduce.txt
Here is one for v3.5.4: reproduce-3.5.txt. A side effect I observed: two leaders are reported at a time, but there is actually only one leader!
To elaborate, the symptom we got is that a stale watch connection was not cleaned up on member removal (as it is supposed to be), so the client cache was always outdated...
It might not be safe to forcibly terminate the applying workflow. The most feasible solution for now is to log repeatedly at server.go#L930 and server.go#L978 so as to provide more visibility into the issue.
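As a rough illustration of the "log repeatedly" idea (not the actual server.go change), a ticker-driven warning while the shutdown path is still waiting could look like this:

```go
package main

import (
	"log"
	"time"
)

// warnWhileWaiting emits a warning on every tick until done is closed,
// giving operators visibility into a shutdown that is stuck waiting.
// The message and interval are made up for illustration.
func warnWhileWaiting(done <-chan struct{}, interval time.Duration, msg string) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			log.Printf("WARN: still waiting: %s", msg)
		}
	}
}

func main() {
	done := make(chan struct{})
	go warnWhileWaiting(done, time.Second, "raft loop / apply has not finished during shutdown")
	time.Sleep(3 * time.Second) // stand-in for the stuck wait
	close(done)
}
```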
Please note that the ...
Yeah, it only happens when disk I/O is stuck in the middle; usually it is caused by a data center outage. FYI, we are deploying a fix to the local monitoring agent to forcibly stop the server once it has already been removed from the membership. However, it could have been done in etcd, IMHO.
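A rough sketch of what such an external watchdog agent could look like, using the etcd Go client to ask the healthy peers whether the local member is still in the cluster; the endpoints, member name, and force-stop command are placeholders, not a description of the actual agent:

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// localMemberName is the name of the member this agent sits next to (placeholder).
const localMemberName = "etcd-node-a"

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Ask healthy peers, not the (possibly stuck) local member.
		Endpoints:   []string{"http://peer-1:2379", "http://peer-2:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}

	for _, m := range resp.Members {
		if m.Name == localMemberName {
			return // still a member; nothing to do
		}
	}

	// The local member was removed from the cluster but the process may be
	// stuck in its stopping state; force-stop it (placeholder command).
	log.Printf("member %q no longer in cluster, force-stopping local etcd", localMemberName)
	if err := exec.Command("systemctl", "kill", "-s", "SIGKILL", "etcd.service").Run(); err != nil {
		log.Fatal(err)
	}
}
```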
This looks more like a feature request to force etcd to shut down after being removed from the cluster. My first thought is that killing etcd on disk failure should be part of admin operations.
@chaochn47 Has this issue been resolved? I also encountered the problem of abnormal fluctuations in etcd, which is similar to your situation. |
What happened?
etcd failed to stop and got stuck in a stopping state after it was removed from the membership. It became unresponsive to any requests sent to it.
What did you expect to happen?
I expect etcd to be able to gracefully terminate itself.
How can we reproduce it (as minimally and precisely as possible)?
It was observed during an availability zone outage. The reproduction can be done like the following (see the attached reproduce scripts in the comments and the sketch below).
Here is a similar reproduction, #13527, but it does not include member removal fault injection.
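The attached reproduce scripts carry the exact steps; below is only a rough Go sketch of the shape of that fault injection, assuming an etcd binary built with gofail failpoints and its failpoint HTTP endpoint exposed. The URLs, ports, sleep duration, and member ID are placeholders.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// 1. Make the target member's raft disk writes hang by arming the
	//    raftAfterSave failpoint to sleep. This assumes the member was
	//    started with a gofail build and GOFAIL_HTTP=127.0.0.1:22381 (placeholder).
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:22381/raftAfterSave",
		strings.NewReader("sleep(600000)"))
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()

	// 2. Remove that member from the cluster via a healthy peer.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // healthy peer (placeholder)
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	const stuckMemberID = 0x1234 // placeholder ID of the member with the armed failpoint
	if _, err := cli.MemberRemove(ctx, stuckMemberID); err != nil {
		log.Fatal(err)
	}

	// 3. The removed member should now hang in its stopping state,
	//    reproducing the behavior described in this issue.
	log.Println("member removed; the stuck member is expected to never finish stopping")
}
```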
Anything else we need to know?
Many more "apply request took too long" entries with "error":"context canceled" appeared and continued for almost 2 hours.
rafthttp pipelines termination
...
etcd server stopped with exit code 0
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response