clientv3: Lessor can abort the main loop and keep failing #7880
Comments
I read our doc on the Lease API, and we actually say nothing about the client: https://github.com/coreos/etcd/blob/master/clientv3/lease.go#L91-L111. We also say nothing about the client on the haltErr: https://github.com/coreos/etcd/blob/master/clientv3/lease.go#L76-L81. We probably can change this behavior without breaking our "contract".
OK, I can change the interface for that other one.
(Update) Watch might have the same issue as well: https://github.com/coreos/etcd/blob/master/clientv3/watch.go#L484-L487
Created a PR anyway. I can send another PR to backport the same fix to the release-3.1 branch if needed (yudai@8a4bb15).
Fixed by #7890.
@xiang90 Just a question: is there any page available online that explains the maintenance cycle of releases? I'm wondering whether this fix will be backported to 3.1.8 (or whether the next release will be 3.2 with no 3.1.8). It seems 3.2 will have some non-backward-compatible changes, so we are now thinking we might need to maintain the 3.1 release ourselves for a while if 3.1.7 turns out to be the final version of the 3.1.x line.
@yudai We have our release policy here: https://github.com/coreos/etcd/blob/master/Documentation/dev-internal/release.md#patch-version-release, and we follow http://semver.org as much as we can. Your patch will be in our next 3.2.0-rc.1+ (planned for this Friday). /cc @heyitsanthony
@gyuho Thanks for the reference.
Issue Observed
`lessor.KeepAlive()` keeps raising errors after a server-side problem. In my case, after some trouble, it kept raising the error `leases keep alive halted: rpc error: code = 11 desc = EOF` for every `KeepAlive()` call.
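For context, a minimal sketch of how the failure surfaces to a caller of the Go client; the endpoint, TTL, and error handling are placeholder assumptions, not details taken from the report:

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Placeholder endpoint and timeout, not taken from the report above.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Grant a lease with a 10-second TTL and try to keep it alive.
	lease, err := cli.Grant(cli.Ctx(), 10)
	if err != nil {
		log.Fatal(err)
	}

	// Once the lessor's receive loop has exited, every later KeepAlive call
	// fails immediately with the "keep alive halted" error instead of retrying.
	if _, err := cli.KeepAlive(cli.Ctx(), lease.ID); err != nil {
		log.Println("KeepAlive failed:", err)
	}
}
```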
Issue Detail
I have a private gRPC proxy, which executes some authentication on every request, between the clients and the etcd server. After something (still under investigation) happened, clients started getting the error from `lessor.KeepAlive()` and never recovered.

`lessor.recvKeepAliveLoop()`, which is the main loop of the `lessor`, returns when it gets an error that is not considered retryable. Once it has returned, there is no main loop running (and `lessor.donec` is closed in the deferred func), so `KeepAlive()` returns with the error.
`IsHaltErr()` treats context cancelation and timeouts as halt conditions, which makes sense. However, for gRPC errors, only `codes.Unavailable` and `codes.Internal` are considered retryable, so the function returns `true` for any other gRPC error returned by a server, even when the error is actually retryable.

In my case, the gRPC error with code 11 was considered critical enough to halt and the main loop returned; even after everything recovered, the client kept failing on `KeepAlive()`.
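For reference, a Go paraphrase of the halt check described above; this is a sketch written from the description, not code copied from etcd, and the names here are assumptions:

```go
package example

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
)

// isHaltErr sketches the decision described above: halt when the caller's
// context is done, otherwise treat every gRPC code except Unavailable and
// Internal as fatal. A proxy returning code 11 (OutOfRange) or code 1
// (Canceled) therefore stops the keep-alive loop for good.
func isHaltErr(ctx context.Context, err error) bool {
	if ctx != nil && ctx.Err() != nil {
		return true // the caller canceled or the deadline passed
	}
	if err == nil {
		return false
	}
	code := grpc.Code(err)
	return code != codes.Unavailable && code != codes.Internal
}
```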
You can reproduce the same issue by using the grpcproxy: it sometimes returns `context.Canceled`, which grpc-go turns into a code `1` error. The main server code in `v3api` might return such halt errors as well, but I'm not sure about this.

Suggested Resolution
It's probably hard to list all recoverable errors in `IsHaltErr()`, and that makes it hard for third parties to implement proxies for etcd (even the official grpcproxy has this issue). I therefore think the main loop should not return because of any server-side error. The current client library is supposed to be resilient to server-side errors (it retries automatically), so the loop should return only on the user's context cancel (or timeout).

I'm not sure whether it's better to simply remove the conditions for gRPC errors in `IsHaltErr()`, which is also used by some other methods, or to add a new function like `IsUserCancel(err)`; one possible shape is sketched below. I'll create a PR if this should be resolved on the client side.
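To illustrate the second option, one possible shape for such a helper; the name `IsUserCancel` comes from the suggestion above, and the body is only a sketch of the idea, not the change that was eventually merged:

```go
package example

import "context"

// IsUserCancel reports whether err comes from the caller's own context being
// canceled or timing out. With a helper like this, the keep-alive loop could
// stop only for the user's cancelation and keep retrying every server-side
// error, whatever gRPC code a proxy happens to map it to.
func IsUserCancel(err error) bool {
	return err == context.Canceled || err == context.DeadlineExceeded
}
```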
Otherwise, I think the grpcproxy should at least be fixed so that it does not return context errors.
Appendix: Workaround
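One possible application-level workaround, sketched here under the assumption that the simplest recovery is to discard the wedged client and build a fresh one once `KeepAlive()` starts returning the halted error (an illustration only, not necessarily the workaround used in practice):

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// keepLeaseAlive grants a lease and keeps it alive; when the lessor halts
// (every KeepAlive call starts failing), it throws the client away and
// builds a fresh one. Endpoints and TTL are placeholders.
func keepLeaseAlive(endpoints []string, ttl int64) {
	for {
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   endpoints,
			DialTimeout: 5 * time.Second,
		})
		if err != nil {
			time.Sleep(time.Second)
			continue
		}

		if lease, gerr := cli.Grant(cli.Ctx(), ttl); gerr == nil {
			if ch, kerr := cli.KeepAlive(cli.Ctx(), lease.ID); kerr == nil {
				for range ch {
					// Keep-alive responses arrive here until the stream dies.
				}
			} else {
				log.Println("keepalive halted, recreating client:", kerr)
			}
		}

		cli.Close() // discard the (possibly wedged) client and start over
		time.Sleep(time.Second)
	}
}

func main() {
	keepLeaseAlive([]string{"localhost:2379"}, 10) // placeholder endpoint and TTL
}
```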