clientv3: don't race on upc/downc/switch endpoints in balancer #7842
Conversation
I wonder how hard it is to add a test that reproduces this failure.
Force-pushed 78fba8b to 343d13b
The easiest way to test that the balancer tolerates delays on upc/downc would be a failpoint annotation to inject a sleep. I don't think it's worth going through all that just for this, though.
@heyitsanthony fair enough. I would like to reproduce this issue myself. Is adding a sleep at
@fanminshi add `time.Sleep(time.Second)` to the beginning of the for loop and try running `TestDialSetEndpointsBeforeFail`.
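For illustration, this kind of delay injection is often done with a test hook: a package-level no-op function that tests override with a sleep at the top of the loop to widen the race window. This is a minimal sketch, not etcd's code; `testHookLoopDelay` and `updateLoop` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// testHookLoopDelay is a no-op in production; tests override it with a
// sleep so each loop iteration is delayed, making the race easy to hit.
// (Hypothetical names; not etcd identifiers.)
var testHookLoopDelay = func() {}

// updateLoop models an update-notification loop that calls the hook at
// the top of every iteration before handling the next event.
func updateLoop(events <-chan string, done chan<- string) {
	for ev := range events {
		testHookLoopDelay() // tests inject time.Sleep here
		done <- ev
	}
}

func main() {
	// Override the hook, as a test would, to delay each iteration.
	testHookLoopDelay = func() { time.Sleep(10 * time.Millisecond) }
	events := make(chan string, 1)
	done := make(chan string, 1)
	events <- "endpoint-switch"
	close(events)
	updateLoop(events, done)
	fmt.Println(<-done) // prints "endpoint-switch"
}
```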
@heyitsanthony thanks.
Force-pushed c78e1d1 to 998acd9
If the balancer update notification loop starts with a downed connection and endpoints are switched while the old connection is up, the balancer can potentially wait forever for an up connection without refreshing the connections to reflect the current endpoints. Instead, fetch upc/downc together, caring only about a single transition, either from down->up or up->down, for each iteration.

Simple way to reproduce failures: add time.Sleep(time.Second) to the beginning of the update notification loop.
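The fix described above, fetching upc/downc together and reacting to a single transition per iteration, can be sketched as a select over both channels taken from the same snapshot. This is an illustrative sketch under assumed names (`nextTransition`, `upc`, `downc`), not etcd's actual balancer code.

```go
package main

import "fmt"

// nextTransition takes both notification channels from the same snapshot
// and waits for exactly one transition. Because upc and downc are fetched
// together, an endpoint switch between iterations cannot leave the loop
// blocked on a stale channel pair. (Illustrative sketch, not etcd code.)
func nextTransition(upc, downc <-chan struct{}) string {
	select {
	case <-upc:
		return "down->up"
	case <-downc:
		return "up->down"
	}
}

func main() {
	up, down := make(chan struct{}), make(chan struct{})
	close(up) // simulate the connection coming up
	fmt.Println(nextTransition(up, down)) // prints "down->up"

	up2 := make(chan struct{})
	close(down) // simulate the connection going down
	fmt.Println(nextTransition(up2, down)) // prints "up->down"
}
```

The key design point is that the loop never waits on a channel fetched in an earlier iteration; each pass re-snapshots both channels before blocking.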
Force-pushed 998acd9 to 61abf25
Connection pausing added another exit condition in the listener path, causing the bridge to leak connections instead of closing them when signalled to close. Also adds some additional Close paranoia.

Fixes etcd-io#7823
#7823 is now fixed by this.
Is the CI failure related to this change?
https://jenkins-etcd-public.prod.coreos.systems/job/etcd-ci-ppc64/968/console
@gyuho I can't reproduce this leak spinning on the test. I think it's grpc being bad about teardown: there's no synchronization on the lbWatch goroutine.
lgtm. thanks!
Found while investigating #7823, but not a fix.