
clientv3: don't race on upc/downc/switch endpoints in balancer #7842

Merged · 2 commits · May 3, 2017

Conversation

heyitsanthony
Contributor

If the balancer update notification loop starts with a downed
connection and endpoints are switched while the old connection is up,
the balancer can potentially wait forever for an up connection without
refreshing the connections to reflect the current endpoints.

Instead, fetch upc/downc together, caring only about a single transition
(either down->up or up->down) in each iteration.
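
A minimal sketch of that idea in Go (not the merged patch; the type, field, and helper names here, such as balancer, pinAddr, notifyAddrs, and notifyCh, are illustrative stand-ins for the clientv3 simple balancer):

```go
package balancer

import "sync"

// balancer is a pared-down stand-in for the clientv3 simple balancer; only
// the fields the notify loop touches are shown.
type balancer struct {
	mu       sync.RWMutex
	upc      chan struct{} // closed once the pinned connection is up
	downc    chan struct{} // closed once the pinned connection is down
	pinAddr  string
	notifyCh chan []string // addresses handed to the gRPC balancer machinery
	stopc    chan struct{}
	donec    chan struct{}
}

// notifyAddrs would republish the full endpoint list; elided here.
func (b *balancer) notifyAddrs() {}

func (b *balancer) updateNotifyLoop() {
	defer close(b.donec)
	for {
		// Snapshot both channels together under the lock so a concurrent
		// endpoint switch cannot leave the loop blocked on a stale channel.
		b.mu.RLock()
		upc, downc, addr := b.upc, b.downc, b.pinAddr
		b.mu.RUnlock()

		// By construction one of the two channels is already closed; a
		// successful non-blocking receive marks that transition as done.
		select {
		case <-downc:
			downc = nil
		default:
		}
		select {
		case <-upc:
			upc = nil
		default:
		}

		switch {
		case downc == nil && upc == nil:
			// Stale snapshot (e.g. taken mid-switch); re-read the channels.
			select {
			case <-b.stopc:
				return
			default:
			}
		case downc == nil:
			// Currently down: republish endpoints, then wait for an up.
			b.notifyAddrs()
			select {
			case <-upc:
			case <-b.stopc:
				return
			}
		case upc == nil:
			// Currently up: pin the address, then wait for a down.
			select {
			case b.notifyCh <- []string{addr}:
			case <-downc:
			case <-b.stopc:
				return
			}
		}
	}
}
```

The point of the structure is that upc, downc, and the pinned address are read as a single snapshot, and each iteration waits on at most one pending transition before taking a fresh snapshot.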

Simple way to reproduce failures: add time.Sleep(time.Second) to the
beginning of the update notification loop.

Found while investigating #7823, but not a fix.

@fanminshi
Member

> Simple way to reproduce failures: add time.Sleep(time.Second) to the beginning of the update notification loop.

I wonder how hard it would be to add a test that reproduces this failure.

@heyitsanthony
Contributor Author

The easiest way to test the balancer tolerates delays on upc/downc would be a failpoint annotation to inject a sleep. I don't think it's worth going through all that just for this, though.
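
For context, a gofail-style failpoint is just a comment annotation at the point where the delay should be injected. A sketch (the failpoint name balancerNotifyDelay and its placement are hypothetical; nothing like it was added by this PR):

```go
package balancer

// waitWithFailpoint shows where a gofail annotation could sit; the gofail
// tool rewrites the comment into injectable code when failpoints are enabled.
func waitWithFailpoint(upc, downc, stopc <-chan struct{}) {
	// gofail: var balancerNotifyDelay struct{}
	select {
	case <-upc:
	case <-downc:
	case <-stopc:
	}
}
```

With failpoints compiled in via `gofail enable`, a delay could then be switched on at run time through the GOFAIL_FAILPOINTS environment variable with a term along the lines of `balancerNotifyDelay=sleep(1000)` (the exact failpoint path depends on the package).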

@fanminshi
Member

@heyitsanthony fair enough. I would like to reproduce this issue myself. Is adding a sleep at the beginning of the update notification loop sufficient to trigger it? I think we also need to update the endpoints in a certain way along with the sleep, right?

@heyitsanthony
Contributor Author

@fanminshi add time.Sleep(time.Second) to the beginning of the for loop and try running TestDialSetEndpointsBeforeFail
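
Spelled out, the reproduction is a one-line delay at the top of the loop sketched earlier plus a targeted test run (the elided body is unchanged):

```go
func (b *balancer) updateNotifyLoop() {
	defer close(b.donec)
	for {
		// Added only to reproduce the failure: the delay widens the window in
		// which an endpoint switch can replace upc/downc and the connection
		// state can change before the loop takes its next snapshot.
		time.Sleep(time.Second)

		// ... rest of the loop body as sketched above ...
	}
}
```

Running something along the lines of `go test -run TestDialSetEndpointsBeforeFail ./clientv3/integration` (invocation assumed) should then start failing against the pre-fix balancer.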

@fanminshi
Member

@heyitsanthony thanks.

heyitsanthony force-pushed the fix-switch-race branch 4 times, most recently from c78e1d1 to 998acd9 on May 2, 2017 at 23:42
Connection pausing added another exit condition in the listener
path, causing the bridge to leak connections instead of closing
them when signalled to close. Also adds some additional Close
paranoia.

Fixes etcd-io#7823
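
For the bridge side (the test proxy used by the clientv3 integration tests), a rough sketch of the bookkeeping the commit message describes; the names and structure are illustrative, not the actual test code:

```go
package bridge

import (
	"net"
	"sync"
)

// bridge is an illustrative stand-in for the integration-test proxy.
type bridge struct {
	mu    sync.Mutex
	conns map[net.Conn]struct{}
	stopc chan struct{}
	wg    sync.WaitGroup
}

func (b *bridge) serveListen(l net.Listener) {
	defer l.Close()
	for {
		conn, err := l.Accept()
		if err != nil {
			return
		}
		select {
		case <-b.stopc:
			// Signalled to close while accepting: close rather than leak.
			conn.Close()
			return
		default:
		}
		b.mu.Lock()
		b.conns[conn] = struct{}{}
		b.mu.Unlock()
		b.wg.Add(1)
		go b.serveConn(conn)
	}
}

func (b *bridge) serveConn(conn net.Conn) {
	// Close and untrack on every exit path, including pause/stop paths that
	// could otherwise return without cleaning up.
	defer func() {
		conn.Close()
		b.mu.Lock()
		delete(b.conns, conn)
		b.mu.Unlock()
		b.wg.Done()
	}()
	// ... proxy bytes between conn and the backend until told to stop ...
	<-b.stopc
}
```

The essential point matches the commit message: every accepted connection has a single owner responsible for closing it, so an early exit on the pause path no longer leaks it.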
@heyitsanthony
Contributor Author

#7823 now fixed by this

@gyuho
Contributor

gyuho commented May 3, 2017

Is the CI failure related to this change?

--- PASS: TestWatchCancelDisconnected (0.10s)
PASS
2017-05-02 20:18:04.284186 I | etcdserver/api/v3rpc: grpc: addrConn.transportMonitor exits due to: context canceled
Too many goroutines running after all test(s).
1 instances of:
google.golang.org/grpc.(*addrConn).resetTransport(...)
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:832 +0x698
google.golang.org/grpc.(*addrConn).transportMonitor(...)
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:912 +0x288
google.golang.org/grpc.(*ClientConn).resetAddrConn.func1(...)
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:614 +0x1bc
created by google.golang.org/grpc.(*ClientConn).resetAddrConn
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:615 +0x328
1 instances of:
google.golang.org/grpc.(*ClientConn).lbWatcher(...)
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:481 +0x70
created by google.golang.org/grpc.DialContext
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:424 +0x46c
exit status 1
FAIL	github.com/coreos/etcd/clientv3/integration	221.534s

https://jenkins-etcd-public.prod.coreos.systems/job/etcd-ci-ppc64/968/console

@heyitsanthony
Contributor Author

@gyuho I can't reproduce this leak by spinning on the test. I think it's grpc handling teardown badly: there's no synchronization on the lbWatcher goroutine.

@gyuho
Contributor

gyuho commented May 3, 2017

lgtm. thanks!
/cc @xiang90 @fanminshi
