
clientv3: don't race on upc/downc/switch endpoints in balancer #7842

Merged · 2 commits · May 3, 2017

Conversation

heyitsanthony
Contributor

If the balancer update notification loop starts with a downed
connection and endpoints are switched while the old connection is up,
the balancer can potentially wait forever for an up connection without
refreshing the connections to reflect the current endpoints.

Instead, fetch upc/downc together, caring only about a single transition
(either down->up or up->down) in each iteration.
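
A minimal sketch of that idea in Go (not the merged patch; the type, field, and helper names here, such as balancer, pinAddr, notifyAddrs, and notifyCh, are illustrative stand-ins for the clientv3 simple balancer):

```go
package balancer

import "sync"

// balancer is a pared-down stand-in for the clientv3 simple balancer; only
// the fields the notify loop touches are shown.
type balancer struct {
	mu       sync.RWMutex
	upc      chan struct{} // closed once the pinned connection is up
	downc    chan struct{} // closed once the pinned connection is down
	pinAddr  string
	notifyCh chan []string // addresses handed to the gRPC balancer machinery
	stopc    chan struct{}
	donec    chan struct{}
}

// notifyAddrs would republish the full endpoint list; elided here.
func (b *balancer) notifyAddrs() {}

func (b *balancer) updateNotifyLoop() {
	defer close(b.donec)
	for {
		// Snapshot both channels together under the lock so a concurrent
		// endpoint switch cannot leave the loop blocked on a stale channel.
		b.mu.RLock()
		upc, downc, addr := b.upc, b.downc, b.pinAddr
		b.mu.RUnlock()

		// By construction one of the two channels is already closed; a
		// successful non-blocking receive marks that transition as done.
		select {
		case <-downc:
			downc = nil
		default:
		}
		select {
		case <-upc:
			upc = nil
		default:
		}

		switch {
		case downc == nil && upc == nil:
			// Stale snapshot (e.g. taken mid-switch); re-read the channels.
			select {
			case <-b.stopc:
				return
			default:
			}
		case downc == nil:
			// Currently down: republish endpoints, then wait for an up.
			b.notifyAddrs()
			select {
			case <-upc:
			case <-b.stopc:
				return
			}
		case upc == nil:
			// Currently up: pin the address, then wait for a down.
			select {
			case b.notifyCh <- []string{addr}:
			case <-downc:
			case <-b.stopc:
				return
			}
		}
	}
}
```

The point of the structure is that upc, downc, and the pinned address are read as a single snapshot, and each iteration waits on at most one pending transition before taking a fresh snapshot.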

Simple way to reproduce failures: add time.Sleep(time.Second) to the
beginning of the update notification loop.

Found while investigating #7823, but not a fix.

@fanminshi
Member

> Simple way to reproduce failures: add time.Sleep(time.Second) to the beginning of the update notification loop.

I wonder how hard it would be to add a test that reproduces this failure.

@heyitsanthony
Contributor Author

The easiest way to test the balancer tolerates delays on upc/downc would be a failpoint annotation to inject a sleep. I don't think it's worth going through all that just for this, though.
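
For context, a gofail-style failpoint is just a comment annotation at the point where the delay should be injected. A sketch (the failpoint name balancerNotifyDelay and its placement are hypothetical; nothing like it was added by this PR):

```go
package balancer

// waitWithFailpoint shows where a gofail annotation could sit; the gofail
// tool rewrites the comment into injectable code when failpoints are enabled.
func waitWithFailpoint(upc, downc, stopc <-chan struct{}) {
	// gofail: var balancerNotifyDelay struct{}
	select {
	case <-upc:
	case <-downc:
	case <-stopc:
	}
}
```

With failpoints compiled in via `gofail enable`, a delay could then be switched on at run time through the GOFAIL_FAILPOINTS environment variable with a term along the lines of `balancerNotifyDelay=sleep(1000)` (the exact failpoint path depends on the package).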

@fanminshi
Member

@heyitsanthony fair enough. I would like to reproduce this issue myself. Is adding a sleep at the beginning of the update notification loop sufficient to trigger it? I think we also need to update the endpoints in a certain way along with the sleep, right?

@heyitsanthony
Contributor Author

@fanminshi add time.Sleep(time.Second) to the beginning of the for loop and try running TestDialSetEndpointsBeforeFail
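
Spelled out, the reproduction is a one-line delay at the top of the loop sketched earlier plus a targeted test run (the elided body is unchanged):

```go
func (b *balancer) updateNotifyLoop() {
	defer close(b.donec)
	for {
		// Added only to reproduce the failure: the delay widens the window in
		// which an endpoint switch can replace upc/downc and the connection
		// state can change before the loop takes its next snapshot.
		time.Sleep(time.Second)

		// ... rest of the loop body as sketched above ...
	}
}
```

Running something along the lines of `go test -run TestDialSetEndpointsBeforeFail ./clientv3/integration` (invocation assumed) should then start failing against the pre-fix balancer.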

@fanminshi
Member

@heyitsanthony thanks.

heyitsanthony force-pushed the fix-switch-race branch 4 times, most recently from c78e1d1 to 998acd9 on May 2, 2017 at 23:42
Connection pausing added another exit condition in the listener
path, causing the bridge to leak connections instead of closing
them when signalled to close. Also adds some additional Close
paranoia.

Fixes etcd-io#7823
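
For the bridge side (the test proxy used by the clientv3 integration tests), a rough sketch of the bookkeeping the commit message describes; the names and structure are illustrative, not the actual test code:

```go
package bridge

import (
	"net"
	"sync"
)

// bridge is an illustrative stand-in for the integration-test proxy.
type bridge struct {
	mu    sync.Mutex
	conns map[net.Conn]struct{}
	stopc chan struct{}
	wg    sync.WaitGroup
}

func (b *bridge) serveListen(l net.Listener) {
	defer l.Close()
	for {
		conn, err := l.Accept()
		if err != nil {
			return
		}
		select {
		case <-b.stopc:
			// Signalled to close while accepting: close rather than leak.
			conn.Close()
			return
		default:
		}
		b.mu.Lock()
		b.conns[conn] = struct{}{}
		b.mu.Unlock()
		b.wg.Add(1)
		go b.serveConn(conn)
	}
}

func (b *bridge) serveConn(conn net.Conn) {
	// Close and untrack on every exit path, including pause/stop paths that
	// could otherwise return without cleaning up.
	defer func() {
		conn.Close()
		b.mu.Lock()
		delete(b.conns, conn)
		b.mu.Unlock()
		b.wg.Done()
	}()
	// ... proxy bytes between conn and the backend until told to stop ...
	<-b.stopc
}
```

The essential point matches the commit message: every accepted connection has a single owner responsible for closing it, so an early exit on the pause path no longer leaks it.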
@heyitsanthony
Contributor Author

#7823 now fixed by this

@gyuho
Contributor

gyuho commented May 3, 2017

Is the CI failure related to this change?

--- PASS: TestWatchCancelDisconnected (0.10s)
PASS
2017-05-02 20:18:04.284186 I | etcdserver/api/v3rpc: grpc: addrConn.transportMonitor exits due to: context canceled
Too many goroutines running after all test(s).
1 instances of:
google.golang.org/grpc.(*addrConn).resetTransport(...)
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:832 +0x698
google.golang.org/grpc.(*addrConn).transportMonitor(...)
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:912 +0x288
google.golang.org/grpc.(*ClientConn).resetAddrConn.func1(...)
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:614 +0x1bc
created by google.golang.org/grpc.(*ClientConn).resetAddrConn
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:615 +0x328
1 instances of:
google.golang.org/grpc.(*ClientConn).lbWatcher(...)
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:481 +0x70
created by google.golang.org/grpc.DialContext
	/var/jenkins_home/workspace/etcd-ci-ppc64/gopath/src/google.golang.org/grpc/clientconn.go:424 +0x46c
exit status 1
FAIL	github.com/coreos/etcd/clientv3/integration	221.534s

https://jenkins-etcd-public.prod.coreos.systems/job/etcd-ci-ppc64/968/console

@heyitsanthony
Contributor Author

@gyuho I can't reproduce this leak by spinning on the test. I think it's grpc handling teardown badly: there's no synchronization on the lbWatcher goroutine.

@gyuho
Contributor

gyuho commented May 3, 2017

lgtm. thanks!
/cc @xiang90 @fanminshi
