etcdmain: configure keep-alive policy from server #8258
Conversation
Force-pushed from b2e14fa to 0be8a68.
@@ -0,0 +1,36 @@
// Copyright 2017 The etcd Authors
Instead of a separate package under etcdserver, could this all go in embed and extend v3rpc.Server() to accept a list of server options?
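A minimal sketch of what that could look like, assuming a variadic `gopts` parameter; the helper name and TLS handling are illustrative, not the real v3rpc signature:

```go
package v3rpcsketch

import (
	"crypto/tls"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// newServer builds the base options as v3rpc does today and lets the caller
// (embed) append extra grpc.ServerOptions, e.g. a keepalive enforcement policy.
func newServer(tlsCfg *tls.Config, gopts ...grpc.ServerOption) *grpc.Server {
	var opts []grpc.ServerOption
	if tlsCfg != nil {
		opts = append(opts, grpc.Creds(credentials.NewTLS(tlsCfg)))
	}
	opts = append(opts, gopts...)
	return grpc.NewServer(opts...)
}
```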
clientv3/client.go
	opts = append(opts, grpc.WithKeepaliveParams(params))

	params := keepalive.ClientParameters{
		Time: kp.Policy.MinTime,
Is there a reason why this can't be kept user configurable?
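For reference, a minimal sketch of the user-configurable alternative, using the DialKeepAliveTime field that appears later in this thread (the Timeout field here is illustrative): the ping interval comes from the client config rather than being pinned to the server policy's MinTime.

```go
package clientsketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// keepAliveDialOption builds the dial option from user-supplied values.
func keepAliveDialOption(keepAliveTime, keepAliveTimeout time.Duration) grpc.DialOption {
	return grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:    keepAliveTime,    // user-chosen interval between pings on an idle connection
		Timeout: keepAliveTimeout, // how long to wait for a ping ack before closing the connection
	})
}
```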
Force-pushed from c3561de to 82ccb21.
We need a way to check whether the client is configured to ping too frequently and trigger too_many_pings from the server (slow pings are fine). So there needs to be a separate package to import the server-side configuration. If we defined it in embed, clientv3 would need to vendor all server-side dependencies.
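A sketch of that separate package (path and name hypothetical): it carries only the policy value, so both the server and clientv3 can import it without clientv3 vendoring any server-side code.

```go
// Package keepalivepolicy is a dependency-free home for the server's
// keep-alive enforcement policy values (hypothetical sketch).
package keepalivepolicy

import "time"

// MinTime is the smallest ping interval the server tolerates; clients that
// ping more often get a GOAWAY with too_many_pings.
const MinTime = 5 * time.Second
```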
clientv3/client.go
@@ -217,6 +218,10 @@ func (c *Client) dialSetupOpts(endpoint string, dopts ...grpc.DialOption) (opts
		opts = []grpc.DialOption{grpc.WithTimeout(c.cfg.DialTimeout)}
	}
	if c.cfg.DialKeepAliveTime > 0 {
		// prevent incompatible settings that can lead to connection close from server-side
It should be OK to let this be misconfigured and let the client have trouble (instead of panicking), since it's opt-in. In the future the etcd server may need a lower MinTime, and it'd be nicer if older clients could use it without any trouble.
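A sketch of that opt-in behavior: warn rather than panic when the configured interval is below the server minimum; serverMinTime here is illustrative, not an actual clientv3 field.

```go
package clientsketch

import (
	"log"
	"time"
)

// checkKeepAliveTime keeps the user's value and only logs when it looks too
// aggressive, so older clients still work if the server lowers MinTime later.
func checkKeepAliveTime(requested, serverMinTime time.Duration) time.Duration {
	if requested > 0 && requested < serverMinTime {
		log.Printf("keepalive time %v is below server minimum %v; the server may close the connection with too_many_pings",
			requested, serverMinTime)
	}
	return requested
}
```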
Force-pushed from de2e92d to aaf3ae3.
etcdmain/config.go
@@ -143,6 +144,9 @@ func newConfig() *config {
	fs.Int64Var(&cfg.QuotaBackendBytes, "quota-backend-bytes", cfg.QuotaBackendBytes, "Raise alarms when backend size exceeds the given quota. 0 means use the default quota.")
	fs.UintVar(&cfg.MaxTxnOps, "max-txn-ops", cfg.MaxTxnOps, "Maximum number of operations permitted in a transaction.")
	fs.UintVar(&cfg.MaxRequestBytes, "max-request-bytes", cfg.MaxRequestBytes, "Maximum client request size in bytes the server will accept.")
	fs.DurationVar(&cfg.KeepAliveEnforcementPolicyMinTime, "keepalive-min-time", time.Duration(0), "Minimum interval duration that a client should wait before pinging server.")
Should we give a good default value? Can we add a gRPC prefix to these flags?
Force-pushed from 899b1ae to fcbb66d.
@gyuho should gRPC health checking be kept to support #8308? HTTP/2 keepalives don't give any indication that the client should change endpoints, and even watcher leader checking isn't enough for the general case (e.g., if a put fails on a partitioned member, it'd be nice if the client would try to switch endpoints).
Health checking is also helpful to detect an HTTP/2 blackhole (an application that can talk HTTP/2, but is not an etcd server).
@xiang90 yes, I think etcd needs both.
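For context, a hedged sketch of what keeping gRPC health checking could look like on the server side, using grpc-go's standard health service (whether etcd wires it exactly this way is not settled in this thread):

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Expose the standard gRPC health service so clients can tell apart
	// "speaks HTTP/2" from "is actually a serving etcd member".
	srv := grpc.NewServer()
	hs := health.NewServer()
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthpb.RegisterHealthServer(srv, hs)

	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("health service listening on", l.Addr())
	if err := srv.Serve(l); err != nil {
		log.Fatal(err)
	}
}
```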
A clientv3/integration or e2e test to confirm that a watch will detect the dead connection and switch to another endpoint would be good for knowing this works.
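A sketch of such a test, reusing the helpers that appear later in this thread (integration.NewClusterV3, GRPCAddr, Blackhole); the test name, import paths, and clus.Client usage are assumptions based on that usage:

```go
package integrationsketch

import (
	"context"
	"testing"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/integration"
)

func TestWatchSwitchesEndpointOnBlackhole(t *testing.T) {
	clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 3})
	defer clus.Terminate(t)

	// Watch through a client that knows about two endpoints.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{clus.Members[0].GRPCAddr(), clus.Members[1].GRPCAddr()},
	})
	if err != nil {
		t.Fatal(err)
	}
	defer cli.Close()
	wch := cli.Watch(context.Background(), "foo")

	// Blackhole the member the watch is most likely connected to.
	clus.Members[0].Blackhole()

	// Put through a healthy member; the watch should still deliver the event
	// by failing over to the other endpoint.
	if _, err := clus.Client(1).Put(context.Background(), "foo", "bar"); err != nil {
		t.Fatal(err)
	}
	select {
	case wresp := <-wch:
		if len(wresp.Events) != 1 {
			t.Fatalf("expected 1 event, got %d", len(wresp.Events))
		}
	case <-time.After(10 * time.Second):
		t.Fatal("watch did not fail over to a healthy endpoint")
	}
}
```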
etcdmain/config.go
@@ -143,6 +144,9 @@ func newConfig() *config {
	fs.Int64Var(&cfg.QuotaBackendBytes, "quota-backend-bytes", cfg.QuotaBackendBytes, "Raise alarms when backend size exceeds the given quota. 0 means use the default quota.")
	fs.UintVar(&cfg.MaxTxnOps, "max-txn-ops", cfg.MaxTxnOps, "Maximum number of operations permitted in a transaction.")
	fs.UintVar(&cfg.MaxRequestBytes, "max-request-bytes", cfg.MaxRequestBytes, "Maximum client request size in bytes the server will accept.")
	fs.DurationVar(&cfg.GRPCKeepAliveEnforcementPolicyMinTime, "grpc-keepalive-min-time", 5*time.Second, "Minimum interval duration that a client should wait before pinging server.")
This default should be set in embed.NewConfig().
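A sketch of that: keep the default in embed so etcdmain only references it. The constant name and trimmed Config type below are illustrative; the field name follows the diff above.

```go
package embedsketch

import "time"

// DefaultGRPCKeepAliveMinTime would live in embed and be the single source
// of the 5s default used by the etcdmain flag.
const DefaultGRPCKeepAliveMinTime = 5 * time.Second

// Config is trimmed to the one field relevant here.
type Config struct {
	GRPCKeepAliveEnforcementPolicyMinTime time.Duration `json:"grpc-keepalive-min-time"`
}

func NewConfig() *Config {
	return &Config{GRPCKeepAliveEnforcementPolicyMinTime: DefaultGRPCKeepAliveMinTime}
}
```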
embed/config.go
	// sends goaway and closes the connection (too_many_pings). When too slow,
	// nothing happens. Server expects client pings only when there is any
	// active streams by setting 'PermitWithoutStream' false.
	GRPCKeepAliveEnforcementPolicyMinTime time.Duration `json:"grpc-keepalive-min-time"`
Should the names match the JSON more closely? Like GRPCKeepAliveMinTime.
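A sketch with the shorter name, and how the server side would map it onto gRPC's enforcement policy; the mapping function and package are illustrative.

```go
package rpcsketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// Config uses the shorter field name suggested above.
type Config struct {
	GRPCKeepAliveMinTime time.Duration `json:"grpc-keepalive-min-time"`
}

// serverOptions turns the config value into the gRPC server option.
func serverOptions(cfg Config) []grpc.ServerOption {
	return []grpc.ServerOption{
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             cfg.GRPCKeepAliveMinTime,
			PermitWithoutStream: false, // only expect pings while streams are active
		}),
	}
}
```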
etcdmain/help.go
@@ -54,6 +54,12 @@ member flags:
	time (in milliseconds) of a heartbeat interval.
--election-timeout '1000'
	time (in milliseconds) for an election to timeout. See tuning documentation for details.
--grpc-keepalive-interval '5s'
	minimum duration interval that a client should wait before pinging server.
--grpc-keepalive-interval '0s'
--grpc-keepalive-interval is already given above.
Force-pushed from 43811e4 to 9c716ba.
@heyitsanthony I am tracking this down through gRPC, and now into the Go dialing code. The problem is that even if we blackhole the endpoint, Go's http dialContext succeeds without failure.

func TestDial(t *testing.T) {
	capnslog.SetGlobalLogLevel(capnslog.CRITICAL)
	clus := integration.NewClusterV3(t, &integration.ClusterConfig{
		Size: 3,
	})
	defer func() {
		capnslog.SetGlobalLogLevel(capnslog.CRITICAL)
		clus.Terminate(t)
	}()

	ccfg := clientv3.Config{
		Endpoints: []string{clus.Members[0].GRPCAddr()},
	}

	fmt.Println("blackhole-d 1", clus.Members[0].GRPCAddr())
	clus.Members[0].Blackhole()
	fmt.Println("blackhole-d 2", clus.Members[0].GRPCAddr())
	time.Sleep(3 * time.Second)

	capnslog.SetGlobalLogLevel(capnslog.INFO)
	cli, err := clientv3.New(ccfg)
	if err != nil {
		t.Fatal(err)
	}
	defer cli.Close()

	fmt.Println("PUT 1")
	if _, err := cli.Put(context.TODO(), "foo", "bar"); err != nil {
		t.Fatal(err)
	}
	fmt.Println("PUT 2")
	fmt.Println("DONE")
}

https://github.com/coreos/etcd/blob/master/clientv3/client.go#L244-L245

dialer := &net.Dialer{Timeout: t}
conn, err := dialer.DialContext(c.ctx, proto, host)

This always succeeds, even with a blackholed endpoint. The grpc-go HTTP/2 client then establishes a transport to the blackholed endpoint but fails on keepalive, so it keeps retrying. Keepalive fails because no frame is ever received, even though the initial connection succeeds.
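For reference, a sketch of the client-side knobs that make the blackhole detectable at all: with keepalive Time/Timeout set on the dial options, the unanswered HTTP/2 ping eventually closes the dead transport instead of leaving it open forever. The values and helper name are illustrative.

```go
package clientsketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func blackholeAwareDialOptions() []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    10 * time.Second, // ping after 10s without frames
			Timeout: 3 * time.Second,  // close the connection if the ping ack never arrives
		}),
	}
}
```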
Dial should be able to connect in this case, and retrying is OK if there's only one endpoint. Is there starvation when trying to fail over to another endpoint?
Got it. grpc-go, as long as the connectivity state is not Shutdown, will keep retrying, even against the blackholed endpoint. I am experimenting with manually flagging the blackholed endpoint so that the balancer does not switch to blackholed ones.

EDIT: The connection works fine on the blackholed endpoint, and grpc-go never calls

TODO: maybe another PR
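A purely illustrative sketch of that manual-flagging experiment; none of these names exist in clientv3, it only shows the shape of letting endpoint selection skip members that look blackholed.

```go
package clientsketch

import "sync"

type endpointHealth struct {
	mu        sync.Mutex
	unhealthy map[string]bool
}

func newEndpointHealth() *endpointHealth {
	return &endpointHealth{unhealthy: make(map[string]bool)}
}

// markUnhealthy flags an endpoint, e.g. after a keepalive timeout against it.
func (e *endpointHealth) markUnhealthy(ep string) {
	e.mu.Lock()
	e.unhealthy[ep] = true
	e.mu.Unlock()
}

// pick returns the first endpoint that has not been flagged.
func (e *endpointHealth) pick(eps []string) (string, bool) {
	e.mu.Lock()
	defer e.mu.Unlock()
	for _, ep := range eps {
		if !e.unhealthy[ep] {
			return ep, true
		}
	}
	return "", false
}
```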
Signed-off-by: Gyu-Ho Lee <[email protected]>
Signed-off-by: Gyu-Ho Lee <[email protected]>
Signed-off-by: Gyu-Ho Lee <[email protected]>
Signed-off-by: Gyu-Ho Lee <[email protected]>
Signed-off-by: Gyu-Ho Lee <[email protected]>
Force-pushed from 97bfa47 to e5d8187.
Signed-off-by: Gyu-Ho Lee <[email protected]>
Signed-off-by: Gyu-Ho Lee <[email protected]>
Closing. Need more research into the grpc-go logic. Will open a new PR soon.
Parameters should match the ones in the server.
https://github.com/grpc/grpc-go/blob/master/keepalive/keepalive.go#L30
Updates #8199.
Addresses #8022.