
etcdmain: configure keep-alive policy from server #8258

Closed
wants to merge 7 commits

Conversation

gyuho
Contributor

@gyuho gyuho commented Jul 13, 2017

Client parameters should match the ones in the server.

https://github.com/grpc/grpc-go/blob/master/keepalive/keepalive.go#L30

...
// Make sure these parameters are set in coordination with the keepalive policy on the server,
// as incompatible settings can result in closing of connection.
type ClientParameters struct {
  ...

Updates #8199.
Addresses #8022.
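
For reference, a minimal self-contained sketch (not the code from this PR; addresses and durations are illustrative) of the relationship being configured here: the server's EnforcementPolicy and the client's ClientParameters have to agree, otherwise the server closes the connection with too_many_pings.

package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Server side: reject keepalive pings that arrive more often than MinTime.
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             5 * time.Second, // illustrative value
			PermitWithoutStream: false,           // pings expected only while streams are active
		}),
	)
	go srv.Serve(lis)
	defer srv.Stop()

	// Client side: Time must not be shorter than the server's MinTime, or the
	// server sends GOAWAY (too_many_pings) and closes the connection.
	conn, err := grpc.Dial(lis.Addr().String(),
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    10 * time.Second, // illustrative value, >= server MinTime
			Timeout: 3 * time.Second,
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}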

@@ -0,0 +1,36 @@
// Copyright 2017 The etcd Authors
Contributor

instead of a separate package under etcdserver, could this all go in embed and extend v3rpc.Server() to accept a list of server options?

opts = append(opts, grpc.WithKeepaliveParams(params))

params := keepalive.ClientParameters{
Time: kp.Policy.MinTime,
Contributor

is there a reason why this can't be kept user configurable?
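
A rough sketch of the alternative suggested above, extending v3rpc.Server() to accept extra grpc.ServerOption values so embed could pass the keepalive policy through; the signature and package layout here are assumed, not the actual etcd code.

// Sketch of the suggestion above: let v3rpc.Server() take extra
// grpc.ServerOption values so embed can pass the keepalive policy through.
// The signature and package layout are assumed, not the actual etcd code.
package v3rpcsketch

import (
	"crypto/tls"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func Server(tlscfg *tls.Config, gopts ...grpc.ServerOption) *grpc.Server {
	var opts []grpc.ServerOption
	if tlscfg != nil {
		opts = append(opts, grpc.Creds(credentials.NewTLS(tlscfg)))
	}
	// extra options from the caller, e.g. grpc.KeepaliveEnforcementPolicy(...)
	opts = append(opts, gopts...)
	// the real function would also register the etcd RPC services here
	return grpc.NewServer(opts...)
}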

@heyitsanthony
Contributor

EnforcementPolicy could be an experimental flag. Old clients won't be able to reply to the pings so they'll be periodically disconnected, right?

@etcd-io etcd-io deleted a comment from codecov-io Jul 20, 2017
@gyuho gyuho force-pushed the keepalive branch 2 times, most recently from c3561de to 82ccb21 on July 20, 2017 18:20
@gyuho
Contributor Author

gyuho commented Jul 20, 2017

@heyitsanthony

instead of a separate package under etcdserver, could this all go in embed and extend v3rpc.Server() to accept a list of server options?

We need a way to check whether the client is configured to ping so frequently that it triggers too_many_pings on the server (slower pings are fine). So there needs to be a separate package from which the client can import the server-side configuration. If we defined that in embed, clientv3 would need to vendor all the server-side dependencies.
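
A minimal sketch of what such a shared package could look like: it exports only the policy values, so clientv3 can import it without pulling in embed or other server-side dependencies. The package path, names, and default value here are hypothetical.

// Hypothetical shared package (path and names invented for illustration):
// it carries only the server's keepalive enforcement values, so clientv3 can
// import it without depending on embed or any other server-side code.
package srvkeepalive

import "time"

// Policy mirrors the server-side gRPC keepalive enforcement settings.
type Policy struct {
	// MinTime is the minimum interval a client must wait between pings;
	// pinging faster makes the server reply GOAWAY (too_many_pings).
	MinTime time.Duration
	// PermitWithoutStream, when false, means the server only expects
	// pings while there is at least one active stream.
	PermitWithoutStream bool
}

// Default is an illustrative default only.
var Default = Policy{
	MinTime:             5 * time.Second,
	PermitWithoutStream: false,
}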

@@ -217,6 +218,10 @@ func (c *Client) dialSetupOpts(endpoint string, dopts ...grpc.DialOption) (opts
opts = []grpc.DialOption{grpc.WithTimeout(c.cfg.DialTimeout)}
}
if c.cfg.DialKeepAliveTime > 0 {
// prevent incompatible settings that can lead to connection close from server-side
Contributor

It should be OK to let this be misconfigured and let the client have trouble (instead of the panic) since it's opt-in. In the future the etcd server may need <MinTime, and it'd be nicer if older clients could use it without any trouble.
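
For context, a sketch of the opt-in client-side wiring being discussed, with no validation against the server policy (so a misconfigured client simply gets too_many_pings from the server rather than a panic); the helper name and parameters are illustrative, not the actual clientv3 code.

package clientv3sketch // sketch only, not the actual clientv3 code

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// keepAliveDialOpts adds keepalive dial options only when the user opted in
// (keepAliveTime > 0) and deliberately does not cross-check the server's
// MinTime; a too-aggressive client simply gets disconnected by the server.
func keepAliveDialOpts(keepAliveTime, keepAliveTimeout time.Duration) []grpc.DialOption {
	var opts []grpc.DialOption
	if keepAliveTime > 0 {
		opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    keepAliveTime,
			Timeout: keepAliveTimeout,
		}))
	}
	return opts
}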

@gyuho gyuho force-pushed the keepalive branch 6 times, most recently from de2e92d to aaf3ae3 on July 21, 2017 22:05
@@ -143,6 +144,9 @@ func newConfig() *config {
fs.Int64Var(&cfg.QuotaBackendBytes, "quota-backend-bytes", cfg.QuotaBackendBytes, "Raise alarms when backend size exceeds the given quota. 0 means use the default quota.")
fs.UintVar(&cfg.MaxTxnOps, "max-txn-ops", cfg.MaxTxnOps, "Maximum number of operations permitted in a transaction.")
fs.UintVar(&cfg.MaxRequestBytes, "max-request-bytes", cfg.MaxRequestBytes, "Maximum client request size in bytes the server will accept.")
fs.DurationVar(&cfg.KeepAliveEnforcementPolicyMinTime, "keepalive-min-time", time.Duration(0), "Minimum interval duration that a client should wait before pinging server.")
Contributor

@gyuho @heyitsanthony

should we give a good default value? can we add a gRPC prefix to these flags?

@gyuho gyuho force-pushed the keepalive branch 2 times, most recently from 899b1ae to fcbb66d on July 24, 2017 16:12
@heyitsanthony
Contributor

@gyuho should grpc health-checking be kept to support #8308? http2 keepalives don't give any indication that the client should change endpoints and even watcher leader checking isn't enough for the general case (e.g., if a put fails on a partitioned member, it'd be nice if the client would try to switch endpoints)

@xiang90
Contributor

xiang90 commented Jul 26, 2017

@heyitsanthony

health checking is also helpful to detect an http2 blackhole (an application that can talk http2, but is not an etcd server).

@heyitsanthony
Contributor

@xiang90 yes, I think etcd needs both
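
For reference, a rough sketch of the kind of gRPC health check being discussed: an arbitrary http2 application (or a blackholed connection) will not answer the standard health RPC, so the client can tell it is not a live etcd server. This assumes the server registers the health service; the endpoint and timeout are illustrative.

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// checkEndpoint returns nil only if the endpoint answers the standard gRPC
// health RPC with SERVING; an arbitrary http2 application (or a blackholed
// connection) will fail here even though it accepts TCP/http2 traffic.
func checkEndpoint(endpoint string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, endpoint, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		return err
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		return err
	}
	if resp.Status != healthpb.HealthCheckResponse_SERVING {
		return fmt.Errorf("endpoint %s not serving: %s", endpoint, resp.Status)
	}
	return nil
}

func main() {
	if err := checkEndpoint("localhost:2379"); err != nil {
		fmt.Println("health check failed:", err)
	}
}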

@gyuho gyuho changed the title from "*: configure keep-alive policy from server, remove health check service" to "*: configure keep-alive policy from server" Jul 27, 2017
@gyuho gyuho removed the WIP label Jul 27, 2017
@heyitsanthony heyitsanthony left a comment (Contributor)

a clientv3/integration or e2e test to confirm that a watch will detect the dead connection and switch to another endpoint would be good for knowing this works

@@ -143,6 +144,9 @@ func newConfig() *config {
fs.Int64Var(&cfg.QuotaBackendBytes, "quota-backend-bytes", cfg.QuotaBackendBytes, "Raise alarms when backend size exceeds the given quota. 0 means use the default quota.")
fs.UintVar(&cfg.MaxTxnOps, "max-txn-ops", cfg.MaxTxnOps, "Maximum number of operations permitted in a transaction.")
fs.UintVar(&cfg.MaxRequestBytes, "max-request-bytes", cfg.MaxRequestBytes, "Maximum client request size in bytes the server will accept.")
fs.DurationVar(&cfg.GRPCKeepAliveEnforcementPolicyMinTime, "grpc-keepalive-min-time", 5*time.Second, "Minimum interval duration that a client should wait before pinging server.")
Contributor

this default should be set in embed.NewConfig()
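
i.e., roughly the following (the field and constant names are assumed from the flag above, not the actual embed code):

// Sketch only; field and constant names assumed from the flag above,
// not the actual embed package.
package embedsketch

import "time"

const DefaultGRPCKeepAliveMinTime = 5 * time.Second

type Config struct {
	GRPCKeepAliveMinTime time.Duration `json:"grpc-keepalive-min-time"`
}

// NewConfig seeds the default here, so flag definitions and embed users
// both pick it up from one place.
func NewConfig() *Config {
	return &Config{
		GRPCKeepAliveMinTime: DefaultGRPCKeepAliveMinTime,
	}
}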

embed/config.go Outdated
// sends goaway and closes the connection (too_many_pings). When too slow,
// nothing happens. Server expects client pings only when there is any
// active streams by setting 'PermitWithoutStream' false.
GRPCKeepAliveEnforcementPolicyMinTime time.Duration `json:"grpc-keepalive-min-time"`
Contributor

should the names match the json more closely? Like GRPCKeepAliveMinTime

etcdmain/help.go Outdated
@@ -54,6 +54,12 @@ member flags:
time (in milliseconds) of a heartbeat interval.
--election-timeout '1000'
time (in milliseconds) for an election to timeout. See tuning documentation for details.
--grpc-keepalive-interval '5s'
minimum duration interval that a client should wait before pinging server.
--grpc-keepalive-interval '0s'
Contributor

grpc-keepalive-interval already given above

@gyuho gyuho added the WIP label Jul 27, 2017
@gyuho gyuho force-pushed the keepalive branch 13 times, most recently from 43811e4 to 9c716ba on August 18, 2017 16:15
@gyuho
Contributor Author

gyuho commented Aug 18, 2017

@heyitsanthony I am tracking this down through grpc, and now the Go dialing code. The problem is that even if we blackhole the endpoint, the Go DialContext call succeeds without failure.

package integration_test

import (
	"context"
	"fmt"
	"testing"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/integration"
	"github.com/coreos/pkg/capnslog"
)

// package clause and imports added so the snippet compiles on its own
func TestDial(t *testing.T) {
	capnslog.SetGlobalLogLevel(capnslog.CRITICAL)
	clus := integration.NewClusterV3(t, &integration.ClusterConfig{
		Size: 3,
	})
	defer func() {
		capnslog.SetGlobalLogLevel(capnslog.CRITICAL)
		clus.Terminate(t)
	}()

	ccfg := clientv3.Config{
		Endpoints: []string{clus.Members[0].GRPCAddr()},
	}

	fmt.Println("blackhole-d 1", clus.Members[0].GRPCAddr())
	clus.Members[0].Blackhole()
	fmt.Println("blackhole-d 2", clus.Members[0].GRPCAddr())

	time.Sleep(3 * time.Second)

	capnslog.SetGlobalLogLevel(capnslog.INFO)
	cli, err := clientv3.New(ccfg)
	if err != nil {
		t.Fatal(err)
	}
	defer cli.Close()

	fmt.Println("PUT 1")
	if _, err := cli.Put(context.TODO(), "foo", "bar"); err != nil {
		t.Fatal(err)
	}
	fmt.Println("PUT 2")

	fmt.Println("DONE")
}

https://github.com/coreos/etcd/blob/master/clientv3/client.go#L244-L245

dialer := &net.Dialer{Timeout: t}
conn, err := dialer.DialContext(c.ctx, proto, host)

This always succeeds, even with a blackholed endpoint. The grpc-go http2 client then establishes a transport to the blackholed endpoint but fails on keepalive, so it keeps retrying. Keepalive fails because no frames are received, even though the initial connection succeeds.
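
A tiny standalone illustration of that last point (nothing etcd-specific): a listener that accepts connections but never sends a byte back still lets DialContext succeed, and only a later read (or an http2 keepalive) notices the endpoint is dead.

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// A "blackhole": accept TCP connections but never send a byte back.
	ln, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		panic(err)
	}
	go func() {
		var held []net.Conn // hold connections open but never respond
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			held = append(held, c)
		}
	}()

	// DialContext only needs the TCP handshake, so it succeeds anyway.
	d := &net.Dialer{Timeout: 3 * time.Second}
	conn, err := d.DialContext(context.Background(), "tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	fmt.Println("dial succeeded against a blackhole")

	// Only a later read (or an http2 keepalive) notices the endpoint is dead.
	_ = conn.SetReadDeadline(time.Now().Add(time.Second))
	_, rerr := conn.Read(make([]byte, 1))
	fmt.Println("read:", rerr) // i/o timeout
	conn.Close()
	ln.Close()
}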

@heyitsanthony
Contributor

Dial should be able to connect in this case and retry is OK if there's only one endpoint. Is there starvation when trying to failover to another endpoint?

@gyuho
Contributor Author

gyuho commented Aug 18, 2017

Dial should be able to connect in this case and retry is OK if there's only one endpoint.

Got it.

Is there starvation when trying to failover to another endpoint?

grpc-go, as long as the connectivity state is not Shutdown, will keep retrying, even on the blackholed endpoint. I am experimenting with manually flagging the blackholed endpoint so that the balancer does not switch to blackholed ones.

EDIT: The connection works fine even with the blackholed endpoint, and grpc-go never calls Up on the blackholed endpoint, as long as we mark it as non-active and trigger tearDown on that endpoint by calling notifyAddrs.

TODO: maybe another PR

  1. test Unblackhole
  2. use health service

@gyuho gyuho force-pushed the keepalive branch 2 times, most recently from 97bfa47 to e5d8187 on August 19, 2017 19:21
@gyuho
Contributor Author

gyuho commented Aug 21, 2017

Closing. This needs more research into the grpc-go logic. Will open a new PR soon.

@gyuho gyuho closed this Aug 21, 2017
@gyuho gyuho deleted the keepalive branch August 24, 2017 04:06