*: use gRPC server GracefulStop #7743

gyuho · 2017-04-14T16:27:22Z

Example output

2017-04-14 09:20:53.599484 I | etcdserver/api: enabled capabilities for version 3.2
^C2017-04-14 09:20:55.974829 N | pkg/osutil: received interrupt signal, shutting down...
2017-04-14 09:20:55.974857 I | etcdserver: skipped leadership transfer for single member cluster
2017-04-14 09:20:55.974875 W | embed: gracefully stopping gRPC server
2017-04-14 09:20:55.974883 W | embed: gracefully stopped gRPC server
2017-04-14 09:20:55.975058 W | embed: server stopped with "grpc: the server has been stopped"
2017-04-14 09:20:55.975482 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975503 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975523 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975540 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975559 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975598 N | embed: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
2017-04-14 09:20:55.975628 W | embed: server stopped with "accept tcp 127.0.0.1:2379: use of closed network connection"
2017-04-14 09:20:55.975667 W | embed: server stopped with "mux: listener closed"
2017-04-14 09:20:55.975712 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975729 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975743 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}

Fix #7322.

heyitsanthony · 2017-04-14T16:37:36Z

This should fix the inflight op crashes, right? Can there be a test?

gyuho · 2017-04-14T16:39:02Z

@heyitsanthony Yes, I will try to verify that this fixes the problem, either by adding a test or by reproducing it.

gyuho · 2017-04-14T18:13:12Z

@heyitsanthony Test added. Confirmed that it fixes the issue: with reqN := 500 in the test and the graceful stop part commented out, it panics in boltdb. PTAL.

heyitsanthony · 2017-04-14T18:15:01Z

embed/serve.go

@@ -52,11 +52,12 @@ type serveCtx struct {

 	userHandlers    map[string]http.Handler
 	serviceRegister func(*grpc.Server)
+	stopGRPCc       chan func()


grpcServers []*grpc.Server

heyitsanthony · 2017-04-14T18:15:32Z

embed/serve.go

@@ -74,6 +75,12 @@ func (sctx *serveCtx) serve(s *etcdserver.EtcdServer, tlscfg *tls.Config, handle

 	if sctx.insecure {
 		gs := v3rpc.Server(s, nil)
+		sctx.stopGRPCc <- func() {


sctx.grpcServers = append(sctx.grpcServers, gs)

heyitsanthony · 2017-04-14T18:15:37Z

embed/serve.go

@@ -103,6 +110,12 @@ func (sctx *serveCtx) serve(s *etcdserver.EtcdServer, tlscfg *tls.Config, handle

 	if sctx.secure {
 		gs := v3rpc.Server(s, tlscfg)
+		sctx.stopGRPCc <- func() {


sctx.grpcServers = append(sctx.grpcServers, gs)

heyitsanthony · 2017-04-14T18:16:29Z

etcdserver/config.go

@@ -61,6 +61,12 @@ type ServerConfig struct {
 	ClientCertAuthEnabled bool

 	AuthToken string
+
+	// OnShutdown gracefully stops gRPC server on shutdown.


this should be a generic thing instead of talking about grpc, etc

// OnShutdown is called immediately before releasing etcd server resources.

heyitsanthony · 2017-04-14T18:19:40Z

integration/v3_grpc_inflight_test.go

@@ -75,3 +76,38 @@ func TestV3MaintenanceDefragmentInflightRange(t *testing.T) {

 	<-donec
 }
+


there were some changes to the mvcc code so TestV3MaintenanceHashInflight would appear to work. Namely, TestStoreHashAfterForceCommit and the stopc logic in Hash should probably be removed.

heyitsanthony · 2017-04-14T18:20:15Z

integration/v3_grpc_inflight_test.go

+	kvc := toGRPC(cli).KV
+
+	if _, err := kvc.Put(context.Background(), &pb.PutRequest{Key: []byte("foo"), Value: []byte("bar")}); err != nil {
+		panic(err)


t.Fatal(err)

heyitsanthony · 2017-04-14T18:22:40Z

embed/etcd.go

@@ -137,6 +139,11 @@ func StartEtcd(inCfg *Config) (e *Etcd, err error) {
 	if err = e.serve(); err != nil {
 		return
 	}
+	e.Server.Cfg.OnShutdown = func() {


OnShutdown = func() {
	for _, sctx := range e.sctxs {
		for _, gs := range sctx.grpcServers {
			gs.GracefulStop()
		}
	}
}

I think this does not sync with the serve routine? We populate sctx.grpcServers
in (e *Etcd) serve(), which calls (sctx *serveCtx) serve, which creates the *grpc.Server.
But (e *Etcd) serve() returns right after launching (sctx *serveCtx) serve in goroutines.

nvm, we somehow need a way to sync that slice anyway.
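
One possible way to sync that set of servers is sketched below. Each serve goroutine publishes the servers it creates on a buffered channel and closes it when done, so the shutdown path can simply range over the channel. The names `stopper`, `fakeServer`, `serversC`, and `stopAll` are hypothetical stand-ins for illustration, not etcd or gRPC APIs; only the `GracefulStop` method name mirrors `*grpc.Server`.

```go
package main

import "fmt"

// stopper is a stand-in for *grpc.Server; only the method used here is modeled.
type stopper interface {
	GracefulStop()
}

// fakeServer records how many times it was stopped (hypothetical test double).
type fakeServer struct {
	stopped int
}

func (f *fakeServer) GracefulStop() { f.stopped++ }

// serveCtx is a hypothetical, reduced version of the serve context.
type serveCtx struct {
	// serversC is buffered and is closed by serve once every server it
	// created has been published.
	serversC chan stopper
}

func newServeCtx() *serveCtx {
	return &serveCtx{serversC: make(chan stopper, 2)}
}

// serve runs in its own goroutine. It publishes each server and then closes
// the channel so that a reader knows the set is complete.
func (sctx *serveCtx) serve(servers ...stopper) {
	defer close(sctx.serversC)
	for _, s := range servers {
		sctx.serversC <- s
	}
}

// stopAll blocks until every serve goroutine has closed its channel, stopping
// each server it receives. It returns the number of servers stopped.
func stopAll(sctxs []*serveCtx) int {
	n := 0
	for _, sctx := range sctxs {
		for s := range sctx.serversC {
			s.GracefulStop()
			n++
		}
	}
	return n
}

func main() {
	insecure, secure := &fakeServer{}, &fakeServer{}
	sctx := newServeCtx()
	go sctx.serve(insecure, secure)
	fmt.Println("stopped servers:", stopAll([]*serveCtx{sctx}))
}
```

Because the reader ranges until the channel is closed, it cannot miss a server that the serve goroutine has not published yet, which is the race a plain slice would have.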

heyitsanthony · 2017-04-14T18:23:59Z

embed/etcd.go

@@ -343,6 +350,10 @@ func (e *Etcd) serve() (err error) {
 }

 func (e *Etcd) errHandler(err error) {
+	if transport.IsClosedConnError(err) || err == grpc.ErrServerStopped {


why is this necessary? shouldn't stopc be closed before calling etcdserver.Stop?

// in embed/etcd.go
func StartEtcd(inCfg *Config) (e *Etcd, err error) {
	if err = inCfg.Validate(); err != nil {
		return nil, err
	}
	e = &Etcd{cfg: *inCfg, stopc: make(chan struct{})}
	cfg := &e.cfg
	defer func() {
		if e != nil && err != nil {
			e.Close()
			e = nil
		}
	}()

We close stopc by calling e.Close() here, but StartEtcd returns with nil error, so it is not called in our use case?

heyitsanthony · 2017-04-14T18:28:30Z

etcdserver/server.go

+		// stop accepting new connections, RPCs,
+		// and blocks until all pending RPCs are finished
+		if s.Cfg != nil && s.Cfg.OnShutdown != nil {
+			s.Cfg.OnShutdown()


I think it's possible to avoid having this in Cfg entirely-- this function could be called prior to calling HardStop/Stop; similar to how the listeners are closed in embed.Etcd.Close() before calling Server.Stop()

Ok let me re-organize the code.
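
The ordering being discussed can be sketched as follows, assuming a reduced stand-in for `embed.Etcd`. The type `etcd`, its function fields, and the `steps` log are hypothetical and exist only to make the order observable; running every step inside the `sync.Once` is a choice made for this sketch, not necessarily what the PR does.

```go
package main

import (
	"fmt"
	"sync"
)

// etcd is a hypothetical, reduced stand-in for embed.Etcd. The function
// fields model the shutdown steps whose order matters.
type etcd struct {
	closeOnce    sync.Once
	stopc        chan struct{}
	gracefulStop func() // drain in-flight RPCs
	stopServer   func() // release server resources such as the backend
	steps        []string
}

func newEtcd() *etcd {
	e := &etcd{stopc: make(chan struct{})}
	e.gracefulStop = func() { e.steps = append(e.steps, "graceful-stop") }
	e.stopServer = func() { e.steps = append(e.steps, "server-stop") }
	return e
}

// Close signals stopc first, then drains RPCs, and only then stops the
// server. The sync.Once makes repeated calls harmless.
func (e *etcd) Close() {
	e.closeOnce.Do(func() {
		close(e.stopc)
		e.steps = append(e.steps, "close-stopc")
		e.gracefulStop()
		e.stopServer()
	})
}

func main() {
	e := newEtcd()
	e.Close()
	e.Close() // second call is a no-op
	fmt.Println(e.steps)
}
```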

gyuho · 2017-04-15T01:45:24Z

I think we still need GracefulStop in etcdserver because embed.Etcd.Close won't be triggered unless there's an error at the beginning of serving. And etcdserver.EtcdServer.Stop is the handler registered for OS interrupt signals.

// etcdmain/etcd.go

// startEtcd runs StartEtcd in addition to hooks needed for standalone etcd.
func startEtcd(cfg *embed.Config) (<-chan struct{}, <-chan error, error) {
	if cfg.Metrics == "extensive" {
		grpc_prometheus.EnableHandlingTimeHistogram()
	}

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		return nil, nil, err
	}
	osutil.RegisterInterruptHandler(e.Server.Stop)

heyitsanthony · 2017-04-15T02:12:08Z

why not have osutil.RegisterInterruptHandler(e.Close)?

codecov-io · 2017-04-15T16:23:30Z

Codecov Report

❗ No coverage uploaded for pull request base (master@0d52598).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master    #7743   +/-   ##
=========================================
  Coverage          ?   75.73%           
=========================================
  Files             ?      331           
  Lines             ?    26058           
  Branches          ?        0           
=========================================
  Hits              ?    19735           
  Misses            ?     4899           
  Partials          ?     1424

Impacted Files             Coverage Δ
mvcc/kvstore.go            87.89% <ø> (ø)
embed/etcd.go              67.89% <100%> (ø)
embed/serve.go             74.33% <100%> (ø)
etcdmain/etcd.go           45.49% <100%> (ø)
integration/cluster.go     85.51% <100%> (ø)
mvcc/backend/batch_tx.go   78.57% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0d52598...5000d29.

heyitsanthony · 2017-04-17T16:38:59Z

embed/etcd.go

@@ -147,6 +161,7 @@ func (e *Etcd) Config() Config {

 func (e *Etcd) Close() {
 	e.closeOnce.Do(func() { close(e.stopc) })
+	e.OnShutdown()


can the function be inlined here instead of needing a separate OnShutdown field?

heyitsanthony · 2017-04-17T16:39:25Z

embed/etcd.go

+		// RPCs, and blocks until all pending RPCs are finished
+		for _, sctx := range e.sctxs {
+			for gs := range sctx.grpcServerC {
+				plog.Warning("gracefully stopping gRPC server")


don't warn / print anything? this should be part of the normal shutdown process

heyitsanthony · 2017-04-17T16:43:30Z

mvcc/backend/batch_tx.go

-			}
+			// t.tx.DB()==nil if 'CommitAndStop' calls 'batchTx.commit(true)',
+			// which initializes *bolt.Tx.db and *bolt.Tx.meta as nil; panics t.tx.Size().
+			// Server must make sure 'batchTx.commit(false)' does not follow


This probably shouldn't mention the etcd server or gRPC. The contract is independent of all that-- don't have any operations inflight when closing the backend.
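
One way a caller could honor that contract is sketched below: a small gate that tracks in-flight operations, rejects new ones once closing starts, and waits for running ones to finish before the backend would be closed. `opGate`, `Do`, `CloseAndWait`, and `errClosed` are hypothetical names for illustration; the PR itself relies on gRPC's `GracefulStop` for this, not on a gate like this one.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errClosed = errors.New("gate closed")

// opGate is a hypothetical caller-side guard: no operation may start after
// CloseAndWait begins, and CloseAndWait returns only when none is running.
type opGate struct {
	mu       sync.Mutex
	closed   bool
	inflight sync.WaitGroup
}

// Do runs op unless the gate is already closed.
func (g *opGate) Do(op func()) error {
	g.mu.Lock()
	if g.closed {
		g.mu.Unlock()
		return errClosed
	}
	// Add happens under the lock, so it is ordered before any Wait.
	g.inflight.Add(1)
	g.mu.Unlock()
	defer g.inflight.Done()
	op()
	return nil
}

// CloseAndWait rejects new operations, then blocks until running ones finish.
func (g *opGate) CloseAndWait() {
	g.mu.Lock()
	g.closed = true
	g.mu.Unlock()
	g.inflight.Wait()
}

func main() {
	var g opGate
	started := make(chan struct{})
	release := make(chan struct{})
	finished := false

	go g.Do(func() {
		close(started) // the operation is now in flight
		<-release
		finished = true
	})

	<-started
	close(release)
	g.CloseAndWait() // only here would it be safe to close the backend

	fmt.Println(finished, g.Do(func() {}))
}
```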

heyitsanthony · 2017-04-17T16:45:50Z

integration/v3_grpc_inflight_test.go

-	mvc := toGRPC(cli).Maintenance
-	mvc.Defragment(context.Background(), &pb.DefragmentRequest{})
+	// simulate 'embed.Etcd.Close()' with '*grpc.Server.GracefulStop'
+	clus.Members[0].grpcServer.GracefulStop()


clus.Members[0].Stop()

heyitsanthony · 2017-04-17T16:47:11Z

mvcc/kvstore_test.go

@@ -518,20 +518,6 @@ func newTestKeyBytes(rev revision, tombstone bool) []byte {
 	return bytes
 }

-// TestStoreHashAfterForceCommit ensures that later Hash call to
-// closed backend with ForceCommit does not panic.
-func TestStoreHashAfterForceCommit(t *testing.T) {


also remove the select in mvcc.store.Hash, which was faking this


FingerLiu · 2021-04-07T11:05:05Z

@gyuho is there a workaround in early versions?

gyuho requested a review from heyitsanthony April 14, 2017 16:27

gyuho added the WIP label Apr 14, 2017

gyuho force-pushed the shutdown-grpc-server branch from 31597ea to 6e207e4 Compare April 14, 2017 16:35

gyuho force-pushed the shutdown-grpc-server branch 3 times, most recently from 32d93e3 to 6686634 Compare April 14, 2017 18:12

gyuho removed the WIP label Apr 14, 2017

heyitsanthony suggested changes Apr 14, 2017

gyuho force-pushed the shutdown-grpc-server branch 3 times, most recently from f62b98b to 6779d6d Compare April 15, 2017 01:22

gyuho added the WIP label Apr 15, 2017

gyuho force-pushed the shutdown-grpc-server branch from 6779d6d to bb673bf Compare April 15, 2017 02:57

gyuho removed the WIP label Apr 15, 2017

gyuho closed this Apr 15, 2017

gyuho force-pushed the shutdown-grpc-server branch from bb673bf to e2d0db9 Compare April 15, 2017 14:47

gyuho reopened this Apr 15, 2017

gyuho force-pushed the shutdown-grpc-server branch from d8ebdd3 to 5aea157 Compare April 15, 2017 14:52

heyitsanthony suggested changes Apr 17, 2017

gyuho added the WIP label Apr 17, 2017

gyuho added 4 commits April 17, 2017 14:07

etcdmain: trigger embed.Etcd.Close for OS interrupt

ea5f6da

Signed-off-by: Gyu-Ho Lee <[email protected]>

embed: gracefully shut down gRPC server

c407e09

Fix etcd-io#7322. Signed-off-by: Gyu-Ho Lee <[email protected]>

integration: test 'inflight' range requests

472a536

- Test etcd-io#7322. - Remove test case added in etcd-io#6662. Signed-off-by: Gyu-Ho Lee <[email protected]>

Revert "mvcc: test inflight Hash to trigger Size on nil db"

cd470f9

This reverts commit 994e8e4. Since now etcdserver gracefully shuts down the gRPC server

gyuho added 2 commits April 17, 2017 14:17

mvcc/backend: remove t.tx.DB()==nil checks with GracefulStop

8ffd58f

Revert etcd-io#6662. Signed-off-by: Gyu-Ho Lee <[email protected]>

mvcc: remove stopc select case in Hash

5000d29

Revert change in etcd-io@33acbb6. Signed-off-by: Gyu-Ho Lee <[email protected]>

gyuho force-pushed the shutdown-grpc-server branch from 5aea157 to 5000d29 Compare April 17, 2017 21:20

gyuho removed the WIP label Apr 17, 2017

heyitsanthony approved these changes Apr 17, 2017

gyuho merged commit e771c60 into etcd-io:master Apr 18, 2017

gyuho deleted the shutdown-grpc-server branch April 18, 2017 00:12

gyuho added the kind/release-note label Apr 18, 2017

gyuho mentioned this pull request Apr 18, 2017

*: use Go 1.9+ #6174

Closed

26 tasks

heyitsanthony mentioned this pull request May 3, 2017

TestV3MaintenanceHashInflight: unexpected fault address #6989

Closed

gyuho mentioned this pull request Jun 16, 2017

ETCD panics on bolt assertion #8118

Closed

abel-von mentioned this pull request Sep 21, 2017

the kube-apiserver UT failed because of the mock etcd server panic #8589

Closed

gyuho removed the kind/need-release-note-v3.3 label Nov 22, 2017
