boltdb panic while removing member from cluster #7322
Comments
Can you easily reproduce this?
@xiang90 Yes, but we don't have a small reproducer yet.
OK. It seems like we closed the db without waiting for all transactions to finish. We will take a look. Thank you for reporting.
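The bolt behavior behind that panic can be shown on its own. Below is a minimal sketch, not from the issue, assuming the github.com/boltdb/bolt import path used by etcd at this version; the file name and bucket name are arbitrary. Once the batch transaction has been committed (or rolled back) by the closing backend, any later use of it trips bolt's internal assertion and panics with `assertion failed: tx closed`.

```go
package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("demo.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Writable transaction, standing in for etcd's batchTx.
	tx, err := db.Begin(true)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := tx.CreateBucketIfNotExists([]byte("key")); err != nil {
		log.Fatal(err)
	}

	// The closing backend commits the batch tx; after Commit (or Rollback)
	// the tx is detached from the DB.
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}

	// An in-flight caller that still holds the old tx then panics inside bolt:
	//   panic: assertion failed: tx closed
	_ = tx.Bucket([]byte("key"))
}
```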
We ran into this without removing a member from the cluster. Unfortunately, I do not have concrete repro steps, but we got the same panic after AWS restarted the machines hosting our etcd cluster. We will post more information if we are able to reproduce this failure mode.
If there's an in-flight request to the backend, e.g. here https://github.com/coreos/etcd/blob/master/mvcc/backend/batch_tx.go#L87-L88:

```go
// UnsafeRange must be called holding the lock on the tx.
func (t *batchTx) UnsafeRange(bucketName []byte, key, endKey []byte, limit int64) (keys [][]byte, vs [][]byte) {
	// in-flight db close:
	// t.tx.DB() == nil, so it will panic with
	// panic: assertion failed: tx closed
	bucket := t.tx.Bucket(bucketName)
```

Simple test case:

```go
func TestV3KVInflightRangeOnClosedBackend(t *testing.T) {
	defer testutil.AfterTest(t)
	clus := NewClusterV3(t, &ClusterConfig{Size: 1})
	defer clus.Terminate(t)

	cli := clus.RandClient()
	kvc := toGRPC(cli).KV
	if _, err := kvc.Put(context.Background(), &pb.PutRequest{Key: []byte("foo"), Value: []byte("bar")}); err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	donec := make(chan struct{})
	go func() {
		defer close(donec)
		kvc.Range(ctx, &pb.RangeRequest{Key: []byte("foo"), Serializable: true}, grpc.FailFast(false))
	}()

	// close the backend directly, instead of clus.Members[0].s.HardStop(),
	// to simulate an in-flight db close
	clus.Members[0].s.Backend().Close()
	cancel()
	<-donec
}
```

Any thoughts?
This is the same problem as #6662; that change never really fixed the underlying issue of crashing on in-flight backend operations.
Is there a way to stop gRPC and drain pending requests?
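For reference, grpc-go's *grpc.Server does provide GracefulStop, which stops accepting new RPCs and blocks until pending ones finish. A rough sketch of the ordering this question is getting at; drainThenClose and the io.Closer backend parameter are made-up names, not etcd's actual shutdown path:

```go
package shutdown

import (
	"io"

	"google.golang.org/grpc"
)

// drainThenClose stops the gRPC server before touching the storage backend:
// GracefulStop stops accepting new RPCs and blocks until in-flight ones
// (such as a pending Range) have completed, so the later Close cannot race
// with a read.
func drainThenClose(srv *grpc.Server, backend io.Closer) error {
	srv.GracefulStop()
	return backend.Close()
}
```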
Fix etcd-io#7322. Signed-off-by: Gyu-Ho Lee <[email protected]>
- Test etcd-io#7322. - Remove test case added in etcd-io#6662. Signed-off-by: Gyu-Ho Lee <[email protected]>
We have a 3-node cluster running in embed mode. If we try to remove node 2 from the cluster (request sent from node 1), we see a panic in the logs of node 2:
Node 2 is probably making some KV requests during the process (from our side).
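The sequence described here is member removal racing with ongoing KV traffic from the node being removed. Below is a rough sketch of that sequence using clientv3 rather than the embed API from this report; the endpoint, the member name "node2", and the key are illustrative assumptions, not values from the issue.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Talk to node 1, as in the report. The endpoint is made up.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://node1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Keep KV traffic flowing while the membership change happens.
	go func() {
		for {
			cli.Put(context.Background(), "foo", "bar")
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Remove node 2 while requests may still be in flight on it.
	resp, err := cli.MemberList(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range resp.Members {
		if m.Name == "node2" {
			if _, err := cli.MemberRemove(context.Background(), m.ID); err != nil {
				log.Fatal(err)
			}
		}
	}
}
```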
etcd: v3.1.0 (8ba2897a21e4fc51b298ca553d251318425f93ae)
bolt: e9cf4fae01b5a8ff89d0ec6b32f0d9c9f79aefdd