boltdb panic while removing member from cluster #7322
Comments
Can you easily reproduce this?
@xiang90 Yes, but we don't have a small reproducer yet.
OK. It seems like we closed the db without waiting for all transactions to finish. We will take a look. Thank you for reporting.
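The bolt behavior behind that panic can be shown on its own. Below is a minimal sketch, not from the issue, assuming the github.com/boltdb/bolt import path used by etcd at this version; the file name and bucket name are arbitrary. Once the batch transaction has been committed (or rolled back) by the closing backend, any later use of it trips bolt's internal assertion and panics with `assertion failed: tx closed`.

```go
package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("demo.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Writable transaction, standing in for etcd's batchTx.
	tx, err := db.Begin(true)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := tx.CreateBucketIfNotExists([]byte("key")); err != nil {
		log.Fatal(err)
	}

	// The closing backend commits the batch tx; after Commit (or Rollback)
	// the tx is detached from the DB.
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}

	// An in-flight caller that still holds the old tx then panics inside bolt:
	//   panic: assertion failed: tx closed
	_ = tx.Bucket([]byte("key"))
}
```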
We ran into this without removing a member from the cluster. Unfortunately, I do not have concrete repro steps, but we got the same panic after AWS restarted the machines hosting our etcd cluster. We will post more information if we are able to reproduce this failure mode.
If there's an in-flight request to the backend, e.g. here https://github.com/coreos/etcd/blob/master/mvcc/backend/batch_tx.go#L87-L88:

```go
// UnsafeRange must be called holding the lock on the tx.
func (t *batchTx) UnsafeRange(bucketName []byte, key, endKey []byte, limit int64) (keys [][]byte, vs [][]byte) {
	// in-flight db close:
	// t.tx.DB() == nil, so it will panic with
	// panic: assertion failed: tx closed
	bucket := t.tx.Bucket(bucketName)
```

Simple test case:

```go
func TestV3KVInflightRangeOnClosedBackend(t *testing.T) {
	defer testutil.AfterTest(t)
	clus := NewClusterV3(t, &ClusterConfig{Size: 1})
	defer clus.Terminate(t)

	cli := clus.RandClient()
	kvc := toGRPC(cli).KV
	if _, err := kvc.Put(context.Background(), &pb.PutRequest{Key: []byte("foo"), Value: []byte("bar")}); err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	donec := make(chan struct{})
	go func() {
		defer close(donec)
		kvc.Range(ctx, &pb.RangeRequest{Key: []byte("foo"), Serializable: true}, grpc.FailFast(false))
	}()

	// close the backend directly, instead of clus.Members[0].s.HardStop(),
	// to simulate an in-flight db close
	clus.Members[0].s.Backend().Close()
	cancel()
	<-donec
}
```

Any thoughts?
This is the same problem as #6662; that change never really fixed the underlying issue of crashing on in-flight backend operations.
Is there a way to stop gRPC and drain pending requests?
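For reference, grpc-go's *grpc.Server does provide GracefulStop, which stops accepting new RPCs and blocks until pending ones finish. A rough sketch of the ordering this question is getting at; drainThenClose and the io.Closer backend parameter are made-up names, not etcd's actual shutdown path:

```go
package shutdown

import (
	"io"

	"google.golang.org/grpc"
)

// drainThenClose stops the gRPC server before touching the storage backend:
// GracefulStop stops accepting new RPCs and blocks until in-flight ones
// (such as a pending Range) have completed, so the later Close cannot race
// with a read.
func drainThenClose(srv *grpc.Server, backend io.Closer) error {
	srv.GracefulStop()
	return backend.Close()
}
```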
Fix etcd-io#7322. Signed-off-by: Gyu-Ho Lee <[email protected]>
- Test etcd-io#7322. - Remove test case added in etcd-io#6662. Signed-off-by: Gyu-Ho Lee <[email protected]>
We have a 3-node cluster running in embed mode. If we try to remove node 2 from the cluster (request sent from node 1), we see a panic in the logs of node 2:
Node 2 is probably making some KV requests during the process (from our side).
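The sequence described here is member removal racing with ongoing KV traffic from the node being removed. Below is a rough sketch of that sequence using clientv3 rather than the embed API from this report; the endpoint, the member name "node2", and the key are illustrative assumptions, not values from the issue.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Talk to node 1, as in the report. The endpoint is made up.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://node1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Keep KV traffic flowing while the membership change happens.
	go func() {
		for {
			cli.Put(context.Background(), "foo", "bar")
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Remove node 2 while requests may still be in flight on it.
	resp, err := cli.MemberList(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range resp.Members {
		if m.Name == "node2" {
			if _, err := cli.MemberRemove(context.Background(), m.ID); err != nil {
				log.Fatal(err)
			}
		}
	}
}
```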
etcd: v3.1.0 (8ba2897a21e4fc51b298ca553d251318425f93ae)
bolt: e9cf4fae01b5a8ff89d0ec6b32f0d9c9f79aefdd