PD is corrupting `etcd` database on restart #8547

faelau · 2024-08-19T15:30:02Z

Bug Report

If you restart a PD pod, you receive the following panic:

[2024/08/19 15:16:25.624 +00:00] [WARN] [server.go:297] ["exceeded recommended request limit"] [max-request-bytes=157286400] [max-request-size="157 MB"] [recommended-request-bytes=10485760] [recommended-request-size="10 MB"]
2024-08-19 15:16:25.624904 W | pkg/fileutil: check file permission: directory "/var/lib/pd" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
[2024/08/19 15:16:25.636 +00:00] [PANIC] [backend.go:173] ["failed to open database"] [path=/var/lib/pd/member/snap/db] [error="invalid database"]
panic: failed to open database
goroutine 251 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x2?, 0x2?, {0x0?, 0x0?, 0xc0001364a0?})
	/root/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0006f52b0, {0xc0012b9980, 0x2, 0x2})
	/root/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc001299f80?, {0x304f490?, 0x16?}, {0xc0012b9980, 0x2, 0x2})
	/root/go/pkg/mod/go.uber.org/[email protected]/logger.go:285 +0x51
go.etcd.io/etcd/mvcc/backend.newBackend({{0xc001299f80, 0x1a}, 0x5f5e100, 0x2710, {0x30191e2, 0x5}, 0x233333333, 0xc000053980, 0x0})
	/root/go/pkg/mod/go.etcd.io/[email protected]/mvcc/backend/backend.go:173 +0x35c
go.etcd.io/etcd/mvcc/backend.New(...)
	/root/go/pkg/mod/go.etcd.io/[email protected]/mvcc/backend/backend.go:151
go.etcd.io/etcd/etcdserver.newBackend({{0x7ffd58f397d8, 0xe}, {0x0, 0x0}, {0x0, 0x0}, {0xc0003086c0, 0x1, 0x1}, {0xc000308480, ...}, ...})
	/root/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/backend.go:53 +0x3b0
go.etcd.io/etcd/etcdserver.openBackend.func1()
	/root/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/backend.go:74 +0x45
created by go.etcd.io/etcd/etcdserver.openBackend in goroutine 1
	/root/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/backend.go:73 +0x106

The PVC hat the cephfs.csi.ceph.com provisioner. The cluster is running on microk8s.

Checking the etcd database with bbolt, digging a bit deeper results in the following error:

$ ./go/bin/bbolt page --all --format-value=redacted db
cannot read number of pages: the Meta Page has wrong (unexpected) magic

What did you do?

Create a new TidbCluster:

apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: surrealdb
spec:
  version: v8.2.0
  timezone: UTC
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    replicas: 1
    maxFailoverCount: 0
    mountClusterClientSecret: true
    storageClassName: csi-cephfs-sc
    requests:
      storage: "16Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    maxFailoverCount: 0
    evictLeaderTimeout: 1m
    replicas: 3
    storageClassName: csi-cephfs-sc
    requests:
      storage: "16Gi"
    config:
      storage:
        reserve-space: "0MB"
      rocksdb:
        max-open-files: 256
      raftdb:
        max-open-files: 256
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 5
    service:
      type: ClusterIP
    config: {}

Restart PD pod (e.g. if you drain a node on updating Kubernetes)
Getting the panic

What did you expect to see?

pd not corrupting the etcd database.

What did you see instead?

A panic of the PD container because the etcd database is corrupted.

What version of PD are you using (`pd-server -V`)?

[root@surrealdb-pd-0 /]# ./pd-server -V
Release Version: v8.2.0
Edition: Community
Git Commit Hash: c0ee2cd6c2eea7ad9372cc5bd00f6774abad6834
Git Branch: HEAD
UTC Build Time:  2024-07-04 09:39:38

The text was updated successfully, but these errors were encountered:

faelau added the type/bug The issue is confirmed as a bug. label Aug 19, 2024

jebter added the severity/major label Aug 30, 2024

ti-chi-bot bot added may-affects-5.4 may-affects-6.1 may-affects-6.5 may-affects-7.1 may-affects-7.5 may-affects-8.1 labels Aug 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PD is corrupting `etcd` database on restart #8547

PD is corrupting `etcd` database on restart #8547

faelau commented Aug 19, 2024

PD is corrupting etcd database on restart #8547

PD is corrupting etcd database on restart #8547

Comments

faelau commented Aug 19, 2024

Bug Report

What did you do?

What did you expect to see?

What did you see instead?

What version of PD are you using (pd-server -V)?

PD is corrupting `etcd` database on restart #8547

PD is corrupting `etcd` database on restart #8547

What version of PD are you using (`pd-server -V`)?