TiDB panics after repeatedly injecting network loss chaos #33265

Closed
mayjiang0203 opened this issue Mar 21, 2022 · 10 comments · Fixed by tikv/client-go#453 or #33290
Labels: affects-6.0, found/automation, severity/critical, type/bug

mayjiang0203 commented Mar 21, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Running test case oltp_rel_008_004 in test plan endless-oltp-tpcc-large-raft-engine-rel@main.

Run the workload, then repeatedly inject network loss chaos into the GC leader as follows:

[2022/03/18 00:43:17.831 +08:00] [INFO] [chaos.go:358] ["fault will last for"] [duration=2m36s]
[2022/03/18 00:43:17.836 +08:00] [INFO] [chaos.go:86] ["Run chaos"] [name="gc leader network loss"] [selectors="[endless-oltp-tps-661868-1-536/tc-tidb-1]"] [experiment="{"Duration":"","Scheduler":null,"Loss":"75","Correlation":"50"}"]
[2022/03/18 00:45:53.861 +08:00] [INFO] [chaos.go:151] ["Clean chaos"] [name="gc leader network loss"] [chaosId="ns=endless-oltp-tps-661868-1-536,kind=network-loss,name=network-loss-jnmydpgf,spec=&k8s.ChaosIdentifier{Namespace:"endless-oltp-tps-661868-1-536", Name:"network-loss-jnmydpgf", Spec:NetworkLossSpec{Duration: "", Scheduler: , Loss: "75", Correlation: "50"}}"]

[2022/03/18 00:46:08.872 +08:00] [INFO] [chaos.go:358] ["fault will last for"] [duration=3m31s]
[2022/03/18 00:46:08.877 +08:00] [INFO] [chaos.go:86] ["Run chaos"] [name="gc leader network loss"] [selectors="[endless-oltp-tps-661868-1-536/tc-tidb-1]"] [experiment="{"Duration":"","Scheduler":null,"Loss":"75","Correlation":"50"}"]
[2022/03/18 00:49:39.891 +08:00] [INFO] [chaos.go:151] ["Clean chaos"] [name="gc leader network loss"] [chaosId="ns=endless-oltp-tps-661868-1-536,kind=network-loss,name=network-loss-lfgkopjb,spec=&k8s.ChaosIdentifier{Namespace:"endless-oltp-tps-661868-1-536", Name:"network-loss-lfgkopjb", Spec:NetworkLossSpec{Duration: "", Scheduler: , Loss: "75", Correlation: "50"}}"]
[2022/03/18 00:50:53.903 +08:00] [INFO] [chaos.go:358] ["fault will last for"] [duration=2m29s]
[2022/03/18 00:50:53.908 +08:00] [INFO] [chaos.go:86] ["Run chaos"] [name="gc leader network loss"] [selectors="[endless-oltp-tps-661868-1-536/tc-tidb-1]"] [experiment="{"Duration":"","Scheduler":null,"Loss":"75","Correlation":"50"}"]
[2022/03/18 00:53:22.921 +08:00] [INFO] [chaos.go:151] ["Clean chaos"] [name="gc leader network loss"] [chaosId="ns=endless-oltp-tps-661868-1-536,kind=network-loss,name=network-loss-wqmyzxbr,spec=&k8s.ChaosIdentifier{Namespace:"endless-oltp-tps-661868-1-536", Name:"network-loss-wqmyzxbr", Spec:NetworkLossSpec{Duration: "", Scheduler: , Loss: "75", Correlation: "50"}}"]

2. What did you expect to see? (Required)

No panic or OOM occurs.

3. What did you see instead (Required)

TiDB panicked. [screenshot of the panic attached in the original issue]

4. What is your TiDB version? (Required)

[2022/03/17 23:28:30.616 +08:00] [INFO] [client.go:376] ["Cluster version information"] [type=tikv] [version=6.0.0-alpha] [git_hash=819ac9d64f22eb346764329a30cdeac3570e6cec]
[2022/03/17 23:28:30.616 +08:00] [INFO] [client.go:376] ["Cluster version information"] [type=pd] [version=6.0.0-nightly] [git_hash=e278c6c3d83087001843a596834fd2eb080ad281]
[2022/03/17 23:28:30.616 +08:00] [INFO] [client.go:376] ["Cluster version information"] [type=tidb] [version=6.0.0-nightly] [git_hash=d5867b1dba5bea3433ebe6b9eb17ba63bb6e3e74]

mayjiang0203 added the type/bug label Mar 21, 2022

mayjiang0203 commented Mar 21, 2022

/type bug
/found automation
/severity critical

ti-chi-bot added the found/automation label Mar 21, 2022
cfzjywxk self-assigned this Mar 21, 2022
ti-chi-bot added the severity/critical, may-affects-4.0, may-affects-5.0, may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-5.4, and may-affects-6.0 labels Mar 21, 2022
cfzjywxk removed the may-affects-4.0, may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-5.4, and may-affects-5.0 labels Mar 21, 2022

cfzjywxk commented Mar 21, 2022

It seems the panic is caused by a nil longest sleep config being returned and then used.

I think it's caused by Backoffer.Fork not inheriting the configs and backoffSleepMS fields: the forked backoffer is used, but the corresponding configs cannot be found.
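
To illustrate the failure mode, here is a minimal, self-contained sketch; the types, fields, and method signatures are simplified stand-ins, not the actual client-go implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the client-go retry types; the field and method
// names below are illustrative, not the real API.
type Config struct {
	name string
	err  error
}

type Backoffer struct {
	maxSleep       int
	totalSleep     int
	configs        []*Config
	backoffSleepMS map[string]int
}

// Fork mirrors the suspected bug: the child keeps the accumulated totalSleep
// but does not inherit configs or backoffSleepMS.
func (b *Backoffer) Fork() *Backoffer {
	return &Backoffer{maxSleep: b.maxSleep, totalSleep: b.totalSleep}
}

// longestSleepCfg returns the config that has slept the longest, or nil when
// the bookkeeping is empty (as in a freshly forked backoffer).
func (b *Backoffer) longestSleepCfg() *Config {
	longest, name := 0, ""
	for n, ms := range b.backoffSleepMS {
		if ms > longest {
			longest, name = ms, n
		}
	}
	for _, cfg := range b.configs {
		if cfg.name == name {
			return cfg
		}
	}
	return nil
}

func (b *Backoffer) Backoff(cfg *Config) error {
	if b.totalSleep >= b.maxSleep {
		// Budget exhausted: report the config that slept the longest. On a
		// forked backoffer the bookkeeping is empty, longestSleepCfg() is nil,
		// and this dereference panics.
		return b.longestSleepCfg().err
	}
	if b.backoffSleepMS == nil {
		b.backoffSleepMS = map[string]int{}
	}
	b.totalSleep += 100 // pretend every backoff sleeps 100 ms
	b.backoffSleepMS[cfg.name] += 100
	b.configs = append(b.configs, cfg)
	return nil
}

func main() {
	dataNotReady := &Config{name: "dataNotReady", err: errors.New("data not ready")}
	parent := &Backoffer{maxSleep: 200}
	for i := 0; i < 3; i++ {
		fmt.Println(parent.Backoff(dataNotReady)) // third call returns the error cleanly
	}
	forked := parent.Fork()
	fmt.Println(forked.Backoff(dataNotReady)) // panics: nil pointer dereference
}
```

The forked backoffer inherits the exhausted sleep budget but not the bookkeeping needed to report which config slept the longest, so the nil config gets dereferenced.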

/cc @youjiali1995 @sticnarf

@youjiali1995
Contributor

Let me take a look.

@cfzjywxk
Contributor

@mayjiang0203
As the panic is not recovered by tidb-server and would make tidb-server crash, the critical severity level is used, and it is a blocking issue for v6.0.

@youjiali1995
Contributor

> I think it's caused by Backoffer.Fork not inheriting the configs and backoffSleepMS fields: the forked backoffer is used, but the corresponding configs cannot be found.

Agree. Who will be responsible for solving this issue? Are you working on it? @cfzjywxk

@cfzjywxk
Contributor

> I think it's caused by Backoffer.Fork not inheriting the configs and backoffSleepMS fields: the forked backoffer is used, but the corresponding configs cannot be found.
>
> Agree. Who will be responsible for solving this issue? Are you working on it? @cfzjywxk

@youjiali1995
Not yet. Do you have time to fix it?

@youjiali1995
Contributor

yes

youjiali1995 commented Mar 21, 2022

If an expired backoffer is cloned or forked, the cloned or forked one will panic on the next backoff.

```go
// TestBackoffErrorTypeWithForkAndClone, see https://github.com/pingcap/tidb/issues/33265
{
	b := NewBackofferWithVars(context.TODO(), 200, nil)
	// 700 ms sleep in total and the backoffer will return an error next time.
	for i := 0; i < 3; i++ {
		err = b.Backoff(BoMaxDataNotReady, errors.New("data not ready"))
		assert.Nil(t, err)
	}
	bForked, cancel := b.Fork()
	defer cancel()
	bCloned := b.Clone()
	for _, b := range []*Backoffer{bForked, bCloned} {
		err = b.Backoff(BoTiKVRPC, errors.New("tikv rpc"))
		assert.ErrorIs(t, err, BoMaxDataNotReady.err)
	}
}
```

Fork() and Clone() should deep copy slices and maps to avoid data races. Right now we shallow copy the error slice, which exhibits a data race (https://github.com/tikv/client-go/blob/33936824f6107da1b468edd1464ad3a5a7671bde/internal/retry/backoff.go#L230), and we don't copy backoffSleepMS at all, so the longestSleepCfg of the forked or cloned backoffer is not accurate. I'm wondering whether we should deep copy all fields to make it accurate, or deep copy nothing for speed. @cfzjywxk @sticnarf
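
For reference, a sketch of what the deep-copy option might look like, reusing the simplified Backoffer and Config types from the sketch earlier in this thread (again, not the actual client-go code):

```go
// Clone deep copies the bookkeeping so the clone neither races with its parent
// nor loses the data longestSleepCfg relies on; Fork would copy the same
// fields before deriving its child context.
func (b *Backoffer) Clone() *Backoffer {
	cloned := &Backoffer{
		maxSleep:   b.maxSleep,
		totalSleep: b.totalSleep,
		// Copy the slice into a fresh backing array instead of sharing it.
		configs: append([]*Config(nil), b.configs...),
		// Copy the map entry by entry so the clone keeps its own sleep stats.
		backoffSleepMS: make(map[string]int, len(b.backoffSleepMS)),
	}
	for name, ms := range b.backoffSleepMS {
		cloned.backoffSleepMS[name] = ms
	}
	return cloned
}
```

The extra cost is one small slice copy and one map copy per Fork/Clone, proportional to the number of distinct backoff configs touched so far.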

BTW, my vim is broken. I have to fix it ASAP. Could you work on it? @sticnarf

@cfzjywxk
Contributor

@youjiali1995
I think deep copy is OK and should have no side effects. I could work on it.

@sticnarf
Contributor

Looking at the use cases of Backoffer, these fields could be implemented with persistent data structures that support fast and safe copy-on-write operations.
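
For illustration, a rough single-goroutine sketch of that copy-on-write idea, with hypothetical names (a real persistent structure would also need to stay safe under concurrent access):

```go
// cowSleepMap treats a shared map as immutable: fork hands out the same
// reference cheaply, and a private copy is made only on the first write after
// sharing, so the other holder never observes the mutation.
type cowSleepMap struct {
	m      map[string]int
	shared bool
}

func (c *cowSleepMap) fork() *cowSleepMap {
	c.shared = true
	return &cowSleepMap{m: c.m, shared: true}
}

func (c *cowSleepMap) add(name string, ms int) {
	if c.shared || c.m == nil {
		// First write after sharing (or first write ever): copy into a fresh
		// map so any other holder keeps its unchanged version.
		copied := make(map[string]int, len(c.m)+1)
		for k, v := range c.m {
			copied[k] = v
		}
		c.m, c.shared = copied, false
	}
	c.m[name] += ms
}
```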
