large reorg sql execute failed during upgrade from v8.3.0 to v8.4.0 #56757

Closed
apollodafoni opened this issue Oct 22, 2024 · 4 comments · Fixed by #56860
Comments

@apollodafoni

apollodafoni commented Oct 22, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

tiup.yaml is as follows:

global:
  arch: amd64
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/tiup/deploy"
  data_dir: "/tiup/data"
  enable_tls: false
server_configs:
  pd: {}
  tidb: {}
  tikv: {}
pd_servers:
  - host: pd-1-peer
  - host: pd-2-peer
  - host: pd-3-peer
tidb_servers:
  - host: tidb-1-peer
  - host: tidb-2-peer
  - host: tidb-3-peer
tikv_servers:
  - host: tikv-1-peer
  - host: tikv-2-peer
  - host: tikv-3-peer
monitoring_servers:
  - host: tiup-peer
    ng_port: 12020
grafana_servers:
  - host: tiup-peer
alertmanager_servers:
  - host: tiup-peer
Deploy and start the cluster:

tiup cluster deploy ddl_upgrade v8.3.0 tiup.yaml --format json -y
tiup cluster start ddl_upgrade --format json -y

After restoring some data from S3, run a large reorg:

alter table bill_detail add index idx1 (create_time, update_time, bill_code, order_code, assign_site_code, three_code, send_name, receive_name, send_mobile)

During SQL execution, run a tiup upgrade:

tiup cluster upgrade ddl_upgrade v8.4.0-pre --wait-timeout 300 -y

2. What did you expect to see? (Required)

The large reorg SQL should execute successfully after the upgrade.

3. What did you see instead (Required)

It seems the DDL job was not paused during the upgrade.

The SQL execution failed with the message: "receive Regions with no peer"

[2024/10/21 23:43:13.073 +08:00] [ERROR] [task_executor.go:536] [onError] [task-id=1] [task-type=backfill] [error="receive Regions with no peer"] [stack="github.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).onError\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:536\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).runSubtask\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:418\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).runStep\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:374\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).RunStep\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:256\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).Run\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:236\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*Manager).startTaskExecutor.func1\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/manager.go:337\ngithub.com/pingcap/tidb/pkg/util.(*WaitGroupWrapper).RunWithLog.func1\n\t/workspace/source/tidb/pkg/util/wait_group_wrapper.go:171"]
[2024/10/21 23:43:13.073 +08:00] [ERROR] [task_executor.go:542] ["taskExecutor met first error"] [task-id=1] [task-type=backfill] [error="receive Regions with no peer"]

4. What is your TiDB version? (Required)

tiup upgrade tidb from v8.3.0 to v8.4.0-pre

apollodafoni added the type/bug (The issue is confirmed as a bug.) label on Oct 22, 2024
@apollodafoni
Author

/severity critical
/component ddl
/assign @tangenta
/label affects-8.4

@lance6716
Contributor

lance6716 commented Oct 22, 2024

https://github.com/tikv/client-go/blob/8dfa86b5d1dbd77b608b192bbf98132c79670706/internal/locate/region_cache.go#L2400

The error is raised here (found via an org-level search on GitHub). I guess the reason is that the PD ScanRegion API has no internal retry, and the caller forgot to handle it the way this code does:

https://github.com/tikv/client-go/blob/8dfa86b5d1dbd77b608b192bbf98132c79670706/internal/locate/region_cache.go#L2218-L2225
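
To make the missing handling concrete, here is a minimal sketch in Go, with hypothetical Peer/RegionInfo types standing in for the real client-go/PD structures (not the actual API): a region whose leader has not yet reported a heartbeat to PD comes back from the scan without a leader peer, and the caller should treat that as a retryable condition instead of failing the backfill subtask outright.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real client-go/PD region types.
type Peer struct {
	StoreID uint64
}

type RegionInfo struct {
	ID     uint64
	Leader *Peer
}

// errNoLeader marks the retryable "scanned region has no leader yet" case.
var errNoLeader = errors.New("receive Regions with no peer")

// checkScannedRegions returns a retryable error if any scanned region has no
// leader, e.g. because its leader has not reported a heartbeat to PD after a
// restart.
func checkScannedRegions(regions []*RegionInfo) error {
	for _, r := range regions {
		if r.Leader == nil {
			return fmt.Errorf("%w: region %d", errNoLeader, r.ID)
		}
	}
	return nil
}
```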

@lance6716
Contributor

Same cause as tikv/pd#8442

@lance6716 lance6716 assigned cfzjywxk and unassigned tangenta Oct 22, 2024
@cfzjywxk
Contributor

If the region information is loaded from the local disk and the current leader has not yet reported a heartbeat to PD, the region information scanned at this time will not include the leader.

Lightning has encountered similar issues before: #52822.

We need to add retry logic when the returned region information contains no leader.
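
A rough sketch of that retry, under the same assumptions as the sketch above (scanRegions is a hypothetical placeholder for the real PD scan call, and checkScannedRegions is the helper from the earlier sketch): re-scan with exponential backoff until every returned region has a leader, and only surface the error once the retry budget is exhausted.

```go
// scanRegionsWithRetry keeps re-scanning until every returned region has a
// leader or the retry budget runs out. scanRegions is a hypothetical
// placeholder for the real PD scan call; context and time must be imported.
func scanRegionsWithRetry(ctx context.Context, startKey, endKey []byte, limit int) ([]*RegionInfo, error) {
	backoff := 100 * time.Millisecond
	for attempt := 0; attempt < 10; attempt++ {
		regions, err := scanRegions(ctx, startKey, endKey, limit)
		if err == nil && checkScannedRegions(regions) == nil {
			return regions, nil
		}
		// The scan failed or some region has no leader yet (its leader has
		// not reported a heartbeat to PD): wait and try again.
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 2*time.Second {
			backoff *= 2
		}
	}
	return nil, errors.New("receive Regions with no peer after retries")
}
```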
