large reorg sql execute failed during upgrade from v8.3.0 to v8.4.0 #56757

Closed
apollodafoni opened this issue Oct 22, 2024 · 4 comments · Fixed by #56860
Comments

@apollodafoni

apollodafoni commented Oct 22, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

tiup.yaml is as follows:

global:
  arch: amd64
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/tiup/deploy"
  data_dir: "/tiup/data"
  enable_tls: false
server_configs:
  pd: {}
  tidb: {}
  tikv: {}
pd_servers:
  - host: pd-1-peer
  - host: pd-2-peer
  - host: pd-3-peer
tidb_servers:
  - host: tidb-1-peer
  - host: tidb-2-peer
  - host: tidb-3-peer
tikv_servers:
  - host: tikv-1-peer
  - host: tikv-2-peer
  - host: tikv-3-peer
monitoring_servers:
  - host: tiup-peer
    ng_port: 12020
grafana_servers:
  - host: tiup-peer
alertmanager_servers:
  - host: tiup-peer
Deploy and start the cluster:

tiup cluster deploy ddl_upgrade v8.3.0 tiup.yaml --format json -y
tiup cluster start ddl_upgrade --format json -y

After restoring some data from S3, run a large reorg:

alter table bill_detail add index idx1 (create_time, update_time, bill_code, order_code, assign_site_code, three_code, send_name, receive_name, send_mobile)

During SQL execution, run a tiup upgrade:

tiup cluster upgrade ddl_upgrade v8.4.0-pre --wait-timeout 300 -y

2. What did you expect to see? (Required)

The large reorg SQL should execute successfully after the upgrade.

3. What did you see instead (Required)

It seems the DDL job was not paused during the upgrade.

The SQL execution failed with the message: "receive Regions with no peer"

[2024/10/21 23:43:13.073 +08:00] [ERROR] [task_executor.go:536] [onError] [task-id=1] [task-type=backfill] [error="receive Regions with no peer"] [stack="github.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).onError\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:536\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).runSubtask\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:418\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).runStep\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:374\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).RunStep\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:256\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).Run\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:236\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*Manager).startTaskExecutor.func1\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/manager.go:337\ngithub.com/pingcap/tidb/pkg/util.(*WaitGroupWrapper).RunWithLog.func1\n\t/workspace/source/tidb/pkg/util/wait_group_wrapper.go:171"]
[2024/10/21 23:43:13.073 +08:00] [ERROR] [task_executor.go:542] ["taskExecutor met first error"] [task-id=1] [task-type=backfill] [error="receive Regions with no peer"]

4. What is your TiDB version? (Required)

tiup upgrade tidb from v8.3.0 to v8.4.0-pre

apollodafoni added the type/bug (The issue is confirmed as a bug.) label on Oct 22, 2024
@apollodafoni
Author

/severity critical
/component ddl
/assign @tangenta
/label affects-8.4

@lance6716
Contributor

lance6716 commented Oct 22, 2024

https://github.com/tikv/client-go/blob/8dfa86b5d1dbd77b608b192bbf98132c79670706/internal/locate/region_cache.go#L2400

The error is raised here (found via an org-level search on GitHub). I guess the reason is that the PD ScanRegion API has no internal retry, and the caller forgot to handle it the way this code does:

https://github.com/tikv/client-go/blob/8dfa86b5d1dbd77b608b192bbf98132c79670706/internal/locate/region_cache.go#L2218-L2225
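
To make the missing handling concrete, here is a minimal sketch in Go, with hypothetical Peer/RegionInfo types standing in for the real client-go/PD structures (not the actual API): a region whose leader has not yet reported a heartbeat to PD comes back from the scan without a leader peer, and the caller should treat that as a retryable condition instead of failing the backfill subtask outright.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real client-go/PD region types.
type Peer struct {
	StoreID uint64
}

type RegionInfo struct {
	ID     uint64
	Leader *Peer
}

// errNoLeader marks the retryable "scanned region has no leader yet" case.
var errNoLeader = errors.New("receive Regions with no peer")

// checkScannedRegions returns a retryable error if any scanned region has no
// leader, e.g. because its leader has not reported a heartbeat to PD after a
// restart.
func checkScannedRegions(regions []*RegionInfo) error {
	for _, r := range regions {
		if r.Leader == nil {
			return fmt.Errorf("%w: region %d", errNoLeader, r.ID)
		}
	}
	return nil
}
```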

@lance6716
Contributor

Same cause as tikv/pd#8442

@lance6716 lance6716 assigned cfzjywxk and unassigned tangenta Oct 22, 2024
@cfzjywxk
Contributor

If the region information is loaded from the local disk and the current leader has not yet reported a heartbeat to PD, the region information scanned at this time will not include the leader.

Lightning has encountered similar issues before: #52822.

We need to add retry logic when the returned region information contains no leader.
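
A rough sketch of that retry, under the same assumptions as the sketch above (scanRegions is a hypothetical placeholder for the real PD scan call, and checkScannedRegions is the helper from the earlier sketch): re-scan with exponential backoff until every returned region has a leader, and only surface the error once the retry budget is exhausted.

```go
// scanRegionsWithRetry keeps re-scanning until every returned region has a
// leader or the retry budget runs out. scanRegions is a hypothetical
// placeholder for the real PD scan call; context and time must be imported.
func scanRegionsWithRetry(ctx context.Context, startKey, endKey []byte, limit int) ([]*RegionInfo, error) {
	backoff := 100 * time.Millisecond
	for attempt := 0; attempt < 10; attempt++ {
		regions, err := scanRegions(ctx, startKey, endKey, limit)
		if err == nil && checkScannedRegions(regions) == nil {
			return regions, nil
		}
		// The scan failed or some region has no leader yet (its leader has
		// not reported a heartbeat to PD): wait and try again.
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 2*time.Second {
			backoff *= 2
		}
	}
	return nil, errors.New("receive Regions with no peer after retries")
}
```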
