
fast_create_table: add retry for local worker #52543

Merged: 3 commits merged into pingcap:master on Jul 10, 2024

Conversation

@GMHDBJD (Contributor) commented Apr 12, 2024

What problem does this PR solve?

Issue Number: ref #49752

Problem Summary:

What changed and how does it work?

  • Add retry for retryable errors (see the sketch after this list)
  • Add an ID allocator to reduce write conflicts
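As a rough illustration of the retry change, here is a minimal, self-contained Go sketch of retrying a local DDL job on retryable errors. It only mirrors the shape of the loop added in this PR; the error sentinel errRetryable, the wrapper runLocalJobWithRetry, the backoff, and the value of maxRetryTime are illustrative assumptions, not the repository's code.

package main

import (
	"errors"
	"fmt"
	"time"
)

// errRetryable is a hypothetical sentinel standing in for the transient
// failures (e.g. write conflicts) that the PR's isRetryableError detects.
var errRetryable = errors.New("retryable error")

const maxRetryTime = 3 // a constant of this name appears in the PR; the value here is assumed

// runLocalJobWithRetry retries a local DDL job on retryable errors. Because
// fast-create-table jobs are not persisted in the DDL job table, the usual
// "re-read the job from the table and re-run it" retry path never sees them,
// so the local worker has to retry in place.
func runLocalJobWithRetry(handle func() error) error {
	var err error
	for i := 0; i < maxRetryTime; i++ {
		err = handle()
		if err == nil || !errors.Is(err, errRetryable) {
			return err
		}
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond) // assumed backoff
	}
	return err
}

func main() {
	calls := 0
	err := runLocalJobWithRetry(func() error {
		calls++
		if calls < 2 {
			return errRetryable // first attempt hits a transient failure
		}
		return nil
	})
	fmt.Println(calls, err) // 2 <nil>
}

Retrying in place is needed precisely because, as discussed in the review below, fast-create-table jobs never enter the DDL job table and so miss the table-based retry path.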

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot added the release-note-none label (denotes a PR that doesn't merit a release note) and the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Apr 12, 2024
tiprow bot commented Apr 12, 2024

Hi @GMHDBJD. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

"github.com/pingcap/tidb/pkg/meta"
)

const (
Collaborator: Can you add some comments so that users can easily understand the function and its parameter usage?

@GMHDBJD (Contributor, Author) commented Apr 15, 2024

/retest

tiprow bot commented Apr 15, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

codecov bot commented Apr 15, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.

Project coverage is 56.2759%. Comparing base (49f1427) to head (91521cb).
Report is 19 commits behind head on master.

Additional details and impacted files
@@                Coverage Diff                @@
##             master     #52543         +/-   ##
=================================================
- Coverage   74.7101%   56.2759%   -18.4343%     
=================================================
  Files          1545       1669        +124     
  Lines        362047     612862     +250815     
=================================================
+ Hits         270486     344894      +74408     
- Misses        71923     244527     +172604     
- Partials      19638      23441       +3803     
Flag Coverage Δ
integration 37.2033% <0.0000%> (?)
unit 71.9354% <71.4285%> (-1.6858%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.9656% <ø> (-2.2339%) ⬇️
parser ∅ <ø> (∅)
br 52.6914% <ø> (+5.0099%) ⬆️

@@ -348,7 +348,7 @@ func TestProcessChunkWith(t *testing.T) {
 require.Len(t, progress.GetColSize(), 3)
 checksumMap := checksum.GetInnerChecksums()
 require.Len(t, checksumMap, 1)
-require.Equal(t, verify.MakeKVChecksum(111, 3, 14231358899564314836), *checksumMap[verify.DataKVGroupID])
+require.Equal(t, verify.MakeKVChecksum(111, 3, 4697906519963456140), *checksumMap[verify.DataKVGroupID])
@GMHDBJD (Contributor, Author): The checksum changed because the table ID is different.

@lance6716 (Contributor) left a comment

/approve

tasks = newTasks
}
// TODO: since ticdc doesn't support batch create tables now, we disable until ticdc support it.
// if newTasks, err := d.combineBatchCreateTableJobs(tasks); err == nil {
Contributor: If they can support it in the short term, feel free to leave it as is. If they won't support it for a long time, can we just delete it for now?

@GMHDBJD (Contributor, Author): They said it has not been scheduled yet, and we hope to support CDC in 8.1, so we disable it for now.

var err error
for i := 0; i < maxRetryTime; i++ {
err = wk.HandleLocalDDLJob(d.ddlCtx, job)
if err == nil || !isRetryableError(err) {
Contributor: I remember that if the job itself is not finished (job.IsFinished is false), it will retry by itself. Why is the explicit retry needed here?

@GMHDBJD (Contributor, Author): For fast create table, the job does not exist in the DDL job table, so the original retry mechanism (the job being taken out of the DDL job table and re-executed) does not take effect.

Contributor: Got it. Could we add a comment here?

}

type allocator struct {
cache []int64
@zimulala (Contributor) commented Apr 18, 2024

With the cache, if there are multiple TiDB instances in a cluster, is the following possible?

  • TiDB1 executes DDL1, DDL2, ..., DDL201 on table tblx.
  • TiDB2 executes DDL302 (job.ID: 302) on table tbl.
  • After 5 minutes, DDL201 is still not finished.
  • TiDB1 then executes DDLn (job.ID: 202) on table tbl.

DDLn starts executing 5 minutes later than DDL302, but it may actually start and finish first. There is no correctness problem, but it is strange. In addition, these IDs are also used as table IDs, which may have similar problems.

@GMHDBJD (Contributor, Author): Yes, that's possible; see id_allocator_test.go. If multiple TiDB instances execute DDL at the same time, the order of job IDs is no longer guaranteed, and there may be ID gaps between two jobs.

@GMHDBJD (Contributor, Author): However, the previous implementation had similar problems. For example, a job could hold its allocated ID for a long time before being committed, so the order of the IDs was already not guaranteed, and that had no impact.

Contributor: The original ID order is not guaranteed either, but the gaps are relatively small. It doesn't seem to be a problem at the moment, but it feels a bit odd, and table IDs have the same issue. Do you have a performance comparison for this optimization, to see whether the effect is significant?

@GMHDBJD (Contributor, Author): Performance is one aspect, but the bigger concern is robustness. When multiple TiDB instances allocate IDs at the same time, it causes many write conflicts, and switching to pessimistic transactions would also hurt performance. See #27197.

@zimulala (Contributor) commented Apr 24, 2024

  1. If you change the step to 10, the cases I mentioned above become rarer.
  2. The occurrence probability of the scenario you mentioned (#27197) is relatively small. In addition, we theoretically have limitDDLJobs, so this kind of problem should not occur within a single TiDB. But we use the same key for tableID and jobID, so a create table operation may result in multiple ID allocations.
     Suggestion: if we put the tableID and jobID assignments together, this conflict will be reduced a lot, and it does not affect the old logic (which assumes ID allocation is roughly incremental).

@GMHDBJD (Contributor, Author): In fact, the original logic already has a global ID lock, which means there are no longer write conflicts within the same TiDB; across multiple TiDB instances, however, this problem cannot be avoided.

Contributor: txn.SetOption(kv.Pessimistic, true) doesn't by itself make a pessimistic transaction; we have to lock the required key explicitly. I tested lock + incr(global id) using 1000 goroutines, and the QPS is about 1k.

Suggestion: If we put tableID and jobID assignments together, this conflict will be reduced a lot

Agreed. We can consider caching IDs after that optimization, if this part is still worth optimizing.

FYI, I have an ongoing PR that separates local DDL jobs from normal ones in addBatchDDLJobs and makes job ID allocation and the insert into the job table run in one transaction, to make sure jobs are submitted in ID order.
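To make the idea in this thread concrete, here is a minimal Go sketch of a cached ID allocator. The struct name and cache field echo the hunk above, but the mutex, the refill callback, and the batch size of 100 are illustrative assumptions, not TiDB's code (and note that, per the title change below, the allocator part was ultimately dropped from this PR).

package main

import (
	"fmt"
	"sync"
)

// allocator caches a batch of pre-allocated global IDs so that each node only
// touches the shared global-ID key once per batch, reducing write conflicts
// when many tables are created concurrently.
type allocator struct {
	mu    sync.Mutex
	cache []int64
}

// alloc hands out one ID from the cache, refilling it in batches via the
// provided refill function (which would wrap something like genGlobalIDs).
func (a *allocator) alloc(refill func(n int) ([]int64, error)) (int64, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if len(a.cache) == 0 {
		const step = 100 // hypothetical batch size; the thread discusses tuning it
		ids, err := refill(step)
		if err != nil {
			return 0, err
		}
		a.cache = ids
	}
	id := a.cache[0]
	a.cache = a.cache[1:]
	return id, nil
}

func main() {
	next := int64(0)
	refill := func(n int) ([]int64, error) { // fake global-ID source for the example
		ids := make([]int64, n)
		for i := range ids {
			next++
			ids[i] = next
		}
		return ids, nil
	}
	a := &allocator{}
	for i := 0; i < 3; i++ {
		id, _ := a.alloc(refill)
		fmt.Println(id) // 1, 2, 3
	}
}

As zimulala notes above, a per-node cache trades strict cross-node ID ordering for fewer writes to the shared global-ID key, and gaps can appear between consecutive IDs.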

@GMHDBJD (Contributor, Author) commented Apr 26, 2024

/retest

tiprow bot commented Apr 26, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@GMHDBJD (Contributor, Author) commented Jul 1, 2024

/retest

tiprow bot commented Jul 1, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@GMHDBJD (Contributor, Author) commented Jul 4, 2024

/retest

tiprow bot commented Jul 4, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

// lock to reduce conflict
d.globalIDLock.Lock()

ids, err = d.genGlobalIDs(len(tasks))
Contributor: It seems this optimization only works when we connect to multiple nodes to do create-table. Do you have any data on how much this can improve performance?

@GMHDBJD (Contributor, Author): (two performance screenshots attached, showing the results before and after the change)
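For readers skimming the hunk above, here is a minimal sketch of the batching it performs: take globalIDLock once and allocate IDs for the whole batch of tasks with a single genGlobalIDs call, instead of one allocation per task. The ddlCtx fields and the string task type here are simplified stand-ins for illustration, not TiDB's actual definitions.

package main

import (
	"fmt"
	"sync"
)

// ddlCtx is a toy stand-in for the real context: a fake global-ID source
// guarded by a lock, mirroring the genGlobalIDs + globalIDLock pair above.
type ddlCtx struct {
	globalIDLock sync.Mutex
	nextID       int64
}

func (d *ddlCtx) genGlobalIDs(n int) ([]int64, error) {
	ids := make([]int64, n)
	for i := range ids {
		d.nextID++
		ids[i] = d.nextID
	}
	return ids, nil
}

// assignJobIDs allocates one ID per task under a single lock acquisition, so a
// batch of N create-table tasks causes one allocation instead of N.
func (d *ddlCtx) assignJobIDs(tasks []string) ([]int64, error) {
	// lock to reduce conflict (same intent as the comment in the diff)
	d.globalIDLock.Lock()
	defer d.globalIDLock.Unlock()
	return d.genGlobalIDs(len(tasks))
}

func main() {
	d := &ddlCtx{}
	ids, _ := d.assignJobIDs([]string{"t1", "t2", "t3"})
	fmt.Println(ids) // [1 2 3]
}

Within a single TiDB instance the global-ID lock already serializes allocation, so, as the reviewer notes, the benefit shows up mainly when the create-table load is spread across multiple nodes.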

@GMHDBJD (Contributor, Author) commented Jul 5, 2024

/retest

tiprow bot commented Jul 5, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@ti-chi-bot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) and removed the size/L label on Jul 9, 2024
@GMHDBJD changed the title from "fast_create_table: add retry and id_allocator" to "fast_create_table: add retry for local worker" on Jul 9, 2024
@GMHDBJD (Contributor, Author) commented Jul 9, 2024

/retest

tiprow bot commented Jul 9, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@GMHDBJD (Contributor, Author) commented Jul 9, 2024

/retest

tiprow bot commented Jul 9, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@GMHDBJD (Contributor, Author) commented Jul 9, 2024

/ok-to-test

@ti-chi-bot added the ok-to-test label (indicates a PR is ready to be tested) on Jul 9, 2024
A review thread on pkg/ddl/job_table.go was marked outdated and resolved.
@ti-chi-bot added the approved and needs-1-more-lgtm labels on Jul 9, 2024
ti-chi-bot commented Jul 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: D3Hunter, lance6716, tangenta

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot added the lgtm label and removed the needs-1-more-lgtm label on Jul 10, 2024
ti-chi-bot commented Jul 10, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-07-09 14:24:39.501462366 +0000 UTC m=+367576.736696479: ☑️ agreed by tangenta.
  • 2024-07-10 06:35:24.019274486 +0000 UTC m=+425821.254508598: ☑️ agreed by D3Hunter.

@GMHDBJD (Contributor, Author) commented Jul 10, 2024

/retest

3 similar /retest comments followed.

@ti-chi-bot merged commit 4793f5e into pingcap:master on Jul 10, 2024
23 checks passed