
fast_create_table: add retry for local worker #52543

Merged: 3 commits merged into pingcap:master on Jul 10, 2024

Conversation

@GMHDBJD (Contributor) commented Apr 12, 2024

What problem does this PR solve?

Issue Number: ref #49752

Problem Summary:

What changed and how does it work?

  • Add retry for retryable errors (see the sketch after this list)
  • Add an ID allocator to reduce write conflicts
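As a rough illustration of the retry change, here is a minimal, self-contained Go sketch of retrying a local DDL job on retryable errors. It only mirrors the shape of the loop added in this PR; the error sentinel errRetryable, the wrapper runLocalJobWithRetry, the backoff, and the value of maxRetryTime are illustrative assumptions, not the repository's code.

package main

import (
	"errors"
	"fmt"
	"time"
)

// errRetryable is a hypothetical sentinel standing in for the transient
// failures (e.g. write conflicts) that the PR's isRetryableError detects.
var errRetryable = errors.New("retryable error")

const maxRetryTime = 3 // a constant of this name appears in the PR; the value here is assumed

// runLocalJobWithRetry retries a local DDL job on retryable errors. Because
// fast-create-table jobs are not persisted in the DDL job table, the usual
// "re-read the job from the table and re-run it" retry path never sees them,
// so the local worker has to retry in place.
func runLocalJobWithRetry(handle func() error) error {
	var err error
	for i := 0; i < maxRetryTime; i++ {
		err = handle()
		if err == nil || !errors.Is(err, errRetryable) {
			return err
		}
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond) // assumed backoff
	}
	return err
}

func main() {
	calls := 0
	err := runLocalJobWithRetry(func() error {
		calls++
		if calls < 2 {
			return errRetryable // first attempt hits a transient failure
		}
		return nil
	})
	fmt.Println(calls, err) // 2 <nil>
}

Retrying in place is needed precisely because, as discussed in the review below, fast-create-table jobs never enter the DDL job table and so miss the table-based retry path.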

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot added the release-note-none label (denotes a PR that doesn't merit a release note) and the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Apr 12, 2024
tiprow bot commented Apr 12, 2024

Hi @GMHDBJD. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

"github.com/pingcap/tidb/pkg/meta"
)

const (
Collaborator: Can you add some comments so that users can easily understand the function and its parameter usage?

@GMHDBJD (Contributor, Author) commented Apr 15, 2024

/retest

tiprow bot commented Apr 15, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

codecov bot commented Apr 15, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.

Project coverage is 56.2759%. Comparing base (49f1427) to head (91521cb).
Report is 19 commits behind head on master.

Additional details and impacted files
@@                Coverage Diff                @@
##             master     #52543         +/-   ##
=================================================
- Coverage   74.7101%   56.2759%   -18.4343%     
=================================================
  Files          1545       1669        +124     
  Lines        362047     612862     +250815     
=================================================
+ Hits         270486     344894      +74408     
- Misses        71923     244527     +172604     
- Partials      19638      23441       +3803     
Flag Coverage Δ
integration 37.2033% <0.0000%> (?)
unit 71.9354% <71.4285%> (-1.6858%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.9656% <ø> (-2.2339%) ⬇️
parser ∅ <ø> (∅)
br 52.6914% <ø> (+5.0099%) ⬆️

@@ -348,7 +348,7 @@ func TestProcessChunkWith(t *testing.T) {
 require.Len(t, progress.GetColSize(), 3)
 checksumMap := checksum.GetInnerChecksums()
 require.Len(t, checksumMap, 1)
-require.Equal(t, verify.MakeKVChecksum(111, 3, 14231358899564314836), *checksumMap[verify.DataKVGroupID])
+require.Equal(t, verify.MakeKVChecksum(111, 3, 4697906519963456140), *checksumMap[verify.DataKVGroupID])
@GMHDBJD (Contributor, Author): The checksum changed because the table ID is different.

@lance6716 (Contributor) left a comment

/approve

tasks = newTasks
}
// TODO: since ticdc doesn't support batch create tables now, we disable until ticdc support it.
// if newTasks, err := d.combineBatchCreateTableJobs(tasks); err == nil {
Contributor: If they can support it in the short term, feel free to leave it as is. If they won't support it for a long time, can we just delete it for now?

@GMHDBJD (Contributor, Author): They said it has not been scheduled yet, and we hope to support CDC in 8.1, so we disable it for now.

var err error
for i := 0; i < maxRetryTime; i++ {
err = wk.HandleLocalDDLJob(d.ddlCtx, job)
if err == nil || !isRetryableError(err) {
Contributor: I remember that if the job itself is not finished (job.IsFinished is false), it will retry by itself. Why is the explicit retry needed here?

@GMHDBJD (Contributor, Author): For fast create table, the job does not exist in the DDL job table, so the original retry mechanism (the job being taken out of the DDL job table and re-executed) does not take effect.

Contributor: Got it. Could we add a comment here?

}

type allocator struct {
cache []int64
@zimulala (Contributor) commented Apr 18, 2024

With the cache, if there are multiple TiDB instances in a cluster, is the following possible?

  • TiDB1 executes DDL1, DDL2, ..., DDL201 on table tblx.
  • TiDB2 executes DDL302 (job.ID: 302) on table tbl.
  • After 5 minutes, DDL201 is still not finished.
  • TiDB1 then executes DDLn (job.ID: 202) on table tbl.

DDLn starts executing 5 minutes later than DDL302, but it may actually start and finish first. There is no correctness problem, but it is strange. In addition, these IDs are also used as table IDs, which may have similar problems.

@GMHDBJD (Contributor, Author): Yes, that's possible; see id_allocator_test.go. If multiple TiDB instances execute DDL at the same time, the order of job IDs is no longer guaranteed, and there may be ID gaps between two jobs.

@GMHDBJD (Contributor, Author): However, the previous implementation had similar problems. For example, a job could hold its allocated ID for a long time before being committed, so the order of the IDs was already not guaranteed, and that had no impact.

Contributor: The original ID order is not guaranteed either, but the gaps are relatively small. It doesn't seem to be a problem at the moment, but it feels a bit odd, and table IDs have the same issue. Do you have a performance comparison for this optimization, to see whether the effect is significant?

@GMHDBJD (Contributor, Author): Performance is one aspect, but the bigger concern is robustness. When multiple TiDB instances allocate IDs at the same time, it causes many write conflicts, and switching to pessimistic transactions would also hurt performance. See #27197.

@zimulala (Contributor) commented Apr 24, 2024

  1. If you change the step to 10, the cases I mentioned above become rarer.
  2. The occurrence probability of the scenario you mentioned (#27197) is relatively small. In addition, we theoretically have limitDDLJobs, so this kind of problem should not occur within a single TiDB. But we use the same key for tableID and jobID, so a create table operation may result in multiple ID allocations.
     Suggestion: if we put the tableID and jobID assignments together, this conflict will be reduced a lot, and it does not affect the old logic (which assumes ID allocation is roughly incremental).

@GMHDBJD (Contributor, Author): In fact, the original logic already has a global ID lock, which means there are no longer write conflicts within the same TiDB; across multiple TiDB instances, however, this problem cannot be avoided.

Contributor: txn.SetOption(kv.Pessimistic, true) doesn't by itself make a pessimistic transaction; we have to lock the required key explicitly. I tested lock + incr(global id) using 1000 goroutines, and the QPS is about 1k.

Suggestion: If we put tableID and jobID assignments together, this conflict will be reduced a lot

Agreed. We can consider caching IDs after that optimization, if this part is still worth optimizing.

FYI, I have an ongoing PR that separates local DDL jobs from normal ones in addBatchDDLJobs and makes job ID allocation and the insert into the job table run in one transaction, to make sure jobs are submitted in ID order.
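To make the idea in this thread concrete, here is a minimal Go sketch of a cached ID allocator. The struct name and cache field echo the hunk above, but the mutex, the refill callback, and the batch size of 100 are illustrative assumptions, not TiDB's code (and note that, per the title change below, the allocator part was ultimately dropped from this PR).

package main

import (
	"fmt"
	"sync"
)

// allocator caches a batch of pre-allocated global IDs so that each node only
// touches the shared global-ID key once per batch, reducing write conflicts
// when many tables are created concurrently.
type allocator struct {
	mu    sync.Mutex
	cache []int64
}

// alloc hands out one ID from the cache, refilling it in batches via the
// provided refill function (which would wrap something like genGlobalIDs).
func (a *allocator) alloc(refill func(n int) ([]int64, error)) (int64, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if len(a.cache) == 0 {
		const step = 100 // hypothetical batch size; the thread discusses tuning it
		ids, err := refill(step)
		if err != nil {
			return 0, err
		}
		a.cache = ids
	}
	id := a.cache[0]
	a.cache = a.cache[1:]
	return id, nil
}

func main() {
	next := int64(0)
	refill := func(n int) ([]int64, error) { // fake global-ID source for the example
		ids := make([]int64, n)
		for i := range ids {
			next++
			ids[i] = next
		}
		return ids, nil
	}
	a := &allocator{}
	for i := 0; i < 3; i++ {
		id, _ := a.alloc(refill)
		fmt.Println(id) // 1, 2, 3
	}
}

As zimulala notes above, a per-node cache trades strict cross-node ID ordering for fewer writes to the shared global-ID key, and gaps can appear between consecutive IDs.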

@GMHDBJD (Contributor, Author) commented Apr 26, 2024

/retest

tiprow bot commented Apr 26, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@GMHDBJD (Contributor, Author) commented Jul 1, 2024

/retest

tiprow bot commented Jul 1, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@GMHDBJD (Contributor, Author) commented Jul 4, 2024

/retest

tiprow bot commented Jul 4, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

// lock to reduce conflict
d.globalIDLock.Lock()

ids, err = d.genGlobalIDs(len(tasks))
Contributor: It seems this optimization only works when we connect to multiple nodes to do create-table. Do you have any data on how much this can improve performance?

@GMHDBJD (Contributor, Author): (two performance screenshots attached, showing the results before and after the change)
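For readers skimming the hunk above, here is a minimal sketch of the batching it performs: take globalIDLock once and allocate IDs for the whole batch of tasks with a single genGlobalIDs call, instead of one allocation per task. The ddlCtx fields and the string task type here are simplified stand-ins for illustration, not TiDB's actual definitions.

package main

import (
	"fmt"
	"sync"
)

// ddlCtx is a toy stand-in for the real context: a fake global-ID source
// guarded by a lock, mirroring the genGlobalIDs + globalIDLock pair above.
type ddlCtx struct {
	globalIDLock sync.Mutex
	nextID       int64
}

func (d *ddlCtx) genGlobalIDs(n int) ([]int64, error) {
	ids := make([]int64, n)
	for i := range ids {
		d.nextID++
		ids[i] = d.nextID
	}
	return ids, nil
}

// assignJobIDs allocates one ID per task under a single lock acquisition, so a
// batch of N create-table tasks causes one allocation instead of N.
func (d *ddlCtx) assignJobIDs(tasks []string) ([]int64, error) {
	// lock to reduce conflict (same intent as the comment in the diff)
	d.globalIDLock.Lock()
	defer d.globalIDLock.Unlock()
	return d.genGlobalIDs(len(tasks))
}

func main() {
	d := &ddlCtx{}
	ids, _ := d.assignJobIDs([]string{"t1", "t2", "t3"})
	fmt.Println(ids) // [1 2 3]
}

Within a single TiDB instance the global-ID lock already serializes allocation, so, as the reviewer notes, the benefit shows up mainly when the create-table load is spread across multiple nodes.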

@GMHDBJD (Contributor, Author) commented Jul 5, 2024

/retest

tiprow bot commented Jul 5, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@ti-chi-bot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) and removed the size/L label on Jul 9, 2024
@GMHDBJD changed the title from "fast_create_table: add retry and id_allocator" to "fast_create_table: add retry for local worker" on Jul 9, 2024
@GMHDBJD (Contributor, Author) commented Jul 9, 2024

/retest

tiprow bot commented Jul 9, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@GMHDBJD (Contributor, Author) commented Jul 9, 2024

/retest

tiprow bot commented Jul 9, 2024

@GMHDBJD: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@GMHDBJD (Contributor, Author) commented Jul 9, 2024

/ok-to-test

@ti-chi-bot added the ok-to-test label (indicates a PR is ready to be tested) on Jul 9, 2024
A review thread on pkg/ddl/job_table.go was marked outdated and resolved.
@ti-chi-bot added the approved and needs-1-more-lgtm labels on Jul 9, 2024
ti-chi-bot commented Jul 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: D3Hunter, lance6716, tangenta

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot added the lgtm label and removed the needs-1-more-lgtm label on Jul 10, 2024
ti-chi-bot commented Jul 10, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-07-09 14:24:39.501462366 +0000 UTC m=+367576.736696479: ☑️ agreed by tangenta.
  • 2024-07-10 06:35:24.019274486 +0000 UTC m=+425821.254508598: ☑️ agreed by D3Hunter.

@GMHDBJD (Contributor, Author) commented Jul 10, 2024

/retest

3 similar /retest comments followed.

@ti-chi-bot merged commit 4793f5e into pingcap:master on Jul 10, 2024
23 checks passed