ddl: fix a caught panic and add comment for DDL functions #54685
Conversation
Hi @lance6716. Thanks for your PR. PRs from untrusted users cannot be marked as trusted. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Signed-off-by: lance6716 <[email protected]>
Codecov Report

Attention: Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##             master     #54685       +/- ##
=================================================
- Coverage   74.6249%   56.2786%   -18.3464%
=================================================
  Files          1551       1673        +122
  Lines        362640     614388     +251748
=================================================
+ Hits         270620     345769      +75149
- Misses        72390     245222     +172832
- Partials      19630      23397       +3767

Flags with carried forward coverage won't be shown.
/retest
@lance6716: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message. In response to this:
pkg/ddl/ddl_worker.go
Outdated

@@ -899,7 +907,27 @@ func (w *worker) prepareTxn(job *model.Job) (kv.Transaction, error) {
	return txn, err
}

func (w *worker) HandleDDLJobTable(d *ddlCtx, job *model.Job) (int64, error) {
// runOneJobStep runs one step of the DDL job and persists the state changes. One
// *step* is defined for the following reasons:
Other steps: 1. reorg also has its own state changes; 2. reorg runs asynchronously, so the worker will enter/exit this function to check whether the async routine is done, and there is no state change during this time.
CI shows the step of onLockTables is not a schema state change. LOCK TABLES may have many table arguments, and in each step it updates one table's lock state and persists the new TableInfo. I'll come up with a better comment tomorrow, and align the code to fix the UT 😂 Maybe I should keep the old needUpdateRawArgs behaviour (if there is no runErr, we should marshal RawArgs), and fix the wrong runErr nilness to close #54687.
reorg runs asynchronously, it will enter/exit this function to check whether the async routine is done, there is no state change during this time

Please add this one too.
Updated some function comments: f782a52
reorg is a bit complex as a brief example in the function comments, so I only added onLockTables. Please check if it's clear enough.
pkg/ddl/ddl_worker.go
Outdated

// - We may need to use caller `runOneJobStepAndWaitSync` to make sure other nodes
// are synchronized before changing the job state. So an extra job state *step* is
// added.
You mean the wait logic above runOneJobStep? It's for failover.
Yes, the wait logic above runOneJobStep. This item wants to describe the job state changes, for example from JobStateDone to JobStateSynced.
I'm not sure why JobStateDone -> JobStateSynced is relevant to failover. My understanding is that the current node (the one the user connected to) must wait for all other nodes to finish synchronizing the job state before it tells the user the DDL is finished; otherwise it breaks linearizability (the user-connected node shows the DDL is finished, but a slow node shows it is not).
done -> synced might not go through the wait logic above runOneJobStep; the wait part might have been done in waitSchemaChanged below.

JobStateDone -> JobStateSynced is relevant to failover: suppose the owner changes during the wait. With this state change, we can catch it and wait again on the new owner, i.e. the wait logic above runOneJobStep.
pkg/ddl/ddl_worker.go
Outdated

// correctness through failover, this function will decide and persist the
// arguments of a job as a separate *step*. These steps will reuse "schema state"
// changes, see onRecoverTable as an example.
Do you include done->sync in this part? It's not for failover; it must be done in a separate step after waiting for the schema version.
No "done->sync" is included in the item of your above comment.
This part is like the reason in the comments of onRecoverTable. If the job will change system states (like disable GC) and revert it afterward, if must save the GC state before get running. Otherwise if the job fails between changing states and reverting states, we can't recover the original states.
Maybe the other examples are, job should decided the TS and persist it first. Otherwise if it runs twice and choose two TS due to node crash, some problems will occur.
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
pkg/ddl/ddl_worker.go
Outdated

// but they make use of caller transitOneJobStep to persist job changes.
//
// - We may need to use caller transitOneJobStepAndWaitSync to make sure all
// other nodes are synchronized to provide linearizability. So an extra job state
Please add an example. If I am a newcomer, I will ask what the extra job state is.
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
updateRawArgs = err == nil
// if job changed from running to rolling back, arguments may be changed
if prevState == model.JobStateRunning && job.IsRollingback() {
	updateRawArgs = true
We should move the runtime args change into something like a job ctx: actively mark it when those args change, and fill them back depending on the mark.
Right now we are checking whether the args changed passively.
Yes, I started in the active way yesterday, and found there are too many cases and many function signatures need to be changed 😂 Saving it in job seems to avoid many changes; I'll try it in this PR or future ones. It's OK if this PR is merged before I finish the development. No need to hold this PR.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: D3Hunter, tangenta. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@lance6716: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard.
What problem does this PR solve?
Issue Number: close #54687 ref #54436
Problem Summary:
What changed and how does it work?
Add updateRawArgs as a return value of runOneJobStep, to decide more accurately whether RawArgs needs to be updated.

Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.