ddl: fix a caught panic and add comment for DDL functions #54685
Conversation
Hi @lance6716. Thanks for your PR. PRs from untrusted users cannot be marked as trusted. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Signed-off-by: lance6716 <[email protected]>
Codecov Report

Attention: Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##             master     #54685       +/- ##
=================================================
- Coverage   74.6249%   56.2786%   -18.3464%
=================================================
  Files          1551       1673        +122
  Lines        362640     614388     +251748
=================================================
+ Hits         270620     345769      +75149
- Misses        72390     245222     +172832
- Partials      19630      23397       +3767

Flags with carried forward coverage won't be shown.
/retest
@lance6716: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message. In response to this:
pkg/ddl/ddl_worker.go
Outdated

@@ -899,7 +907,27 @@ func (w *worker) prepareTxn(job *model.Job) (kv.Transaction, error) {
	return txn, err
}

func (w *worker) HandleDDLJobTable(d *ddlCtx, job *model.Job) (int64, error) {
// runOneJobStep runs one step of the DDL job and persists the state changes. One
// *step* is defined for the following reasons:
Other steps: 1. reorg also has its own state changes; 2. reorg runs asynchronously, so the worker will enter/exit this function to check whether the async routine is done, and there is no state change during this time.
CI shows the step of onLockTables is not a schema state change. LOCK TABLES may have many table arguments, and in each step it updates one table's lock state and persists the new TableInfo. I'll come up with a better comment tomorrow, and align the code to fix the UT 😂 Maybe I should keep the old needUpdateRawArgs behaviour (if there is no runErr, we should marshal RawArgs), and fix the wrong runErr nilness to close #54687.
reorg runs asynchronously, it will enter/exit this function to check whether the async routine is done, there is no state change during this time

Please add this one too.
Updated some function comments: f782a52
reorg is a bit complex as a brief example in the function comments, so I only added onLockTables. Please check if it's clear enough.
pkg/ddl/ddl_worker.go
Outdated

// - We may need to use caller `runOneJobStepAndWaitSync` to make sure other nodes
// are synchronized before changing the job state. So an extra job state *step* is
// added.
You mean the wait logic above runOneJobStep? It's for failover.
Yes, the wait logic above runOneJobStep. This item wants to describe the job state changes, for example from JobStateDone to JobStateSynced.
I'm not sure why JobStateDone -> JobStateSynced is relevant to failover. My understanding is that the current node (the one the user connected to) must wait for all other nodes to finish synchronizing the job state before it tells the user the DDL is finished; otherwise it breaks linearizability (the user-connected node shows the DDL is finished, but a slow node shows it is not).
done -> synced might not go through the wait logic above runOneJobStep; the wait part might have been done in waitSchemaChanged below.

JobStateDone -> JobStateSynced is relevant to failover: suppose the owner changes during the wait. With this state change, we can catch it and wait again on the new owner, i.e. the wait logic above runOneJobStep.
pkg/ddl/ddl_worker.go
Outdated

// correctness through failover, this function will decide and persist the
// arguments of a job as a separate *step*. These steps will reuse "schema state"
// changes, see onRecoverTable as an example.
Do you include done->sync in this part? It's not for failover; it must be done in a separate step after waiting for the schema version.
No "done->sync" is included in the item of your above comment.
This part is like the reason in the comments of onRecoverTable. If the job will change system states (like disable GC) and revert it afterward, if must save the GC state before get running. Otherwise if the job fails between changing states and reverting states, we can't recover the original states.
Maybe the other examples are, job should decided the TS and persist it first. Otherwise if it runs twice and choose two TS due to node crash, some problems will occur.
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
pkg/ddl/ddl_worker.go
Outdated

// but they make use of caller transitOneJobStep to persist job changes.
//
// - We may need to use caller transitOneJobStepAndWaitSync to make sure all
// other nodes are synchronized to provide linearizability. So an extra job state
Please add an example. If I am a newcomer, I will ask what the extra job state is.
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
updateRawArgs = err == nil
// if job changed from running to rolling back, arguments may be changed
if prevState == model.JobStateRunning && job.IsRollingback() {
	updateRawArgs = true
We should move the runtime args change into something like a job ctx: actively mark it when those args change, and fill them back depending on the mark.
Right now we are checking whether the args changed passively.
Yes, I started in the active way yesterday, and found there are too many cases and many function signatures need to be changed 😂 Saving it in job seems to avoid many changes; I'll try it in this PR or future ones. It's OK if this PR is merged before I finish the development. No need to hold this PR.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: D3Hunter, tangenta. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@lance6716: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard.
What problem does this PR solve?
Issue Number: close #54687 ref #54436
Problem Summary:
What changed and how does it work?
Add updateRawArgs as a return value of runOneJobStep, to decide more accurately whether RawArgs needs to be updated.

Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.