-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rollback changes made during PlannedReparent if we encounter errors #4550
Conversation
…rrors Signed-off-by: deepthi <[email protected]>
References #4481 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it is heading in a good direction, but I'm admittedly not that familiar with the internals to provide much useful feedback on the actual implementation. Maybe @rafael or @demmer could provide better direction.
On the logical side of PlannedReparentShard
, I think it might be a good idea to add a check before demoting the master, to make it less likely to enter the rollback scenario you're covering here.
vitess/go/vt/wrangler/reparent.go
Line 415 in 16ca2c3
I don't know if this is worth a new flag or not, but it seems reasonable to me to say "if replication delay on the master-elect is greater than 10 seconds, error out early." @hmcgonig and @acharis at HubSpot are doing a similar check before they even run PlannedReparentShard
.
@@ -426,6 +427,10 @@ func (wr *Wrangler) plannedReparentShardLocked(ctx context.Context, ev *events.R | |||
event.DispatchUpdate(ev, "promoting slave") | |||
rp, err = wr.tmc.PromoteSlaveWhenCaughtUp(ctx, masterElectTabletInfo.Tablet, rp) | |||
if err != nil { | |||
// if this fails it is not enough to return an error. we should rollback all the changes made by DemoteMaster | |||
if err1 := wr.tmc.UndoDemoteMaster(ctx, oldMasterTabletInfo.Tablet); err1 != nil { | |||
log.Warningf("Encountered error %v while trying to undo DemoteMaster", err1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is encountered, I'm not sure where that leaves the state of the shard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the previous call to DemoteMaster succeeded, then this should also succeed unless something changed in the meantime. if it does fail, then you are back in today's situation where PromoteSlave failed and the cluster is in an inconsistent state.
Signed-off-by: deepthi <[email protected]>
@sougou and I discussed this and decided that the operator should be able to specify their tolerance, so yeah, a new flag. |
go/vt/wrangler/reparent.go
Outdated
if err != nil { | ||
if err != nil || (ctx.Err() != nil && ctx.Err() == context.DeadlineExceeded) { | ||
// if this fails it is not enough to return an error. we should rollback all the changes made by DemoteMaster | ||
if err1 := wr.tmc.UndoDemoteMaster(ctx, oldMasterTabletInfo.Tablet); err1 != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not work if the context has already expired. Is there a higher level context with longer timeout that we can use for this?
There should be one with the action timeout. Maybe use that, but with a shorter timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh, I missed that. No, there is no higher level context, this is it.
…-running operations during reparent. Deprecate lock_timeout and use this one instead for topo locking Signed-off-by: deepthi <[email protected]>
Fix #4481
Attempt to undo changes made during reparenting if we encounter a failure in a critical phase (DemoteMaster / PromoteSlave). Once those steps are complete we have reached a point of no return and cannot rollback.
Rather than separating the cancel logic into a function, the approach taken here is to rollback as and when we encounter an error. That way we know exactly which steps have to be rolled back, and there are many. The exception is DemoteMaster, for which an Undo function has been implemented.
One thing that hasn't been done yet is to check the new master's replication lag before going ahead with DemoteMaster. That will be done as a separate PR.
Signed-off-by: deepthi [email protected]