Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rollback changes made during PlannedReparent if we encounter errors #4550

Merged
merged 3 commits into from
Feb 22, 2019

Conversation

deepthi
Copy link
Member

@deepthi deepthi commented Jan 25, 2019

Fix #4481
Attempt to undo changes made during reparenting if we encounter a failure in a critical phase (DemoteMaster / PromoteSlave). Once those steps are complete we have reached a point of no return and cannot rollback.
Rather than separating the cancel logic into a function, the approach taken here is to rollback as and when we encounter an error. That way we know exactly which steps have to be rolled back, and there are many. The exception is DemoteMaster, for which an Undo function has been implemented.
One thing that hasn't been done yet is to check the new master's replication lag before going ahead with DemoteMaster. That will be done as a separate PR.

Signed-off-by: deepthi [email protected]

@deepthi deepthi requested a review from sougou as a code owner January 25, 2019 01:10
@deepthi deepthi changed the title WIP #4481 rollback changes made during reparent if we encounter errors WIP #4481 rollback changes made during PlannedReparent if we encounter errors Jan 25, 2019
@derekperkins
Copy link
Member

References #4481

Copy link
Member

@derekperkins derekperkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it is heading in a good direction, but I'm admittedly not that familiar with the internals to provide much useful feedback on the actual implementation. Maybe @rafael or @demmer could provide better direction.

On the logical side of PlannedReparentShard, I think it might be a good idea to add a check before demoting the master, to make it less likely to enter the rollback scenario you're covering here.

I don't know if this is worth a new flag or not, but it seems reasonable to me to say "if replication delay on the master-elect is greater than 10 seconds, error out early." @hmcgonig and @acharis at HubSpot are doing a similar check before they even run PlannedReparentShard.

@@ -426,6 +427,10 @@ func (wr *Wrangler) plannedReparentShardLocked(ctx context.Context, ev *events.R
event.DispatchUpdate(ev, "promoting slave")
rp, err = wr.tmc.PromoteSlaveWhenCaughtUp(ctx, masterElectTabletInfo.Tablet, rp)
if err != nil {
// if this fails it is not enough to return an error. we should rollback all the changes made by DemoteMaster
if err1 := wr.tmc.UndoDemoteMaster(ctx, oldMasterTabletInfo.Tablet); err1 != nil {
log.Warningf("Encountered error %v while trying to undo DemoteMaster", err1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is encountered, I'm not sure where that leaves the state of the shard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the previous call to DemoteMaster succeeded, then this should also succeed unless something changed in the meantime. if it does fail, then you are back in today's situation where PromoteSlave failed and the cluster is in an inconsistent state.

@deepthi deepthi changed the title WIP #4481 rollback changes made during PlannedReparent if we encounter errors Fix #4481 rollback changes made during PlannedReparent if we encounter errors Jan 29, 2019
@deepthi
Copy link
Member Author

deepthi commented Feb 15, 2019

I don't know if this is worth a new flag or not, but it seems reasonable to me to say "if replication delay on the master-elect is greater than 10 seconds, error out early." @hmcgonig and @acharis at HubSpot are doing a similar check before they even run PlannedReparentShard.

@sougou and I discussed this and decided that the operator should be able to specify their tolerance, so yeah, a new flag.

@deepthi deepthi changed the title Fix #4481 rollback changes made during PlannedReparent if we encounter errors rollback changes made during PlannedReparent if we encounter errors Feb 15, 2019
if err != nil {
if err != nil || (ctx.Err() != nil && ctx.Err() == context.DeadlineExceeded) {
// if this fails it is not enough to return an error. we should rollback all the changes made by DemoteMaster
if err1 := wr.tmc.UndoDemoteMaster(ctx, oldMasterTabletInfo.Tablet); err1 != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work if the context has already expired. Is there a higher level context with longer timeout that we can use for this?

There should be one with the action timeout. Maybe use that, but with a shorter timeout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, I missed that. No, there is no higher level context, this is it.

…-running operations during reparent. Deprecate lock_timeout and use this one instead for topo locking

Signed-off-by: deepthi <[email protected]>
@sougou sougou merged commit a6f5da6 into vitessio:master Feb 22, 2019
@harshit-gangal harshit-gangal deleted the ds-reparent-undo branch July 17, 2020 16:39
systay pushed a commit that referenced this pull request Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants