schemachange: attempting to update succeeded job over and over #38088

Closed
andreimatei opened this issue Jun 7, 2019 · 10 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@andreimatei
Contributor

andreimatei commented Jun 7, 2019

A customer cluster got all gunked up because a schema change (or a table truncation?) fails a couple of times a second with the following amusing message:

W190603 13:33:12.043023 211 sql/schema_changer.go:1586  [n1] Error executing schema change: failed to update job 456021744522723331: cannot update progress on succeeded job (id 456021744522723331)

Why someone is trying to update the progress of a succeeded job, I do not know. Two nodes racing to finish the schema change, maybe?

The schema change in question is:

456021744522723331 | SCHEMA CHANGE     | TRUNCATE TABLE <redacted> CASCADE

The table has id: 4191 and state: DROP and drop_job_id: 456021744522723331

These schema change retries kill us because each one seems to acquire and release the "schema change lease" for this table (I can see this by diffing consecutive versions of the descriptor). Eventually the system config range becomes unable to accept writes because it has grown too big and cannot be split.

Debug.zip here (internal only)

@dt you want this one?

@andreimatei andreimatei added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Jun 7, 2019
@andreimatei
Contributor Author

cc @roncrdb

@dt
Member

dt commented Jun 7, 2019

There's certainly something going wrong during truncation. We should fix that.
In this case, a job update error should probably be a logged warning but not bubble up, since the job is only written and maintained to show the user what's happening on the jobs UI -- the actual, correctness-controlling state is mostly in the table descriptor itself for now.

All of that said, I'm wary that this or other bugs can get the schema changer into such a hot loop: acquire lease, hit error, release lease. It might be a good idea to rate-limit the acquisition of the lease for a given ID per node, to defend against cases like this in the future. I'm not quite sure where we should do that, since I don't really know where all the adoption stuff happens and whether there is a long-lived, stateful place we could put it, or if we'd be better off just making a var lastLeased map[ID]time.Time in pkg/sql and calling it a day. Either way, it seems like we want to make sure that a hot loop has something to keep it in check. I'll keep digging.
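A minimal sketch of that per-node throttle idea, assuming a standalone map keyed by descriptor ID; the type names, the backoff interval, and the call site are illustrative, not the schema changer's actual API:

```go
// Hypothetical per-node throttle on schema change lease acquisition.
// The ID type, interval, and call site are illustrative only.
package sql

import (
	"sync"
	"time"
)

type leaseThrottle struct {
	mu         sync.Mutex
	minBackoff time.Duration
	lastLeased map[int64]time.Time // table descriptor ID -> last acquisition
}

func newLeaseThrottle(minBackoff time.Duration) *leaseThrottle {
	return &leaseThrottle{
		minBackoff: minBackoff,
		lastLeased: make(map[int64]time.Time),
	}
}

// shouldAcquire reports whether enough time has passed since the last
// lease acquisition for this descriptor ID on this node. If so, it
// records the new acquisition time.
func (t *leaseThrottle) shouldAcquire(id int64, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if last, ok := t.lastLeased[id]; ok && now.Sub(last) < t.minBackoff {
		return false
	}
	t.lastLeased[id] = now
	return true
}
```

Even a crude guard like this would bound the descriptor churn from a retry loop to one lease acquisition per backoff interval per node.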

@andreimatei
Contributor Author

It has always seemed crazy to me to retry schema changes indefinitely on any unrecognized error, as we do. I think we should change that policy: sometimes revert the schema change, sometimes just pause it and require the admin to click a button.

And separately I think the schema change lease needs to get out of that descriptor.
And btw, my guess is that the attempt to update a finished job is a failure of this leasing mechanism, no?

@dt
Member

dt commented Jun 7, 2019

Without our own reproduction, I'm wary of trying to fix this just by fixing the error itself, since I'm not completely sure how we're getting there. But I do think it would make sense to delay the on-error release of the lease by, say, a minute.

@andreimatei
Contributor Author

andreimatei commented Jun 17, 2019

In the meantime, there now appear to be 7 runaway schema changes on this cluster. I have a fresh dump of the data in the system range in question, and here are the counts for the descriptor versions.

   700902 /Table/3/1/4567/2/1
   485368 /Table/3/1/5065/2/1
   482198 /Table/3/1/5078/2/1
   476015 /Table/3/1/4548/2/1
   455540 /Table/3/1/4191/2/1
   374620 /Table/3/1/4544/2/1
   315203 /Table/3/1/4554/2/1
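(For reference, a per-key version tally like the one above can be produced from a plain-text dump of the range's keys with a short program; the file name and key pattern below are assumptions about the dump format, not how these numbers were actually generated.)

```go
// Illustrative only: tally descriptor-key versions from a plain-text
// key dump. The file name and key pattern are assumptions.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	f, err := os.Open("system-range-keys.txt") // hypothetical dump file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Match descriptor keys of the form /Table/3/1/<descID>/2/1.
	re := regexp.MustCompile(`/Table/3/1/\d+/2/1`)
	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if k := re.FindString(sc.Text()); k != "" {
			counts[k]++
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
	for k, n := range counts {
		fmt.Printf("%8d %s\n", n, k)
	}
}
```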

The range is now 4GB in size :) (we've disabled write backpressure on this range). There also appears to be something funky with the GC process, since it seems to stay constantly 2 days behind its 1-day target; I'll track that separately.
I don't have new logs to verify that they're all failing in the way described for 4191 above.

@dt I think we should do what we can to find out what's going on with these jobs. The users here were willing to run custom binaries in the past. Internal communication with them in https://cockroachdb.zendesk.com/agent/tickets/3392

You interested in pulling on this? I'm interested in pulling on the GC angle.

@dt
Member

dt commented Jun 17, 2019

If needed, we could manually delete the descriptors from system.descriptor, which would stop the schema changer from trying to mess with them.

@dt
Member

dt commented Jun 17, 2019

Hmm, actually I guess even root doesn't have DELETE privileges on system.descriptor.

@andreimatei
Contributor Author

As far as I'm concerned, I'm more interested in understanding the bug that causes the zombie schema changes rather than getting the system range back into a good shape.

@irfansharif
Contributor

Seeing this happen here.

tbg added a commit to tbg/cockroach that referenced this issue Mar 2, 2020
If there is "a lot" of data in Sys{Bytes,Count}, then we are likely
experiencing a large abort span. The abort span is not supposed to
become that large, but it does happen and causes stability fallout,
usually due to a combination of shortcomings:

1. there's no trigger for GC based on abort span size alone (before
   this commit)
2. transaction aborts tended to create unnecessary abort span entries,
   fixed (and 19.2-backported) in:
   cockroachdb#42765
3. aborting transactions in a busy loop:
   cockroachdb#38088
   (and we suspect this also happens in user apps occasionally)
4. large snapshots would never complete due to the queue time limits
   (addressed in cockroachdb#44952).

In an ideal world, we would factor the abort span into this method
directly, but until then the condition guarding this block will do.
At worst, there is some other reason for SysBytes to become that large
while also incurring a large SysCount, but I'm not sure how that
would happen. The only other things in this span are the versioned
range descriptors (which follow regular GC and only ever add a
SysCount of one) and transaction records (which expire like abort
span records).

Release note (bug fix): Range garbage collection will now trigger
based on a large abort span, adding defense-in-depth against ranges
growing large (and eventually unstable).
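For illustration only, the kind of guard the commit message describes might look roughly like the following; the thresholds and function name are assumptions, not the actual storage-queue code:

```go
// Illustrative only: a size/count heuristic of the sort the commit
// describes. Thresholds and names are made up, not CockroachDB's code.
package gcheuristic

// suggestsLargeAbortSpan reports whether the range-local system data
// (SysBytes/SysCount) is large enough that a big abort span is the
// likely culprit, so GC should be queued even if the ordinary MVCC
// garbage thresholds haven't been hit.
func suggestsLargeAbortSpan(sysBytes, sysCount int64) bool {
	const (
		largeSysBytes = 64 << 20 // lots of range-local system data
		largeSysCount = 100000   // many individual entries, e.g. abort span records
	)
	return sysBytes > largeSysBytes || sysCount > largeSysCount
}
```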
tbg added a commit to tbg/cockroach that referenced this issue Mar 4, 2020 (same commit message as above)
tbg added a commit to tbg/cockroach that referenced this issue Mar 5, 2020 (same commit message as above)
craig bot pushed a commit that referenced this issue Mar 5, 2020
45573: storage: trigger GC based on SysCount/SysBytes r=ajwerner a=tbg


Co-authored-by: Tobias Schottdorf <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Mar 5, 2020 (same commit message as above)
@thoszhang
Contributor

Closing this now that schema changes run inside a job, which makes this error impossible. We also changed this:

It has always seemed crazy to me to retry schema changes indefinitely on any unrecognized error, as we do. I think we should change that policy: sometimes revert the schema change, sometimes just pause it and require the admin to click a button.

We don't do this anymore: on failure we try to roll back if applicable, but ultimately, if we hit an error that's not one of the whitelisted "retriable" errors, we fail the job permanently. And even for schema changes that are looping for some reason, you can now pause the job.
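As a rough illustration of that policy (not the actual jobs-package code; the error predicate and names here are assumptions), the control flow looks something like:

```go
// Illustrative only: retry on an allowlist of errors, otherwise fail
// the job permanently. Names and predicates are assumptions.
package jobs

import (
	"context"
	"errors"
	"fmt"
)

// errRetriable stands in for the whitelisted class of retriable errors.
var errRetriable = errors.New("retriable")

// isPermanentError treats anything not explicitly marked retriable as
// a permanent job failure.
func isPermanentError(err error) bool {
	return !errors.Is(err, errRetriable)
}

// runWithPolicy retries fn on retriable errors and gives up, failing
// the job, on anything else. Backoff is omitted in this sketch.
func runWithPolicy(ctx context.Context, fn func(context.Context) error) error {
	for {
		err := fn(ctx)
		if err == nil {
			return nil
		}
		if isPermanentError(err) {
			return fmt.Errorf("job failed permanently: %w", err)
		}
		// Retriable: loop and try again.
	}
}
```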
