schemachange: attempting to update succeeded job over and over #38088

Closed
andreimatei opened this issue Jun 7, 2019 · 10 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@andreimatei
Contributor

andreimatei commented Jun 7, 2019

A customer cluster got all gunked up because a schema change (or a table truncation?) fails a couple of times a second with the following amusing message:

W190603 13:33:12.043023 211 sql/schema_changer.go:1586  [n1] Error executing schema change: failed to update job 456021744522723331: cannot update progress on succeeded job (id 456021744522723331)

Why someone is trying to update the progress of a succeeded job, I do not know. Two nodes racing to finish the schema change, maybe?

The schema change in question is:

456021744522723331 | SCHEMA CHANGE     | TRUNCATE TABLE <redacted> CASCADE

The table has id: 4191 and state: DROP and drop_job_id: 456021744522723331

These schema change retries kill us because each one seems to acquire and release the "schema change lease" for this table (I can see this by diffing consecutive versions of the descriptor). Eventually the system config range becomes unable to accept writes because it has grown too big and cannot be split.

Debug.zip here (internal only)

@dt you want this one?

@andreimatei andreimatei added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Jun 7, 2019
@andreimatei
Contributor Author

cc @roncrdb

@dt
Member

dt commented Jun 7, 2019

There's certainly something going wrong during truncation. We should fix that.
In this case, a job update error should probably be a logged warning but not bubble up, since the job is only written and maintained to show the user what's happening on the jobs UI -- the actual, correctness-controlling state is mostly in the table descriptor itself for now.

All of that said, I'm wary that this or other bugs can get the schema changer into such a hot loop: acquire lease, hit error, release lease. It might be a good idea to rate-limit the acquisition of the lease for a given ID per node, to defend against cases like this in the future. I'm not quite sure where we should do that, since I don't really know where all the adoption stuff happens and whether there is a long-lived, stateful place we could put it, or if we'd be better off just making a var lastLeased map[ID]time.Time in pkg/sql and calling it a day. Either way, it seems like we want to make sure that a hot loop has something to keep it in check. I'll keep digging.
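A minimal sketch of that per-node throttle idea, assuming a standalone map keyed by descriptor ID; the type names, the backoff interval, and the call site are illustrative, not the schema changer's actual API:

```go
// Hypothetical per-node throttle on schema change lease acquisition.
// The ID type, interval, and call site are illustrative only.
package sql

import (
	"sync"
	"time"
)

type leaseThrottle struct {
	mu         sync.Mutex
	minBackoff time.Duration
	lastLeased map[int64]time.Time // table descriptor ID -> last acquisition
}

func newLeaseThrottle(minBackoff time.Duration) *leaseThrottle {
	return &leaseThrottle{
		minBackoff: minBackoff,
		lastLeased: make(map[int64]time.Time),
	}
}

// shouldAcquire reports whether enough time has passed since the last
// lease acquisition for this descriptor ID on this node. If so, it
// records the new acquisition time.
func (t *leaseThrottle) shouldAcquire(id int64, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if last, ok := t.lastLeased[id]; ok && now.Sub(last) < t.minBackoff {
		return false
	}
	t.lastLeased[id] = now
	return true
}
```

Even a crude guard like this would bound the descriptor churn from a retry loop to one lease acquisition per backoff interval per node.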

@andreimatei
Contributor Author

It has always seemed crazy to me to retry schema changes indefinitely on any unrecognized error, as we do. I think we should change that policy: sometimes revert the schema change, sometimes just pause it and require the admin to click a button.

And separately I think the schema change lease needs to get out of that descriptor.
And btw, my guess is that the attempt to update a finished job is a failure of this leasing mechanism, no?

@dt
Member

dt commented Jun 7, 2019

Without our own reproduction, I'm wary of trying to fix this just by fixing the error itself, since I'm not completely sure how we're getting there. But I do think it would make sense to delay the on-error release of the lease by, say, a minute.

@andreimatei
Contributor Author

andreimatei commented Jun 17, 2019

In the meantime, there now appear to be 7 runaway schema changes on this cluster. I have a fresh dump of the data in the system range in question, and here are the counts for the descriptor versions.

   700902 /Table/3/1/4567/2/1
   485368 /Table/3/1/5065/2/1
   482198 /Table/3/1/5078/2/1
   476015 /Table/3/1/4548/2/1
   455540 /Table/3/1/4191/2/1
   374620 /Table/3/1/4544/2/1
   315203 /Table/3/1/4554/2/1
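(For reference, a per-key version tally like the one above can be produced from a plain-text dump of the range's keys with a short program; the file name and key pattern below are assumptions about the dump format, not how these numbers were actually generated.)

```go
// Illustrative only: tally descriptor-key versions from a plain-text
// key dump. The file name and key pattern are assumptions.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	f, err := os.Open("system-range-keys.txt") // hypothetical dump file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Match descriptor keys of the form /Table/3/1/<descID>/2/1.
	re := regexp.MustCompile(`/Table/3/1/\d+/2/1`)
	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if k := re.FindString(sc.Text()); k != "" {
			counts[k]++
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
	for k, n := range counts {
		fmt.Printf("%8d %s\n", n, k)
	}
}
```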

The range is now 4GB in size :) (we've disabled write backpressure on this range). There also appears to be something funky with the GC process, since it seems to stay constantly 2 days behind its 1-day target; I'll track that separately.
I don't have new logs to verify that they're all failing in the way described for 4191 above.

@dt I think we should do what we can to find out what's going on with these jobs. The users here were willing to run custom binaries in the past. Internal communication with them in https://cockroachdb.zendesk.com/agent/tickets/3392

You interested in pulling on this? I'm interested in pulling on the GC angle.

@dt
Member

dt commented Jun 17, 2019

If needed, we could manually delete the descriptors from system.descriptor, which would stop the schema changer from trying to mess with them.

@dt
Member

dt commented Jun 17, 2019

Hmm, actually I guess even root doesn't have DELETE privileges on system.descriptor.

@andreimatei
Contributor Author

As far as I'm concerned, I'm more interested in understanding the bug that causes the zombie schema changes rather than getting the system range back into a good shape.

@irfansharif
Contributor

Seeing this happen here.

tbg added a commit to tbg/cockroach that referenced this issue Mar 2, 2020
If there is "a lot" of data in Sys{Bytes,Count}, then we are likely
experiencing a large abort span. The abort span is not supposed to
become that large, but it does happen and causes stability fallout,
usually due to a combination of shortcomings:

1. there's no trigger for GC based on abort span size alone (before
   this commit)
2. transaction aborts tended to create unnecessary abort span entries,
   fixed (and 19.2-backported) in:
   cockroachdb#42765
3. aborting transactions in a busy loop:
   cockroachdb#38088
   (and we suspect this also happens in user apps occasionally)
4. large snapshots would never complete due to the queue time limits
   (addressed in cockroachdb#44952).

In an ideal world, we would factor the abort span into this method
directly, but until then the condition guarding this block will do.
At worst, there is some other reason for SysBytes to become that large
while also incurring a large SysCount, but I'm not sure how that
would happen. The only other things in this span are the versioned
range descriptors (which follow regular GC and only ever add a
SysCount of one) and transaction records (which expire like abort
span records).

Release note (bug fix): Range garbage collection will now trigger
based on a large abort span, adding defense-in-depth against ranges
growing large (and eventually unstable).
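For illustration only, the kind of guard the commit message describes might look roughly like the following; the thresholds and function name are assumptions, not the actual storage-queue code:

```go
// Illustrative only: a size/count heuristic of the sort the commit
// describes. Thresholds and names are made up, not CockroachDB's code.
package gcheuristic

// suggestsLargeAbortSpan reports whether the range-local system data
// (SysBytes/SysCount) is large enough that a big abort span is the
// likely culprit, so GC should be queued even if the ordinary MVCC
// garbage thresholds haven't been hit.
func suggestsLargeAbortSpan(sysBytes, sysCount int64) bool {
	const (
		largeSysBytes = 64 << 20 // lots of range-local system data
		largeSysCount = 100000   // many individual entries, e.g. abort span records
	)
	return sysBytes > largeSysBytes || sysCount > largeSysCount
}
```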
tbg added a commit to tbg/cockroach that referenced this issue Mar 4, 2020 (same commit message as above)
tbg added a commit to tbg/cockroach that referenced this issue Mar 5, 2020 (same commit message as above)
craig bot pushed a commit that referenced this issue Mar 5, 2020
45573: storage: trigger GC based on SysCount/SysBytes r=ajwerner a=tbg


Co-authored-by: Tobias Schottdorf <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Mar 5, 2020 (same commit message as above)
@thoszhang
Contributor

Closing this now that schema changes run inside a job, which makes this error impossible. We also changed this:

It has always seemed crazy to me to retry schema changes indefinitely on any unrecognized error, as we do. I think we should change that policy: sometimes revert the schema change, sometimes just pause it and require the admin to click a button.

We don't do this anymore: on failure we try to roll back if applicable, but ultimately, if we hit an error that's not one of the whitelisted "retriable" errors, we fail the job permanently. And even for schema changes that are looping for some reason, you can now pause the job.
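As a rough illustration of that policy (not the actual jobs-package code; the error predicate and names here are assumptions), the control flow looks something like:

```go
// Illustrative only: retry on an allowlist of errors, otherwise fail
// the job permanently. Names and predicates are assumptions.
package jobs

import (
	"context"
	"errors"
	"fmt"
)

// errRetriable stands in for the whitelisted class of retriable errors.
var errRetriable = errors.New("retriable")

// isPermanentError treats anything not explicitly marked retriable as
// a permanent job failure.
func isPermanentError(err error) bool {
	return !errors.Is(err, errRetriable)
}

// runWithPolicy retries fn on retriable errors and gives up, failing
// the job, on anything else. Backoff is omitted in this sketch.
func runWithPolicy(ctx context.Context, fn func(context.Context) error) error {
	for {
		err := fn(ctx)
		if err == nil {
			return nil
		}
		if isPermanentError(err) {
			return fmt.Errorf("job failed permanently: %w", err)
		}
		// Retriable: loop and try again.
	}
}
```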
