
kvserver: don't use ClearRange point deletes with estimated MVCC stats #74674

Merged
merged 1 commit into from
Jan 13, 2022

Conversation

erikgrinaker
Contributor

@erikgrinaker erikgrinaker commented Jan 11, 2022

`ClearRange` avoids dropping a Pebble range tombstone if the amount of
data that's deleted is small (<=512 KB), instead writing individual point
deletions. It uses MVCC statistics to determine this. However, when
clearing an entire range, it relies on the existing range MVCC stats
rather than computing them.

These range statistics can be highly inaccurate -- in some cases so
inaccurate that they even become negative. This in turn can cause
`ClearRange` to submit a huge write batch, which gets rejected by Raft
with a `command too large` error.

This patch avoids dropping point deletes if the statistics are estimated
(which is only the case when clearing an entire range). Alternatively,
it could do a full stats recomputation in this case, but entire range
deletions seem likely to be large and/or rare enough that dropping a
range tombstone is fine.
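The decision described above can be sketched roughly as follows. This is an illustrative Go sketch, not CockroachDB's actual code: the type, field, and constant names (`mvccStats`, `clearRangeThreshold`, `usePointDeletes`) are assumed for the example.

```go
package main

import "fmt"

// clearRangeThreshold mirrors the 512 KB cutoff mentioned above.
const clearRangeThreshold = 512 << 10 // 512 KB

// mvccStats is a stand-in for the range's MVCC statistics.
type mvccStats struct {
	ContainsEstimates bool
	KeyBytes          int64
	ValBytes          int64
}

// usePointDeletes reports whether ClearRange should issue individual
// point deletions instead of a Pebble range tombstone. With this patch,
// estimated stats always force the range-tombstone path, because the
// byte counts cannot be trusted (they may even be negative).
func usePointDeletes(ms mvccStats) bool {
	if ms.ContainsEstimates {
		return false
	}
	return ms.KeyBytes+ms.ValBytes <= clearRangeThreshold
}

func main() {
	fmt.Println(usePointDeletes(mvccStats{KeyBytes: 100 << 10, ValBytes: 200 << 10})) // small, accurate: true
	fmt.Println(usePointDeletes(mvccStats{ContainsEstimates: true, KeyBytes: -42}))   // estimated: false
	fmt.Println(usePointDeletes(mvccStats{KeyBytes: 1 << 20}))                        // large: false
}
```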

Resolves #74686.

Release note (bug fix): Fixed a bug where deleting data via schema
changes (e.g. when dropping an index or table) could fail with a
"command too large" error.

@cockroach-teamcity
Member

This change is Reviewable

@erikgrinaker
Contributor Author

erikgrinaker commented Jan 11, 2022

@sumeerbhola Do you have an opinion on this assessment?

This patch avoids dropping point deletes if the statistics are estimated (which is only the case when clearing an entire range). Alternatively, it could do a full stats recomputation in this case, but entire range deletions seem likely to be large and/or rare enough that dropping a range tombstone is fine.

@erikgrinaker erikgrinaker changed the title kvserver: don't use ClearRange point deletes with estimates MVCC stats kvserver: don't use ClearRange point deletes with estimated MVCC stats Jan 11, 2022
@dt
Member

dt commented Jan 11, 2022

Even if we take the point delete path, wouldn't we prefer to return a resume span before we run into cmd-too-large?

@erikgrinaker
Contributor Author

erikgrinaker commented Jan 11, 2022

Even if we take the point delete path, wouldn't we prefer to return a resume span before we run into cmd-too-large?

We limit the point delete path to 512 KB keys+values, so assuming accurate statistics (which we now ensure) there is no way this can exceed 512 KB. I suppose we could keep a running tally and fall back to the range delete (rather than a resume span) if we find out that the batch actually exceeds the estimate, but do we need to? In that case, I think I'd instead prefer just to recompute stats if they're estimates.
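The running-tally fallback floated here could look something like the following. This is a hypothetical sketch, not the interface the PR touches: `kv`, `clearSpan`, and `maxPointDeleteBytes` are invented names for illustration.

```go
package main

import "fmt"

// maxPointDeleteBytes mirrors the 512 KB point-delete limit.
const maxPointDeleteBytes = 512 << 10

// kv is a stand-in for a key/value pair scanned from the span.
type kv struct {
	key, val []byte
}

// clearSpan returns the keys to point-delete, or nil if the running
// byte tally exceeds the limit -- i.e. the stats under-estimated and
// the caller should fall back to a range tombstone instead.
func clearSpan(kvs []kv) [][]byte {
	var tally int
	keys := make([][]byte, 0, len(kvs))
	for _, e := range kvs {
		tally += len(e.key) + len(e.val)
		if tally > maxPointDeleteBytes {
			return nil // stats lied: use a range tombstone
		}
		keys = append(keys, e.key)
	}
	return keys
}

func main() {
	small := []kv{{[]byte("a"), []byte("1")}, {[]byte("b"), []byte("2")}}
	fmt.Println(len(clearSpan(small))) // 2
	big := []kv{{[]byte("k"), make([]byte, 1<<20)}}
	fmt.Println(clearSpan(big) == nil) // true
}
```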

@sumeerbhola
Collaborator

@sumeerbhola Do you have an opinion on this assessment?

This sounds fine. I am assuming estimated stats are rare.

Even if we take the point delete path, wouldn't we prefer to return a resume span before we run into cmd-too-large?

The interface for ClearIterRange is not set up for this -- maybe a wider change would still be backportable, but it doesn't seem necessary assuming ContainsEstimates is not lying.

so assuming accurate statistics (which we now ensure) there is no way this can exceed 512 KB

I am curious about this "which we now ensure" -- can you point me to what changed?

@erikgrinaker
Contributor Author

This sounds fine. I am assuming estimated stats are rare.

Not that rare. For example, index backfills always use estimated stats across all ranges, since ensuring accurate stats has a significant performance penalty. But it's probably still rare enough that it's unlikely that both 1) the amount of data written is <512 KB, and 2) the number of affected ranges is large enough that the range tombstones now become problematic.

so assuming accurate statistics (which we now ensure) there is no way this can exceed 512 KB

I am curious about this "which we now ensure" -- can you point me to what changed?

We ensure that we do not take the point deletion path if stats are estimated. We trust that stats are accurate when they are claimed to be, and I believe we have multiple assertions for this during e.g. race builds and tests.
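The invariant being trusted here -- stats that claim to be exact must match a recomputation -- could be asserted along these lines. A hypothetical sketch only; the type and function names are invented, and the real assertions live elsewhere in the codebase.

```go
package main

import "fmt"

// stats is a stand-in for a range's MVCC statistics.
type stats struct {
	ContainsEstimates bool
	LiveBytes         int64
}

// assertStatsAccurate checks that stored stats claiming exactness agree
// with a fresh recomputation over the range's data. Estimated stats are
// allowed to diverge arbitrarily.
func assertStatsAccurate(stored, recomputed stats) error {
	if stored.ContainsEstimates {
		return nil // estimates may diverge; nothing to check
	}
	if stored.LiveBytes != recomputed.LiveBytes {
		return fmt.Errorf("stats mismatch: stored %d, recomputed %d",
			stored.LiveBytes, recomputed.LiveBytes)
	}
	return nil
}

func main() {
	fmt.Println(assertStatsAccurate(stats{LiveBytes: 10}, stats{LiveBytes: 10}) == nil)                         // true
	fmt.Println(assertStatsAccurate(stats{ContainsEstimates: true, LiveBytes: -5}, stats{LiveBytes: 10}) == nil) // true
	fmt.Println(assertStatsAccurate(stats{LiveBytes: 5}, stats{LiveBytes: 10}) == nil)                           // false
}
```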

@erikgrinaker
Contributor Author

TFTR!

bors r=tbg

@craig
Contributor

craig bot commented Jan 13, 2022

Build succeeded:

@craig craig bot merged commit e3cbcf9 into cockroachdb:master Jan 13, 2022
tbg added a commit to tbg/cockroach that referenced this pull request Jan 24, 2022
On the leaseholder, `ctx` passed to `computeChecksumPostApply` is that
of the proposal. As of cockroachdb#71806, this context is canceled right after the
corresponding proposal is signaled (and the client goroutine returns
from `sendWithRangeID`). This effectively prevents most consistency
checks from succeeding (they previously were not affected by
higher-level cancellation because the consistency check is triggered
from a local queue that talks directly to the replica, i.e. had
something like a minutes-long timeout).

This caused disastrous behavior in the `clearrange` suite of roachtests.
That test imports a large table. After the import, most ranges have
estimates (due to the ctx cancellation preventing the consistency
checks, which as a byproduct trigger stats adjustments) and their stats
claim that they are very small. Before recent PR cockroachdb#74674, `ClearRange` on
such ranges would use individual point deletions instead of the much
more efficient pebble range deletions, effectively writing a lot of data
and running the nodes out of disk.

Failures of `clearrange` with cockroachdb#74674 were also observed, but they did
not involve out-of-disk situations, so are possibly an alternative
failure mode (that may still be related to the newly introduced presence
of context cancellation).

Touches cockroachdb#68303.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Jan 25, 2022
craig bot pushed a commit that referenced this pull request Jan 25, 2022
75448: kvserver: use Background() in computeChecksumPostApply goroutine r=erikgrinaker a=tbg

On the leaseholder, `ctx` passed to `computeChecksumPostApply` is that
of the proposal. As of #71806, this context is canceled right after the
corresponding proposal is signaled (and the client goroutine returns
from `sendWithRangeID`). This effectively prevents most consistency
checks from succeeding (they previously were not affected by
higher-level cancellation because the consistency check is triggered
from a local queue that talks directly to the replica, i.e. had
something like a minutes-long timeout).

This caused disastrous behavior in the `clearrange` suite of roachtests.
That test imports a large table. After the import, most ranges have
estimates (due to the ctx cancellation preventing the consistency
checks, which as a byproduct trigger stats adjustments) and their stats
claim that they are very small. Before recent PR #74674, `ClearRange` on
such ranges would use individual point deletions instead of the much
more efficient pebble range deletions, effectively writing a lot of data
and running the nodes out of disk.

Failures of `clearrange` with #74674 were also observed, but they did
not involve out-of-disk situations, so are possibly an alternative
failure mode (that may still be related to the newly introduced presence
of context cancellation).

Touches #68303.

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
nicktrav pushed a commit that referenced this pull request Jan 28, 2022
@erikgrinaker erikgrinaker deleted the clearrange-stats-estimates branch February 14, 2022 12:15