kvprober: implement "shadow write" probes #67112
Here you go:
Just some boilerplate, but not much in terms of technical challenges. I think we don't even need to add a new command, though; it feels as though adding a new key like the ones at Lines 398 to 402 in a7472e3 (with suffix "prbe") would ~get the job done. It could then also be used as the key read by the read prober. You can write that key through the existing KV API, so very little plumbing is needed.
That would be awesome. In addition to being easier to implement, that way we actually write to pebble, too. Re: "range-local", I am reading: Line 118 in a7472e3
Gonna need to read that a few times more... How do you guarantee the range-local key doesn't conflict with some other key, e.g. a key where we store SQL data?
I found Line 93 in 4d987b3, which seems to answer my question. Reading.
This is basically a key that is guaranteed to not have any function yet. It's in a "parallel plane" relative to the SQL data or anything else the tenant might write. (I think it isn't even allowed to).
Ack. Thanks. Will take a closer look soon. This seems like a great direction.
OK, so pattern matching a bit... I add a new prefix here, maybe "prb" or "probe": Line 213 in 4d987b3
Then I add a function like this, maybe Line 398 in 4d987b3
Then I call it from Line 1099 in 4d987b3
And I pass the returned key into methods like the existing probe code paths. Do I have that right? Now things get very fuzzy for me... gonna ask Qs that may be obvious to help me learn. I presume Line 510 in 4d987b3 is the relevant spot?
I also presume that at a storage level, the key is written with the range-local prefix as per below?
I am confused how range merges & splits work. A change in a range's start & end key means a change in the key at which its range descriptor is stored, since part of the key is the range start key, right? Do we move these KV pairs at range merge & split time somehow? Would we need to do that with the prober KV pairs, or not? Let's say there are two ranges: A = ["a", "c") and B = ["c", "e").
These ranges get merged into a range A that starts at key "a" and ends at key "e". Planning ran before the merge happened, returning range B with start key "c". So we try probing range B's key even though B no longer exists. What happens? I guess it is also possible that the read side of the prober would read a key before it has been written by the write side of the prober, but reading an empty key is not an error, so that is okay.
👋🏽
Four chars please, which is why I suggested "prbe".
Yes, this is all done transparently. For example, see cockroach/pkg/kv/kvserver/replica_command.go, Lines 2624 to 2648 in 8238ce0
The start key of a range is immutable while the range exists (splits never change the start key, and merges extend the end key, i.e. the right-hand side stops existing). So it's possible that you would probe r1, then r1 subsumes r2, but you still probe r2 (which then writes an extra key in the middle of r1 that wouldn't be anyone's responsibility to clean up). This is mildly annoying, but something you can do here is to not actually write anything! There are two options: write the key in a txn, read it back, and then roll the txn back; or write the key and delete it again in the same txn and commit, leaving no live data behind.
I think I prefer the second option. The first one has this mysterious read (which you need so that the txn blocks on replicating the previous write; otherwise it'll sneakily not wait for the write to complete before you abort the txn, being none the wiser that it didn't go through). The second one avoids that bit of complexity and actually commits a txn, which is nice. I'd have to check whether it actually leaves a tombstone at the MVCC layer, but it doesn't really matter; it definitely blocks on replication.
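(For illustration, a minimal sketch of what the second option could look like against the public `kv` client API; `writeProbe` and the probe value are made-up names, not the eventual kvprober code.)

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// writeProbe commits a txn that puts and then deletes the probe key, so the
// write codepaths (propose, replicate, apply, txn commit) are exercised
// without leaving live data behind.
func writeProbe(ctx context.Context, db *kv.DB, probeKey roachpb.Key) error {
	return db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		b := txn.NewBatch()
		b.Put(probeKey, []byte("probe")) // leaves an intent; commit blocks on replication
		b.Del(probeKey)                  // deleted in the same txn, so no live data survives
		return txn.CommitInBatch(ctx, b)
	})
}
```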
Hi, Tobias!
Very fun. Yay to transactions.
Ack, yes, I think this is what I was thinking about. Would GC clean up this key eventually, or does GC not operate on these range-local key pairs? Not saying it's a good idea to rely on GC; more wondering if GC operates on this part of the keyspace at all.
Would be nice if a write was made to storage IMO (that is, would be nice if a tombstone was left). A bit more coverage that way? Or maybe, given the details, not so important?
GC never cleans up live values; this value would be live, so GC would never touch it (it is otherwise a regular value to GC, i.e. no special casing). It would be around "forever". So, good to avoid this scenario via the txn option.
Not so important I think. What matters is a full trip through the replication layer (leaving an intent, which also does all of the writing to storage that exists), reading that back, and making it through committing the txn record. Option 2 achieves all of the above.
Oh and btw, this is also a full read probe already, i.e. unless you somehow want to read probe much more aggressively than write probe, I think this could easily be the only kind of probe and that would be totally ok.
Nice! All makes sense. This is awesome. Thanks for the help as always.
I'm thinking more about this. My gut says it's nice to have separate measurements of read & write availability, but I am not entirely figuring out why I feel that way. Perhaps part of it is that two probe types may provide useful data during an incident. If read is green but write is red, that means a different set of potential prod issues than if read & write are both red. Similarly, when writing postmortems, it may be nice to be able to speak about both read & write availability. Perhaps the possibility of sending a higher rate of reads than writes per second is a piece of the puzzle also. Lastly, I think we also want a scan-at-historical-timestamp prober, as that will provide a good production vet of pebble especially. So there will be multiple things we want to probe for, so kvprober should bake in reads and writes. Any thoughts, Tobias?
These are reasonable points, except this one:
This will exercise follower reads, but has no particular coverage of pebble. To probe pebble, a) you want to know what node you're talking to, which kvprober doesn't (it operates above the abstraction of "range"), and b) you want to do something better than ask pebble for that one key! You'd pick a random SST from the manifest and seek somewhere into it, or something like that. Also, re: "there will be multiple things we want to probe for so kvprober should bake in reads and writes", wouldn't it be the opposite? We don't want an ever-proliferating set of cluster settings and knobs, do we? Or at least we wouldn't want to commit to that too early.
Will open a separate issue with both you & storage on it, but I'm thinking the scan prober would scan the whole range at a historical timestamp and export metrics re: the bandwidth that live data (at the historical timestamp) was scanned at. I'd expect a big enough dip in that bandwidth to indicate a production issue, and the root cause would often be pebble (e.g. an "inverted" LSM) or else infra issues (e.g. an issue with disk leading to lower perf than expected). Does that seem wrong? I hear you saying we can get a better production vet of pebble by probing based on knowledge of internal details of pebble ("pick an SST", a concept not exposed by either the KV or SQL layers). Agree? Disagree? Half disagree / half agree?
Ya, you're right. My thinking got a little twisted there. We only want the complexity of multiple probe types when it's worth the costs. |
> Will open a separate issue with both you & storage on it, but I'm thinking the scan prober would scan the whole range at a historical timestamp and export metrics re: the bandwidth that live data (at the historical timestamp) was scanned at. I'd expect a big enough dip in that bandwidth indicates a production issue, and the root cause would often be pebble (e.g. an "inverted" LSM) or else infra issues (e.g. disk bandwidth issues). Does that seem wrong?
It doesn't seem right. If you're going to probe pebble you want to probe pebble. With a historical read through KV you will be going through KV, and so that's what you're probing (you will also be much more likely to interfere with user traffic, etc). The point is less about internals vs non-internals, it's about direct access to the subsystem you want to extract information from. I'm all for probing pebble, but I don't think going through KV for that is right, similar to how you decided against probing KV from SQL by issuing random SELECTs. The straw man canonical solution is adding an endpoint here
https://github.com/cockroachdb/cockroach/blob/40029d161a803b0ee7d105ec737c0a205571a0b9/pkg/kv/kvserver/storage_services.proto#L30-L32
and the implementation does whatever the storage folks think a good probe is, against that store's `storage.Engine`:
https://github.com/cockroachdb/cockroach/blob/fe57f71323d06512bab5efe7c80a98163982e4b2/pkg/storage/engine.go#L681-L782
A separate issue for this makes sense, but I imagine that this probe would try to sync an invisible write via `eng.NewBatch().LogData([]byte("pebble-probe"))` (and committing that batch), and there is probably also a read flavor to it, though with just the interface, you won't find anything interesting to read. The LSM is really the "range descriptor" at the pebble level, so by analogy it does make a ton of sense to pick a few random SSTs and try to access them.
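(To make that concrete, a rough sketch of such a per-store probe against `storage.Engine`; the function name is made up, and whether a sync-committed `LogData`-only batch is the right probe is exactly the question for the storage folks.)

```go
import "github.com/cockroachdb/cockroach/pkg/storage"

// probeEngine commits a batch whose only content is a LogData record: it
// exercises the WAL and an fsync but leaves no visible keys behind.
func probeEngine(eng storage.Engine) error {
	b := eng.NewBatch()
	defer b.Close()
	if err := b.LogData([]byte("pebble-probe")); err != nil {
		return err
	}
	return b.Commit(true /* sync */)
}
```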
I'm fairly lukewarm on the "scan the range and compute throughput" idea; we want to avoid adding work to the system as a result of probes (in particular, accessing large sets of data will impact stats, user traffic, compactions, everything basically), and it's hard to form expectations about what a good throughput is since that depends heavily on the system. If a probe starts looking too much like it's trying to be an alerting rule, something is wrong.
Ack. Will digest this comment a bit, think more in shower :), & start more discussion in another place. Thanks, Tobias.
I am working on this now.
Just put up a PR that adds a range-local key dedicated to blackbox probing: #68645. I have a hacked POC of the "shadow write" functionality, and it appears to be working:
I will turn this into a proper PR now. I will leave read probes intact as per the above discussion about value of having both. |
68645: keys/kvprober: introduce a range-local key for probing, use from kvprober r=tbg a=joshimhoff

This work sets the stage for extending `kvprober` to do writes, as is discussed in detail with @tbg at #67112.

**keys: add a range-local key for probing**

This commit introduces a range-local key for probing. The key will only be used by probing components like kvprober. This means no contention with user traffic or other CRDB components. This key also provides a safe place to write to in order to test write availability. A kvprober that does writes is coming soon.

**kvprober: probe the range-local key dedicated to probing**

Before this commit, kvprober probed the start key of a range. This worked okay, as kvprober only did reads, and contention issues leading to false-positive pages haven't happened in practice. But contention issues are possible, as there may be data located at the start key of the range. With this commit, kvprober probes the range-local key dedicated to probing. No contention issues are possible, as that key is only for probing. This key is also needed for write probes, which are coming soon.
69035: kvprober: extend kvprober to test KV's write codepaths r=joshimhoff a=joshimhoff

Design discussed in detail at #67112.

**kvprober: extend kvprober to test KV's write codepaths**

Before this commit, kvprober did point reads of range-local keys dedicated to probing. kvprober did not exercise KV's write codepaths. With this commit, kvprober exercises KV's write codepaths. kvprober has two probe loops: one that does point reads & another that commits a txn that puts and deletes a key, leaving no live data but exercising the write codepaths.
Is your feature request related to a problem? Please describe.
We have a `kvprober` that sends point read requests to "random" ranges. We should extend that prober to test the availability of a range at a write level. We can call this a "shadow write".

Describe the solution you'd like
Strawman proposal:
- Add a `Probe`/`ShadowWrite` command to KV and make it available via the `kvclient` public API.
- Extend `kvprober` to make `Probe`/`ShadowWrite` requests to "random" ranges.

The test of `kv` is decent. The `Probe`/`ShadowWrite` command needs to get proposed, agreed upon, applied, etc. (Am I using these words correctly?) A write to the raft log will happen, so availability of the disk is checked.

The test of `pebble` is minimal, as no actual write happens at `Probe` command apply time. Note though that we could change this in future CRDB versions. One can imagine writing to pebble but in a way that doesn't lead to user-visible side effects, in order to improve the realism of the probe (to match the actual CRDB write codepath more closely).

CC @tbg @andreimatei @knz @bdarnell @jreut @logston for review of the strawman proposal. I hope for a naming bikeshed.
Also, KV folks: how hard of a time do you think I will have implementing this? It's hard for me to scope the "add `Probe`/`ShadowWrite` command" part. My sense from talking with Ben a while back is that it's not technically hard, really, but lots of boilerplate; also, a new command hasn't been added in a while, so it may be tricky to figure out all the places to make changes.

Describe alternatives you've considered
These aren't really alternatives, though. Blackbox approaches like this one are complemented by whitebox approaches.
Additional context
#61074
https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/kvprober.go
Epic CC-4054