TiKV GcKeys task doesn't work when called with multiple keys (at least in 5.1 but I think for everything) #11217
Comments
Tagging @hicqu @Connor1996 @MyonKeminta from #9959
Nice catch! It does have the problem. Would you like to file a PR to fix it?
Thanks for reporting this and digging so deep to find the cause.
Your suggested change should work, but I suggest adding the
Oh, good point about making RegionInfoAccessor consistent. Will cut a PR.
Added PR - going to backport to 5.1.2 locally and test on our cluster. Could somebody on the PingCAP side take on adding whatever tests you think appropriate?
Posted this as a comment on the PR as well, but as written it doesn't work: get_regions_in_range is also called from CompactionGuardGenerator, where we already have a data key. I think that function is also busted, because it then calls data_end_key(&region.end_key), where I think the region end_key is already data-encoded? This is starting to seem like it really needs an actual implementation in the type system for data-encoded keys vs. non-data-encoded keys or something, and I'm not convinced I have enough context to pursue it.
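One way to picture the "type system" idea from the comment above is a pair of newtypes, so that a raw key and a data-encoded key cannot be mixed up and the prefix can only be added once. This is only a rough sketch: `RawKey`, `DataKey`, `encode`, `from_encoded` and `seek_bound` are hypothetical names made up for illustration and are not part of TiKV. The only fact taken from the thread is that the data prefix is the byte 'z'.

```rust
/// Data prefix described in the issue (ASCII 'z', byte value 122).
const DATA_PREFIX: u8 = b'z';

/// Hypothetical wrapper for a user key that does NOT carry the data prefix.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct RawKey(Vec<u8>);

/// Hypothetical wrapper for a key that already carries the data prefix.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct DataKey(Vec<u8>);

impl RawKey {
    pub fn new(bytes: &[u8]) -> Self {
        RawKey(bytes.to_vec())
    }

    /// The only conversion from raw to encoded, so the prefix is added exactly once.
    pub fn encode(&self) -> DataKey {
        let mut v = Vec::with_capacity(self.0.len() + 1);
        v.push(DATA_PREFIX);
        v.extend_from_slice(&self.0);
        DataKey(v)
    }
}

impl DataKey {
    /// Wrap bytes that the caller asserts are already encoded.
    /// The prefix check is only a sanity check: a raw key may itself start with 'z'.
    pub fn from_encoded(bytes: &[u8]) -> Option<Self> {
        if bytes.first() == Some(&DATA_PREFIX) {
            Some(DataKey(bytes.to_vec()))
        } else {
            None
        }
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.0
    }
}

/// Hypothetical lookup helper: it accepts only `DataKey`, so passing a raw
/// key (or encoding twice) becomes a compile-time error instead of a silent bug.
pub fn seek_bound(start: &DataKey) -> &[u8] {
    start.as_bytes()
}
```

With types like these, a caller holding an already-encoded key would go through `DataKey::from_encoded`, while a caller holding a raw key would have to call `encode()` before any region lookup.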
Okay, here's a version (with logging as well) that's working locally: https://github.com/frew/tikv/pull/1/files Ran into two additional issues:
|
|
The problem is the other way - self.smallest_key and self.largest_key appear to already have the 'z' prefix, so following @MyonKeminta's suggestion of making the function prepend internally doesn't work.
okay, I’ll add a test and fix it
@Connor1996 okay, the current version (+logging) which seems to be working when applied to our 5.1.2 cluster is at: https://github.com/frew/tikv/pull/1/files Please let me know (+ feel free to ping me in Slack) if you have any questions. The logging is probably more verbose than you'd want to merge into mainline, but I think everything else is reasonable to merge. One thing I haven't checked is whether the other users of
(deleted one comment that misunderstood something) One proposal to make the code simpler: Connor1996#2 Other than that, looks good to me as far as I understand it.
* fix gc keys doesn't work
* close #11217 simplify rangekey
* improve test
* fix clippy
* add metrics
* use closed interval for end key
Signed-off-by: Connor1996
Co-authored-by: Ti Chi Robot
…#11248)
* fix gc keys doesn't work
* close tikv#11217 simplify rangekey
* improve test
* fix clippy
* add metrics
* use closed interval for end key
Signed-off-by: Connor1996
Co-authored-by: Ti Chi Robot
Signed-off-by: 5kbpers
Bug Report
In the GC worker (https://github.com/tikv/tikv/blob/master/src/server/gc_worker/gc_worker.rs#L340), the GcKeys task, when called with multiple keys, tries to get the list of regions overlapping the task's key range and then does a sorted merge of the keys against the space covered by those regions. However, it does not prepend the data prefix ('z') to the keys before passing them to get_regions_in_range, and get_regions_in_range does not prepend it either. The keys therefore form an invalid range and no regions are ever returned, so delete markers are never garbage collected.
To verify the issue, I added some logging statements in a branch: https://github.com/frew/tikv/pull/1/files#diff-fad0fef4b49a4159243096a9212032ae40e8b88cd0deb3c60df42a3dbdc639dfR385. An example line from the first logging statement, showing the issue in our cluster:
(note the prefix 122, the byte value of 'z', on the first_region but not on the start_key)
I suggest adding data_key() to https://github.com/tikv/tikv/blob/master/src/server/gc_worker/gc_worker.rs#L338 and the following line, but defer to your expertise.
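To make the failure mode concrete, here is a minimal sketch of why un-prefixed keys match no regions when region bounds are stored with the data prefix, and how encoding the keys first changes the result. It is a simplified model under stated assumptions, not the TiKV implementation: `Region`, `regions_in_range`, `sample_regions`, `lookup_raw` and `lookup_encoded` are hypothetical, and `data_key` here is a local stand-in that only does what the report describes (prepend 'z').

```rust
/// Data prefix described in the report (ASCII 'z', byte value 122).
const DATA_PREFIX: u8 = b'z';

/// Local stand-in for a `data_key`-style helper: prepend the data prefix.
fn data_key(key: &[u8]) -> Vec<u8> {
    let mut v = Vec::with_capacity(key.len() + 1);
    v.push(DATA_PREFIX);
    v.extend_from_slice(key);
    v
}

/// Hypothetical region record whose bounds are stored data-encoded.
struct Region {
    id: u64,
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
}

/// Hypothetical lookup: ids of regions overlapping the closed range [start, end].
/// Like the function in the report, it does NOT add the prefix itself.
fn regions_in_range(regions: &[Region], start: &[u8], end: &[u8]) -> Vec<u64> {
    regions
        .iter()
        .filter(|r| r.start.as_slice() <= end && start < r.end.as_slice())
        .map(|r| r.id)
        .collect()
}

/// Two made-up regions that together cover the whole encoded key space.
fn sample_regions() -> Vec<Region> {
    vec![
        Region { id: 1, start: data_key(b""), end: data_key(b"t5") },
        // The byte after 'z' serves as an upper bound for all data keys.
        Region { id: 2, start: data_key(b"t5"), end: vec![DATA_PREFIX + 1] },
    ]
}

/// Buggy call shape: raw keys are passed straight to the lookup.
fn lookup_raw(first: &[u8], last: &[u8]) -> Vec<u64> {
    regions_in_range(&sample_regions(), first, last)
}

/// Suggested call shape: keys are encoded before the lookup.
fn lookup_encoded(first: &[u8], last: &[u8]) -> Vec<u64> {
    regions_in_range(&sample_regions(), &data_key(first), &data_key(last))
}
```

In this model, `lookup_raw(b"t1", b"t9")` returns no regions, because every raw key starting with 't' sorts before the 'z' prefix on the region bounds, while `lookup_encoded(b"t1", b"t9")` returns both regions.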
What version of TiKV are you using?
5.1.2
What operating system and CPU are you using?
Linux on GCP n2d