Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and cleanup treatment of keyspace IDs and KeyRange #12524

Merged
merged 11 commits into from
Mar 24, 2023

Conversation

jeremycole
Copy link
Contributor

@jeremycole jeremycole commented Mar 1, 2023

Description

Fixes #12535

Initially we encountered a bug where unevenly split shards with a set of key ranges as (... 000280-000300, 0003-) did not properly initialize shard 000280-000300 due to believing it overlapped with 0003-. (This was due to a comparison where 000300 should have been equal to 0003 but was not.)

In the course of fixing this bug, I decided to refactor and cleanup the treatment of keyspace IDs and topodatapb.KeyRange values overall. The treatment of keyspace IDs was very inconsistent resulting in many comparisons not properly handling the KeyRange-specific edge cases (mainly due to logic using bytes.Compare scattered across the codebase):

  • The Start and End fields comparing empty/zero values as minimum-key and maximum-key, respectively.
  • Inconsistently handling Start or End fields containing nil and zero-length []byte.
  • Handling of functionally equal but differently represented values, primarily different-length keys e.g. 80 vs. 8000. (Attempts were made previously to resolve this with addPadding which proved incomplete; my initial fix which we shipped/tested internally added calls to addPadding to everywhere keyspace IDs were being passed, which was effective but very ugly and poorly factored.)

In addition I've added a number of tests and improved test coverage of much of the existing code, as well as clarifying existing comments and adding new ones. All of the existing tests continue to pass, with one exception which appears to be due to a bug in the original code which was then codified in the corresponding test:

Side quest: a bug fix for TestKeyRangeContiguous

TestKeyRangeContiguous tested "-" and "-40" and expected a result of true, but those key ranges are most definitely not contiguous. The corresponding code had the following set of logic:

	if left == nil {
		return right == nil || (len(right.Start) == 0 && len(right.End) == 0)
	}
	if right == nil {
		return len(left.Start) == 0 && len(left.End) == 0
	}

That's attempting to handle nil values (although inconsistently) but treating nil as a missing value (inconsistently), whereas the rest of the code treats nil as full-range. I believe this should as well. I've replaced it with new logic, which I believe to be more correct:

	if KeyRangeIsComplete(a) || KeyRangeIsComplete(b) {
		return false // no two KeyRange values can be contiguous if either is the complete range
	}

If there is some reason this should treat nil differently, let me know and I'll be happy to fix that and document it. However, there is only one call site and it doesn't read to me like it requires any special/different handling, so I believe this was just a misunderstanding.

Future work

There's still a fair bit of work to be done to clean this up, I believe at least the following would be reasonable:

  • Introduce a new explicit type KeyspaceID or similar for keyspace IDs and replace usage of []byte to the extent possible.
    • Adjust the naming/namespacing and interfaces for all the various functions on them to accomodate.
    • Additionally remove the type Uint64Key (which is only used in tests) in favor of the new type.
  • Perhaps introduce a first-class type KeyRange, re-organize all the various comparison functions, and consistently handle both parsing and formatting KeyRanges in a single set of functions.
  • Remove EvenShardsKeyRange and GenerateShardRanges which behave very questionably at best and are rarely used.

I did not attempt to do any of these here to keep the PR size reasonable, but if there is some level of agreement (or at least a lack of disagreement) I'd be happy to undertake them.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

There should not be any user-facing impact of this change, as it's purely backend and bug fixing existing functionality which was perhaps never used before.

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Mar 1, 2023
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Mar 1, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@mattlord mattlord added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Query Serving labels Mar 1, 2023
@harshit-gangal harshit-gangal removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Mar 1, 2023
go/vt/key/key.go Outdated Show resolved Hide resolved
Copy link
Member

@harshit-gangal harshit-gangal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good change and I think it will resolve the issue related to key range matching.

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Nice work on this, @jeremycole ! ❤️ I only had some minor nits/comments/suggestions. Please let me know what you think.

go/vt/key/key.go Outdated Show resolved Hide resolved
go/vt/key/key.go Outdated Show resolved Hide resolved
go/vt/key/key.go Outdated Show resolved Hide resolved
go/vt/key/key.go Outdated Show resolved Hide resolved
go/vt/key/key.go Outdated Show resolved Hide resolved
go/vt/key/key.go Show resolved Hide resolved
go/vt/key/key_test.go Show resolved Hide resolved
go/vt/key/key.go Outdated Show resolved Hide resolved
go/vt/key/key_test.go Outdated Show resolved Hide resolved
go/vt/key/key_test.go Outdated Show resolved Hide resolved
@deepthi
Copy link
Member

deepthi commented Mar 1, 2023

@jeremycole others have done a detailed review, so I just wanted to chime in say that this is great work! The other proposed changes all seem to be good ones from a code readability / quality / maintainability point of view as well, so there is no objection to proceeding along those lines.

You did say in the description that you ran into a bug. Is there an issue for that which can be linked here? If not, do you mind creating one?
It will also be good to create an issue for the proposed improvements to link to an eventual PR.

go/vt/key/key.go Outdated
Comment on lines 60 to 62
func Normalize(id []byte) []byte {
if len(id) >= 8 {
return id[:8]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One breaking change that we should talk about is restricting it is 8-byte length. which the current implementation does not do.
Current implementation support keyspace id to be arbitrary length based on the Vindex.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of longer-than-8-byte keyspace IDs can be seen here (link courtesy @harshit-gangal)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I (also) mistakenly thought they were always 64 bits, but that's only the length of the default hash vindex based keyspace IDs. Some are shorter and others longer. You can see examples of how many hex digits are produced for each vindex type by using the vindex query functions: https://vitess.io/docs/17.0/reference/features/vindexes/#query-vindex-functions

I was also pointed this blog post that I don't remember previously reading: https://vitess.io/blog/2017-09-18-custom-sharding-with-vitess/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean here. I am not sure that's a real case that anyone is using but I am quite okay with not changing the semantics here. We can pivot the Normalize method to, instead of padding out to a fixed length (which there isn't any value that would satisfy the requirements without defining a maximum length) to instead remove all trailing zeroes consistently.

I made that change locally and it works fine and all tests pass the same except for TestNormalize (since it's explicitly checking the return values of Normalize, so it would be expected). I'll push another commit for this.

@jeremycole
Copy link
Contributor Author

Thanks for the prompting, I filed #12535 with the actual bug we hit that caused us to go down this rabbit hole. :)

@jeremycole
Copy link
Contributor Author

👋 I think I have addressed all review feedback with the last commits pushed. Sorry for adding everyone as reviewers, every time I touch a new file it seems we get a new reviewer added automatically. 😄

I broke up the additional work into several separate commits so that those changes can more easily be reviewed in isolation if you so desire. Let me know anything else you need! (And we'll see how the full CI runs go.)

@deepthi
Copy link
Member

deepthi commented Mar 8, 2023

Looks like you broke something, both unit_test and local example are failing. On the plus side, they should be easy to reproduce locally!

@jeremycole
Copy link
Contributor Author

The failure in unit_test was a flake unrelated to my changes, seems to be a timeout or something in gRPC stuff resulting in FAIL: TestMtlsAuth (60.11s). It succeeds locally.

I haven't quite gotten the local_example to work locally but in the mean time would y'all mind rerunning those tests?

@deepthi
Copy link
Member

deepthi commented Mar 8, 2023

Looks like the local example failure is a flake as well. I see it failing on a PR that touches no code here #12566
https://github.com/vitessio/vitess/actions/runs/4356176371/jobs/7613821315

@deepthi
Copy link
Member

deepthi commented Mar 8, 2023

@jeremycole I re-ran the 3 failing tests. I've also invited you to the org, so that in future you will have the ability to do this too. Invite goes to your primary GH email.

@jeremycole
Copy link
Contributor Author

@mattlord does this address your comments and concerns?

@jeremycole
Copy link
Contributor Author

👋 We're still holding on this but using it in our internal branch. Are there any remaining concerns about it?

@mattlord
Copy link
Contributor

👋 We're still holding on this but using it in our internal branch. Are there any remaining concerns about it?

Hey! Sorry for the delay. I'll add this review to my ToDo for tomorrow.

@mattlord mattlord self-requested a review March 22, 2023 06:57
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this, @jeremycole ! The quality of the code, comments, PR description, issue etc are all very good. ❤️

@mattlord mattlord merged commit 47f7234 into vitessio:main Mar 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Query Serving Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: Creating shards for a keyspace with differing keyspace ID lengths fails to mark tablet as serving
5 participants