Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use GetTabletsByCell in healthcheck #14693

Merged
merged 11 commits into from
Dec 12, 2023
Merged

Conversation

deepthi
Copy link
Member

@deepthi deepthi commented Dec 6, 2023

Description

VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once.

This PR does a few more things:

  • GetTabletsForCell now handles the case where the response size violates gRPC limits by falling back to one tablet at a time in case of error.
  • Previously, the one tablet at a time method had unlimited concurrency. In this PR we introduce a configuration option for concurrency.
  • We pass topoReadConcurrency from healthcheck into GetTabletsForCell.
  • The behavior of --refresh_known_tablets flag is different now. Previously we would not read those tablets at all, now we do read them, but ignore any changes if they are already known.

The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1).

We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in #3965 are alleviated by fetching all tablets in one call to the topo.

Related Issue(s)

Fixes #14277

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

Copy link
Contributor

vitess-bot bot commented Dec 6, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Dec 6, 2023
@github-actions github-actions bot added this to the v19.0.0 milestone Dec 6, 2023
@deepthi deepthi added Type: Performance Component: Query Serving and removed NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels Dec 6, 2023
@mattlord mattlord removed the NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work label Dec 6, 2023
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this! Makes for a great improvement.

go/vt/topo/tablet.go Show resolved Hide resolved
go/vt/topo/tablet.go Show resolved Hide resolved
go/vt/topo/tablet.go Show resolved Hide resolved
go/vt/topo/tablet.go Outdated Show resolved Hide resolved
go/vt/topo/tablet.go Outdated Show resolved Hide resolved
go/vt/topo/tablet_test.go Show resolved Hide resolved
go/vt/topo/tablet_test.go Show resolved Hide resolved
go/stats/counters.go Show resolved Hide resolved
go/vt/topo/tablet.go Show resolved Hide resolved
@deepthi
Copy link
Member Author

deepthi commented Dec 7, 2023

I'm working on a test case as suggested by @mattlord, so we'll merge after that is done.

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some minor comments. The most consequential ones -- although still relatively minor -- were related to using a semaphore and deferring mutex unlocking. Let me know what you think. I can come back to this quickly and re-approve at any time.

go/vt/discovery/topology_watcher.go Outdated Show resolved Hide resolved
go/vt/discovery/topology_watcher.go Show resolved Hide resolved
}
if _, err := ts.UpdateTabletFields(context.Background(), tablet.Alias, func(t *topodatapb.Tablet) error {
})
require.Nil(t, err, "UpdateTabletFields failed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to use require.Nil instead of require.NoError here and in a few other places? It's fine, just curious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the func has two return values, not just a single return value of type error. I'd have preferred the NoError form, but it doesn't work for this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that you can use this instead (in the other cases you're just using the error return value directly from the function):

require.NoError(t, err, "UpdateTabletFields failed")

AFAIK they are equivalent, but it's a little more explicit. Not a problem at all either way.

tw.loadTablets()
checkOpCounts(t, counts, map[string]int64{"ListTablets": 1, "GetTablet": 0, "RemoveTablet": 1})
checkOpCounts(t, counts, map[string]int64{"ListTablets": 1, "RemoveTablet": 1})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this new call is equivalent to the old one. I tend to prefer the previous as it's more explicit about what we expect (w/o looking at how checkOpCounts works), but up to you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't call GetTablet at all any more, so this change reflects that fact. I thought about deprecating and eventually deleting the GetTablet metric, but it didn't seem worth it.
Maybe it is worth changing the tests to explicitly expect 0 calls to GetTablet as a way of catching regressions in this behavior.

go/vt/discovery/topology_watcher_test.go Show resolved Hide resolved
go/vt/topo/tablet_test.go Show resolved Hide resolved
go/vt/topo/memorytopo/memorytopo.go Show resolved Hide resolved
go/vt/topo/tablet_test.go Show resolved Hide resolved
go/vt/topo/tablet_test.go Outdated Show resolved Hide resolved
go/vt/topo/tablet_test.go Outdated Show resolved Hide resolved
…tch previous behavior, address review

Signed-off-by: deepthi <[email protected]>
@deepthi
Copy link
Member Author

deepthi commented Dec 12, 2023

@mattlord I think I've addressed all the comments. This is ready for final review. If it looks good to you, please remove the "Do Not Merge" label.

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some minor notes -- sorry to keep going back and forth on this. IMO it's worth being extra cautious in this case though.

tabletInfo, err := ts.GetTablet(ctx, tabletAlias)
mutex.Lock()
<-sem
mu.Lock()
Copy link
Contributor

@mattlord mattlord Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can get rid of this mutex usage here and replace the related returnError logic with sync.Once?

The other option is to use a concurrency.FirstErrorRecorder. Those both feel more standard to me.

I'm being paranoid (probably overly so) as it's easy to end up with deadlocks and panics. I guess we have unit test coverage though and thus -race tests for the function?

tabletInfo, err := ts.GetTablet(ctx, tabletAlias)
mutex.Lock()
<-sem
mu.Lock()
Copy link
Contributor

@mattlord mattlord Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever we do end up locking it, we should unlock it in a defer here as we release it at the end of the closure/goroutine anyway.


for _, tabletAlias := range tabletAliases {
wg.Add(1)
go func(tabletAlias *topodatapb.TabletAlias) {
defer wg.Done()
if err := sem.Acquire(ctx, 1); err != nil {
// Only happens if context is cancelled.
mu.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should do this:

mu.Lock()
defer mu.Unlock()
...
return

We definitely want to return in this block as we didn't acquire the semaphore so we don't want to continue on in the goroutine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be getting hard to follow at this point, so this is what I was thinking: https://gist.github.com/mattlord/089986f60bffe33d1d7b3faa21942e06

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah I was looking at the code again and realized I forgot a return 😮‍💨
Glad you caught it too. I think we actually need a lot more test coverage than we have, but I'm calling that debt at this point. We are slightly better off than before given that at least I added 4 unit tests.
Error conditions are usually the hardest to test, and that is where most the missing coverage tends to be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use different functions in sync.Once.Do, it will only execute the Do method once. But it doesn't matter. It's not much different than what is there now.

@mattlord mattlord self-requested a review December 12, 2023 03:58
// Server.FindAllShardsInKeyspace.
type GetTabletsByCellOptions struct {
// Concurrency controls the maximum number of concurrent calls to GetTablet.
// For backwards compatibility, concurrency of 0 is considered unlimited.
Copy link
Contributor

@mattlord mattlord Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment is right anymore, is it? I say that because we currently only use the value if it's > 0 (at least in GetTabletMap).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.. this has gone through 3 different implementations at this point, and somewhere in there we lost the infinite concurrency. I'm inclined to think 32 is fine as a default and we don't need to worry about infinite concurrency at all 🤷

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (again)! 🙂 I'm glad to see that we finally got this done. Thank you for that.

@deepthi deepthi merged commit 5d05612 into vitessio:main Dec 12, 2023
116 of 117 checks passed
@deepthi deepthi deleted the ds-get-tablets branch December 12, 2023 21:01
ejortegau pushed a commit to slackhq/vitess that referenced this pull request Dec 13, 2023
notfelineit pushed a commit to planetscale/vitess that referenced this pull request Mar 12, 2024
…etsByCell in healthcheck (vitessio#4572)

* backport of 3950

* fix conflicts

Signed-off-by: deepthi <[email protected]>

* fix test failures

Signed-off-by: deepthi <[email protected]>

* change GH runner to unblock PR build

Signed-off-by: deepthi <[email protected]>

---------

Signed-off-by: deepthi <[email protected]>
Co-authored-by: deepthi <[email protected]>
frouioui pushed a commit to planetscale/vitess that referenced this pull request Mar 26, 2024
…etsByCell in healthcheck (vitessio#4573)

* backport of 3950

* fix conflicts

Signed-off-by: deepthi <[email protected]>

* fix test failures

Signed-off-by: deepthi <[email protected]>

* change GH runner to unblock PR build

Signed-off-by: deepthi <[email protected]>

---------

Signed-off-by: deepthi <[email protected]>
Co-authored-by: deepthi <[email protected]>
ejortegau added a commit to slackhq/vitess that referenced this pull request Sep 16, 2024
This backports upstram PR vitessio#14693, with a few minor changes to make it work with the Go
version we are using and a small change to topology_watcher.go so that test cases reflect
and test for the same behavior as the upstream code. The description of the original PR
follows:

VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers
in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once.

This PR does a few more things:

* GetTabletsForCell now handles the case where the response size violates gRPC limits by
  falling back to one tablet at a time in case of error.
* Previously, the one tablet at a time method had unlimited concurrency. In this PR we
  introduce a configuration option for concurrency.
* We pass topoReadConcurrency from healthcheck into GetTabletsForCell.
* The behavior of --refresh_known_tablets flag is different now. Previously we would not
  read those tablets at all, now we do read them, but ignore any changes if they are
  already known.

The basic fix has already been tried in production and shown to reduce the number of Get
calls from vtgate -> topo from O(n) to O(1).

We can consider deprecating and deleting --refresh_known_tablets in a future release.
The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching
all tablets in one call to the topo.
ejortegau added a commit to slackhq/vitess that referenced this pull request Oct 4, 2024
* Backport Use GetTabletsByCell in healthcheck

This backports upstram PR vitessio#14693, with a few minor changes to make it work with the Go
version we are using and a small change to topology_watcher.go so that test cases reflect
and test for the same behavior as the upstream code. The description of the original PR
follows:

VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers
in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once.

This PR does a few more things:

* GetTabletsForCell now handles the case where the response size violates gRPC limits by
  falling back to one tablet at a time in case of error.
* Previously, the one tablet at a time method had unlimited concurrency. In this PR we
  introduce a configuration option for concurrency.
* We pass topoReadConcurrency from healthcheck into GetTabletsForCell.
* The behavior of --refresh_known_tablets flag is different now. Previously we would not
  read those tablets at all, now we do read them, but ignore any changes if they are
  already known.

The basic fix has already been tried in production and shown to reduce the number of Get
calls from vtgate -> topo from O(n) to O(1).

We can consider deprecating and deleting --refresh_known_tablets in a future release.
The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching
all tablets in one call to the topo.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Reduce number of topo calls from vtgate healthcheck
3 participants