-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Topology watcher refreshKnownTablets option #3965
Topology watcher refreshKnownTablets option #3965
Conversation
Using the newly added counters, add verification that the various operation counts occur as expected. This required also adding calls to topo.FixShardReplication in the to avoid differences in the operation counts between the two types of topology watchers. Signed-off-by: Michael Demmer <[email protected]>
Unlike ResetAll, ZeroAll keeps all the same keys in the map but changes all the values to zero. Signed-off-by: Michael Demmer <[email protected]>
Instead of tracking all the tablets by the TabletToMapKey value, use the alias as the key to all the data structures used in the scan comparisons. This change mostly doesn't change the behavior at all, with one exception when a tablet with a known alias changes the value of its address key. Previously the watcher would call AddTablet, then RemoveTablet, now it explicitly calls ReplaceTablet, which has the same net effect and seems more correct. Signed-off-by: Michael Demmer <[email protected]>
Add a refreshKnownTablets option for the TopologyWatcher and a corresponding flag in discovery gateway. The default behavior is unchanged which means that each vtgate will periodically re-read the TabletInfo record for each tablet in case the address/port map changes. However the new flag can disable these queries for environments in which the association between a tablet alias and the host/port map never changes. This greatly reduces the load on the topo service since most of the k/v requests are for refreshing the TabletInfo and there's no efficient way to watch for this data. Signed-off-by: Michael Demmer <[email protected]>
Signed-off-by: Michael Demmer <[email protected]>
tw.mu.Lock() | ||
for _, tAlias := range tabletAliases { | ||
if !tw.refreshKnownTablets { | ||
aliasStr := topoproto.TabletAliasString(tAlias) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you decided to not use the alias as the key instead. But healthcheck is still using TabletToMapKey. How do the two coordinate correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TopologyWatcher now uses the alias as the key for the internal tablets
map and all the temporary data structures.
When it calls into healthcheck to add/remove/replace the tablet, it passes the full tablet record. At that point HC recomputes its own hash key from the address map. I think we could (and probably should) switch that to store tablets keyed by the alias as well, but it's not necessary as part of this change.
This is good for me. @alainjobart can you eyeball? If you don't have the time, we can just merge. |
I also want to caveat that we haven’t tried this in an actual test deployment (yet).
…-m
On May 22, 2018, at 8:26 PM, Sugu Sougoumarane ***@***.***> wrote:
This is good for me. @alainjobart can you eyeball? If you don't have the time, we can just merge.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
This backports upstram PR vitessio#14693, with a few minor changes to make it work with the Go version we are using and a small change to topology_watcher.go so that test cases reflect and test for the same behavior as the upstream code. The description of the original PR follows: VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once. This PR does a few more things: * GetTabletsForCell now handles the case where the response size violates gRPC limits by falling back to one tablet at a time in case of error. * Previously, the one tablet at a time method had unlimited concurrency. In this PR we introduce a configuration option for concurrency. * We pass topoReadConcurrency from healthcheck into GetTabletsForCell. * The behavior of --refresh_known_tablets flag is different now. Previously we would not read those tablets at all, now we do read them, but ignore any changes if they are already known. The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1). We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching all tablets in one call to the topo.
* Backport Use GetTabletsByCell in healthcheck This backports upstram PR vitessio#14693, with a few minor changes to make it work with the Go version we are using and a small change to topology_watcher.go so that test cases reflect and test for the same behavior as the upstream code. The description of the original PR follows: VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once. This PR does a few more things: * GetTabletsForCell now handles the case where the response size violates gRPC limits by falling back to one tablet at a time in case of error. * Previously, the one tablet at a time method had unlimited concurrency. In this PR we introduce a configuration option for concurrency. * We pass topoReadConcurrency from healthcheck into GetTabletsForCell. * The behavior of --refresh_known_tablets flag is different now. Previously we would not read those tablets at all, now we do read them, but ignore any changes if they are already known. The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1). We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching all tablets in one call to the topo.
Overview
Adds an option to the topology watcher to reduce the topo k/v polling load in environments where the tablets never change their address/port information once launched.
Motivation
The TopologyWatcher module in vtgate is responsible for periodically checking the topo service to find out whether new tablets have been provisioned so that HealthCheck can be notified.
To support environments in which the tablet may change address/port information, i.e. when the association between a tablet alias and the host/port map isn't stable over time, the default behavior gets a list of all the tablet aliases and then re-reads the topo k/v for each tablet.
This operation is by far the majority of the k/v polling load from a vtgate, and as the cluster grows, the rate of k/v requests is
NumVtgates * NumVttablets / PollingInterval
, which grows quickly as the cluster grows.Changes
To reduce this load, this PR adds a
refreshKnownTablets
option to theTopologyWatcher
and a corresponding flag in discovery gateway. The default behavior is unchanged which means that each vtgate will periodically re-read the TabletInfo record for each tablet in case the address/port map changes.However the new flag can disable these queries for environments in which the association between a tablet alias and the host/port map never changes. This greatly reduces the load on the topo service since most of the k/v requests are for refreshing the TabletInfo and there's no efficient way to watch for this data.
Testing
I added extensive unit tests for this but have not (yet) verified in a real environment.