Topology watcher refreshKnownTablets option #3965

demmer · 2018-05-22T15:37:07Z

Overview

Adds an option to the topology watcher to reduce the topo k/v polling load in environments where the tablets never change their address/port information once launched.

Motivation

The TopologyWatcher module in vtgate is responsible for periodically checking the topo service to find out whether new tablets have been provisioned so that HealthCheck can be notified.

To support environments in which the tablet may change address/port information, i.e. when the association between a tablet alias and the host/port map isn't stable over time, the default behavior gets a list of all the tablet aliases and then re-reads the topo k/v for each tablet.

This operation accounts for by far the majority of the k/v polling load from a vtgate. The aggregate rate of k/v requests is NumVtgates * NumVttablets / PollingInterval, so it scales with the product of the vtgate and tablet counts and grows quickly as the cluster grows.

Changes

To reduce this load, this PR adds a refreshKnownTablets option to the TopologyWatcher and a corresponding flag in discovery gateway. The default behavior is unchanged, which means that each vtgate periodically re-reads the TabletInfo record for each tablet in case the address/port map changes.

However, the new flag can disable these queries for environments in which the association between a tablet alias and the host/port map never changes. This greatly reduces the load on the topo service, since most of the k/v requests are for refreshing the TabletInfo and there's no efficient way to watch for this data.
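The effect of the option can be sketched with a toy watcher. This is a minimal illustration, not the actual TopologyWatcher code: fakeTopo, watcher, poll, and readsAfter are hypothetical names, and the alias listing that each poll still performs is not counted as a per-tablet read.

```go
package main

import "fmt"

// fakeTopo is a hypothetical stand-in for the topo service that counts
// per-tablet k/v reads.
type fakeTopo struct {
	tablets map[string]string // alias -> "host:port"
	reads   int
}

func (t *fakeTopo) listAliases() []string {
	aliases := make([]string, 0, len(t.tablets))
	for alias := range t.tablets {
		aliases = append(aliases, alias)
	}
	return aliases
}

func (t *fakeTopo) getTablet(alias string) string {
	t.reads++
	return t.tablets[alias]
}

// watcher is a toy model of a topology watcher keyed by tablet alias.
type watcher struct {
	topo                *fakeTopo
	refreshKnownTablets bool
	known               map[string]string
}

// poll runs one polling pass. When refreshKnownTablets is false, an alias
// seen on a previous pass keeps its cached record and costs no k/v read.
func (w *watcher) poll() {
	next := make(map[string]string)
	for _, alias := range w.topo.listAliases() {
		if addr, ok := w.known[alias]; ok && !w.refreshKnownTablets {
			next[alias] = addr
			continue
		}
		next[alias] = w.topo.getTablet(alias)
	}
	w.known = next
}

// readsAfter returns the number of per-tablet reads after the given number
// of polls against a fixed, hypothetical set of three tablets.
func readsAfter(refreshKnownTablets bool, polls int) int {
	topo := &fakeTopo{tablets: map[string]string{
		"zone1-100": "host-a:15100",
		"zone1-101": "host-b:15100",
		"zone1-102": "host-c:15100",
	}}
	w := &watcher{topo: topo, refreshKnownTablets: refreshKnownTablets, known: map[string]string{}}
	for i := 0; i < polls; i++ {
		w.poll()
	}
	return topo.reads
}

func main() {
	fmt.Println("refresh on: ", readsAfter(true, 3))
	fmt.Println("refresh off:", readsAfter(false, 3))
}
```

In this sketch, three polls over three tablets cost nine reads with refreshing enabled and three with it disabled. The trade-off is the one described above: with refreshing disabled, an address change for an already-known alias goes unnoticed.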

Testing

I added extensive unit tests for this but have not (yet) verified in a real environment.

Using the newly added counters, add verification that the various operation counts occur as expected. This also required adding calls to topo.FixShardReplication to avoid differences in the operation counts between the two types of topology watchers. Signed-off-by: Michael Demmer <[email protected]>

Instead of tracking all the tablets by the TabletToMapKey value, use the alias as the key to all the data structures used in the scan comparisons. This leaves the behavior essentially unchanged, with one exception: when a tablet with a known alias changes the value of its address key. Previously the watcher would call AddTablet, then RemoveTablet; now it explicitly calls ReplaceTablet, which has the same net effect and seems more correct. Signed-off-by: Michael Demmer <[email protected]>

Add a refreshKnownTablets option for the TopologyWatcher and a corresponding flag in discovery gateway. The default behavior is unchanged, which means that each vtgate periodically re-reads the TabletInfo record for each tablet in case the address/port map changes. However, the new flag can disable these queries for environments in which the association between a tablet alias and the host/port map never changes. This greatly reduces the load on the topo service, since most of the k/v requests are for refreshing the TabletInfo and there's no efficient way to watch for this data. Signed-off-by: Michael Demmer <[email protected]>

sougou · 2018-05-23T02:27:30Z

go/vt/discovery/topology_watcher.go

+	tw.mu.Lock()
+	for _, tAlias := range tabletAliases {
+		if !tw.refreshKnownTablets {
+			aliasStr := topoproto.TabletAliasString(tAlias)


Looks like you decided to now use the alias as the key instead. But healthcheck is still using TabletToMapKey. How do the two coordinate correctly?

The TopologyWatcher now uses the alias as the key for the internal tablets map and all the temporary data structures.

When it calls into healthcheck to add/remove/replace the tablet, it passes the full tablet record. At that point HC recomputes its own hash key from the address map. I think we could (and probably should) switch that to store tablets keyed by the alias as well, but it's not necessary as part of this change.
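As a rough illustration of why the two keying schemes can coexist, here is a sketch of an address-derived key. addressKey is a hypothetical helper written for this example; it is not claimed to match the exact format that TabletToMapKey produces.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// addressKey derives a key from the hostname and port map only, so it can
// be recomputed from any full tablet record without knowing the alias.
// Ports are sorted by name so the key does not depend on map iteration order.
func addressKey(hostname string, portMap map[string]int32) string {
	parts := make([]string, 0, len(portMap))
	for name, port := range portMap {
		parts = append(parts, fmt.Sprintf("%s:%d", name, port))
	}
	sort.Strings(parts)
	return strings.Join(append([]string{hostname}, parts...), ",")
}

func main() {
	before := addressKey("host-a", map[string]int32{"vt": 15100, "grpc": 16100})
	after := addressKey("host-a", map[string]int32{"vt": 15200, "grpc": 16100})
	// The alias is the same tablet in both cases, but the derived keys
	// differ, which is why a replace needs the full record.
	fmt.Println(before)
	fmt.Println(after)
}
```

Because the key is a pure function of the record, the watcher can track tablets by alias internally and still hand healthcheck everything it needs to compute its own key.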

sougou · 2018-05-23T03:26:09Z

This is good for me. @alainjobart can you eyeball? If you don't have the time, we can just merge.

demmer · 2018-05-23T03:28:18Z

I also want to caveat that we haven’t tried this in an actual test deployment (yet).

Backport Use GetTabletsByCell in healthcheck

This backports upstream PR vitessio#14693, with a few minor changes to make it work with the Go version we are using and a small change to topology_watcher.go so that test cases reflect and test for the same behavior as the upstream code. The description of the original PR follows:

VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once. This PR does a few more things:

* GetTabletsForCell now handles the case where the response size violates gRPC limits by falling back to one tablet at a time in case of error.
* Previously, the one tablet at a time method had unlimited concurrency. In this PR we introduce a configuration option for concurrency.
* We pass topoReadConcurrency from healthcheck into GetTabletsForCell.
* The behavior of the --refresh_known_tablets flag is different now. Previously we would not read those tablets at all; now we do read them, but ignore any changes if they are already known.

The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1). We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching all tablets in one call to the topo.

demmer added 5 commits May 20, 2018 21:34

add a counters.ZeroAll helper method for tests

67a9490

Unlike ResetAll, ZeroAll keeps all the same keys in the map but changes all the values to zero. Signed-off-by: Michael Demmer <[email protected]>
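The difference between the two helpers can be shown with a toy counter set. The counters type below is a hypothetical stand-in written for this example; it only models the semantics stated in the commit message and is not the Vitess stats implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// counters is a hypothetical, simplified set of named counters.
type counters struct {
	mu     sync.Mutex
	counts map[string]int64
}

func newCounters() *counters {
	return &counters{counts: make(map[string]int64)}
}

// Add increments the named counter, creating it if needed.
func (c *counters) Add(name string, delta int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[name] += delta
}

// ResetAll drops every key, leaving an empty map.
func (c *counters) ResetAll() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts = make(map[string]int64)
}

// ZeroAll keeps every existing key and sets its value to zero.
func (c *counters) ZeroAll() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for name := range c.counts {
		c.counts[name] = 0
	}
}

// Counts returns a copy of the current values.
func (c *counters) Counts() map[string]int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]int64, len(c.counts))
	for name, v := range c.counts {
		out[name] = v
	}
	return out
}

func main() {
	c := newCounters()
	c.Add("ListTablets", 1)
	c.Add("GetTablet", 3)

	c.ZeroAll()
	fmt.Println(len(c.Counts())) // keys survive, values are zero

	c.ResetAll()
	fmt.Println(len(c.Counts())) // keys are gone
}
```

Keeping the keys is convenient in tests that compare whole count maps between steps, since an operation that did not happen shows up as an explicit zero instead of a missing entry.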

pass the refreshKnownTablets option in newRealtimeStats

c7ed4f1

Signed-off-by: Michael Demmer <[email protected]>

demmer requested review from alainjobart and sougou May 22, 2018 15:58

sougou reviewed May 23, 2018

demmer mentioned this pull request May 29, 2018

TopologyWatcher removes the tablet record when polling the topo service fails #3987

Closed

sougou approved these changes May 29, 2018

demmer merged commit 6b81f05 into vitessio:master May 29, 2018

deepthi mentioned this pull request Dec 6, 2023

Use GetTabletsByCell in healthcheck #14693

Merged

4 tasks

ejortegau mentioned this pull request Sep 16, 2024

Backport 14693 - Use GetTabletsByCell in healthcheck slackhq/vitess#514

Merged
