Single Node / Dev Cluster starts emitting RPC errors after 2 minutes of running #8401
Labels: theme/internals (Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics), type/bug (Feature does not function as expected)
The RPC router will rebalance the server lists on a scaled interval to rotate which server should be used for the next RPC in that DC. In a single node cluster, or when running `consul agent -dev`, this happens after 2 minutes of running.

#7735 added this bit of code into the rebalancing method:
consul/agent/router/manager.go
Lines 352 to 356 in 4c8a15b
It checks if the server in the list is itself and skips the check if so. Then further down in the method we set the online/offline status:
consul/agent/router/manager.go
Lines 369 to 377 in 4c8a15b
The problem is that for single node clusters we never set `foundHealthyServer`, so after 2 minutes we unconditionally mark that DC as failed.

Once the DC has been marked as offline, some RPCs will start to fail. In particular, a keyring listing via the CLI fails with a message saying "Remote DC has no server currently reachable".
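To make the failure mode concrete, here is a minimal, self-contained sketch of the logic described above. It is not the code from `manager.go`: the `server` type, the `healthy` field (standing in for a ping result), and both function names are hypothetical. The second function shows one possible adjustment, which may differ from whatever fix is actually adopted.

```go
package main

import "fmt"

// server is a hypothetical stand-in for an entry in the router's server list.
type server struct {
	name    string
	healthy bool // assumed stand-in for the result of pinging the server
}

// findHealthyServer mirrors the shape of the loop described in the issue:
// the local server is skipped, so it can never set foundHealthyServer.
func findHealthyServer(self string, servers []server) bool {
	foundHealthyServer := false
	for _, s := range servers {
		if s.name == self {
			continue // skip checking ourselves
		}
		if s.healthy {
			foundHealthyServer = true
			break
		}
	}
	return foundHealthyServer
}

// findHealthyServerCountingSelf is one possible adjustment (an assumption,
// not necessarily the upstream fix): the local server counts as healthy.
func findHealthyServerCountingSelf(self string, servers []server) bool {
	for _, s := range servers {
		if s.name == self || s.healthy {
			return true
		}
	}
	return false
}

func main() {
	// A single node cluster: the only server in the list is the local one.
	single := []server{{name: "node1", healthy: true}}

	fmt.Println(findHealthyServer("node1", single))             // false: DC would be marked offline
	fmt.Println(findHealthyServerCountingSelf("node1", single)) // true
}
```

With more than one server in the list, both variants report a healthy server as long as at least one remote server responds, which is why the problem only shows up on single node and `-dev` clusters.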