Single Node / Dev Cluster starts emitting RPC errors after 2 minutes of running #8401

mkeeler · 2020-07-29T19:11:10Z

The RPC router rebalances its server lists on a scaled interval to rotate which server is used for the next RPC in that DC. In a single-node cluster, or when running consul agent -dev, this happens after 2 minutes of running.

#7735 added this bit of code into the rebalancing method:

consul/agent/router/manager.go

Lines 352 to 356 in 4c8a15b

    
    // check to see if the manager is trying to ping itself,
    // continue if that is the case.
    if m.serverName != "" && srv.Name == m.serverName {
    	continue
    }

It checks whether the server in the list is the manager itself and, if so, skips the ping check. Further down in the method, we set the online/offline status:

consul/agent/router/manager.go

Lines 369 to 377 in 4c8a15b

    
    // If no healthy servers were found, sleep and wait for Serf to make
    // the world a happy place again. Update the offline status.
    if foundHealthyServer {
    	atomic.StoreInt32(&m.offline, 0)
    } else {
    	atomic.StoreInt32(&m.offline, 1)
    	m.logger.Debug("No healthy servers during rebalance, aborting")
    	return
    }

The problem is that in a single-node cluster the only server in the list is the manager itself, so the loop skips it and foundHealthyServer is never set. After 2 minutes, the rebalance therefore unconditionally marks that DC as offline.
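The interaction between the two snippets can be reproduced with a small standalone model. This is only a sketch: the server type, the rebalanceOffline function, and the ping callback are hypothetical stand-ins, not the real manager.go code.

```go
package main

import "fmt"

// server is a simplified stand-in for the router's server metadata.
type server struct{ Name string }

// rebalanceOffline models the two quoted snippets. It returns true when the
// manager would mark the DC offline. ping is a hypothetical health check.
func rebalanceOffline(self string, servers []server, ping func(server) bool) bool {
	foundHealthyServer := false
	for _, srv := range servers {
		// Same self-check as the quoted code: never ping ourselves.
		if self != "" && srv.Name == self {
			continue
		}
		if ping(srv) {
			foundHealthyServer = true
			break
		}
	}
	return !foundHealthyServer
}

func main() {
	alwaysUp := func(server) bool { return true }

	single := []server{{Name: "node1"}}
	multi := []server{{Name: "node1"}, {Name: "node2"}}

	// With one server, the only entry is skipped, so nothing is ever healthy.
	fmt.Println("single node offline:", rebalanceOffline("node1", single, alwaysUp))
	fmt.Println("two nodes offline:", rebalanceOffline("node1", multi, alwaysUp))
}
```

In this model the single-node case is reported offline even though every ping would succeed, because the loop never reaches a ping.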

Once the DC is marked as offline, some RPCs start to fail. In particular, a keyring listing via the CLI fails with the message "Remote DC has no server currently reachable".


This code started as an optimization to avoid doing an RPC Ping to itself. But in a single server cluster the rebalancing was led to believe that there were no healthy servers because foundHealthyServer was not set. Now this is being set properly. Fixes #8401 and #8403.
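A minimal sketch of what "setting it properly" could look like follows. It assumes the fix treats the local server as healthy without pinging it. The actual diff in #8406 is not shown here, so the function name rebalanceOfflineFixed, the server type, and the ping callback are all hypothetical.

```go
package main

import "fmt"

// server is a simplified stand-in for the router's server metadata.
type server struct{ Name string }

// rebalanceOfflineFixed returns true when the DC would be marked offline.
// Assumed shape of the fix: the local server counts as healthy, and the
// self-ping is still avoided.
func rebalanceOfflineFixed(self string, servers []server, ping func(server) bool) bool {
	foundHealthyServer := false
	for _, srv := range servers {
		if self != "" && srv.Name == self {
			// We are running, so we are healthy; no RPC ping needed.
			foundHealthyServer = true
			break
		}
		if ping(srv) {
			foundHealthyServer = true
			break
		}
	}
	return !foundHealthyServer
}

func main() {
	// A ping that always fails shows that the local server is never pinged.
	neverUp := func(server) bool { return false }

	single := []server{{Name: "node1"}}
	fmt.Println("single node offline:", rebalanceOfflineFixed("node1", single, neverUp))
	fmt.Println("empty list offline:", rebalanceOfflineFixed("node1", nil, neverUp))
}
```

The optimization from the original code is preserved in this sketch: the manager still never pings itself, it just no longer concludes that the DC is unreachable.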

mkeeler mentioned this issue Jul 29, 2020

forwardRPC and globalRPC do not take LAN ServerLookup into account when the target datacenter is the local datacenter #8403

Closed

dnephin added the theme/internals and type/bug labels Jul 29, 2020

hanshasselberg mentioned this issue Jul 30, 2020

Mark its own cluster as healthy when rebalancing. #8406

Merged

hanshasselberg closed this as completed in #8406 Aug 6, 2020
