Support Dynamic Cluster IP Addresses in Failure Scenarios (#495)
Adds a retry mechanism to RingPop bootstrap: if bootstrap fails, we retry up to 5 more times before crashing the process, refreshing the bootstrap list prior to each retry. We suspect (and were able to repro on onebox) that a node is unable to join a ringpop cluster if all of the supplied seed nodes are invalid.

Background: Our bootstrap logic relies on nodes in a Temporal cluster periodically writing their Host:Ports to a table. In a cluster that is cold-starting, all of those written IP addresses may no longer be valid, so no node would be able to start until those heartbeats expire. Furthermore, a node would write its own heartbeat, fail to start, immediately recycle, and potentially get a new IP address. The heartbeat it just wrote is then no longer valid, which negatively impacts other nodes (and itself) in the same way. This means that the situation could never stabilize.

This fix retries refreshing the bootstrap list and joining the RingPop cluster up to 5 additional times without recycling the process. The node continues to write its heartbeats during this process. This widens the window of time in which this node is discoverable by other nodes (and vice versa) and ensures that our retries use the freshest bootstrap list possible.

Because this issue reproduces on onebox, we were able to write unit tests and test locally to verify that the retry logic works and that bootstrap can be invoked on the same ringpop object multiple times without any fear of repercussions (its internal initialization code is also idempotent). We also inspected the ringpop library code to validate that 1) our understanding of the problem is correct and 2) multiple bootstrap retries would work. This has not been explicitly verified on staging, but that can be done after the merge to master given the low risk.
The risk here is low: this addresses a situation where the cluster degenerates into an unstable state. It does not affect the happy path (e.g. first-time startup, single-node cluster startup, stable cluster startup). In the worst case, this fix doesn't solve the problem and the cluster is still unhealthy and fails to start.
1 parent eb79edc, commit 1d4a36c
Showing 3 changed files with 75 additions and 37 deletions.