fix: fix retries and metrics #20

Merged 3 commits into dev on Mar 1, 2023
Conversation

@SebastianElvis (Member) commented on Mar 1, 2023:

Fixes BM-552
Context: https://github.com/babylonchain/infra/pull/73 and some offline discussions.

This PR fixes the issue that the cosmos_relayer_failed_headers metric never increases. The cause seems to be that the default number of retries in the code is too big, and no log is produced during the retry process. This PR reduces the default value to 5 and produces a log when each retry attempt fails. It also fixes a minor bug where RelayedHeadersCounter increases even when the header is not relayed successfully.

Tested locally: I can see the cosmos_relayer_failed_headers metric increasing correctly for the Injective node that is unreachable.

Each retry attempt now also produces a log entry.

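The overall shape of the change can be sketched as a small, self-contained helper; the names here (retryWithLog, attempts, initialDelay) are illustrative and not the repository's actual API. Every failed attempt is logged, and the error surfaces to the caller only after the last attempt fails, which is what lets the failed-headers metric fire:

package relayer

import (
	"log"
	"time"
)

// retryWithLog is an illustrative helper, not the repository's actual API.
// It runs fn up to attempts times, logs every failed attempt, and returns
// the last error only after all attempts have failed, so a caller can
// increment a failed-headers counter once per exhausted retry loop.
func retryWithLog(attempts uint, initialDelay time.Duration, fn func() error) error {
	var err error
	delay := initialDelay
	for i := uint(1); i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		log.Printf("retry attempt %d/%d failed: %v", i, attempts, err)
		if i < attempts {
			time.Sleep(delay)
			delay *= 2 // exponential backoff between attempts
		}
	}
	return err
}

A caller would then increment RelayedHeadersCounter only on a nil return and the failed-headers counter otherwise, which is the behaviour discussed in the review below.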

@filippos47 (Contributor) left a comment:

Thanks for fixing this! I suggest incrementing the failed headers counter only once, after the last retry has failed.
There could be a case where an RPC node takes too long to answer a few queries, but after some retries the header gets relayed. In that scenario, incrementing the failed headers counter (perhaps even multiple times) doesn't make sense to me, as the header got relayed eventually.

@@ -146,6 +153,8 @@ func (r *Relayer) KeepUpdatingClient(
// the endpoint of dst chain is temporarily unavailable
// TODO: distinguish unrecoverable errors
}

r.metrics.RelayedHeadersCounter.WithLabelValues(src.ChainID(), dst.ChainID()).Inc()
Member commented:

This is still reachable if err != nil, right?

@SebastianElvis (Member, Author) commented on Mar 1, 2023:

Nice catch, I missed an else statement here 😅. Fixed.
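A minimal sketch of the corrected branch, assuming the failed-headers metric is exposed on r.metrics as a field named FailedHeadersCounter (that field name is a guess; only RelayedHeadersCounter appears in the diff above):

if err != nil {
	// all retries failed; the dst endpoint may be temporarily unavailable
	r.metrics.FailedHeadersCounter.WithLabelValues(src.ChainID(), dst.ChainID()).Inc()
} else {
	// count the header as relayed only when no error occurred
	r.metrics.RelayedHeadersCounter.WithLabelValues(src.ChainID(), dst.ChainID()).Inc()
}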

@@ -96,7 +96,7 @@ func keepUpdatingClientsCmd() *cobra.Command {
}

cmd.Flags().Duration("interval", time.Minute*10, "the interval between two update-client attempts")
cmd.Flags().Uint("retry", 20, "number of retry attempts for requests")
cmd.Flags().Uint("retry", 5, "number of retry attempts for requests")
Member commented:

I don't entirely understand how the number of retries affects this bug. What difference does having 5 rather than 20 make in whether the bug reproduces?

@SebastianElvis (Member, Author) replied:

Actually, I don't fully understand this either. When I tested locally with an unreachable endpoint, a small value made the program panic after the retries were exhausted, which is expected; a big value (I set 100) made the program appear stuck for a very long time. Meanwhile, from the log the backoff looked linear rather than exponential.

Contributor replied:

From the attached logfile, the backoff seems to be exponential.

@SebastianElvis (Member, Author) replied:

Right, if that is the case then I believe exponential backoff is the problem: with 20 retries the wait before the last attempt can be on the order of 2^20 seconds...
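To make the difference concrete, a back-of-the-envelope sketch assuming a 1-second initial delay that doubles after every failed attempt, with no maximum-delay cap (the real retry settings may differ):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumption: 1s initial delay that doubles after each failed attempt, no cap.
	for _, attempts := range []int{5, 20} {
		delay := time.Second
		var total time.Duration
		for i := 0; i < attempts; i++ {
			total += delay
			delay *= 2
		}
		fmt.Printf("%2d attempts: total backoff %v (last delay %v)\n", attempts, total, delay/2)
	}
}

Under these assumptions, 5 attempts wait about 31 seconds in total, while 20 attempts wait roughly 2^20 seconds, i.e. about 12 days, which matches the process appearing stuck rather than panicking.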

@SebastianElvis (Member, Author) commented:

> Thanks for fixing this! I suggest incrementing the failed headers counter only once, after the last retry has failed. There could be a case where an RPC node takes too long to answer a few queries, but after some retries the header gets relayed. In that scenario, incrementing the failed headers counter (perhaps even multiple times) doesn't make sense to me, as the header got relayed eventually.

Actually, in the current implementation the counter increments only when all retries fail. Specifically, UpdateClient returns an error only when all retries of one of its retryable function invocations have failed.

@SebastianElvis merged commit 6b7e5fe into dev on Mar 1, 2023