fix: fix retries and metrics #20
Conversation
Thanks for fixing this! I suggest that the failed headers counter be incremented only once, after the last retry has failed.
There could be a case where an RPC node takes too long to answer a few queries, but after some retries the header gets relayed. In this scenario, incrementing the failed headers counter (perhaps even multiple times) doesn't make sense to me, as the header got relayed eventually.
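For illustration, here is a minimal sketch of the suggested behaviour. It assumes the avast/retry-go library and Prometheus counters; the function, label, and metric names are placeholders rather than the actual relayer code:

```go
package main

import (
	"errors"
	"log"
	"time"

	"github.com/avast/retry-go/v4"
	"github.com/prometheus/client_golang/prometheus"
)

// Placeholder counters mirroring the metrics discussed in this thread.
var (
	relayedHeaders = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "cosmos_relayer_relayed_headers", Help: "headers relayed"},
		[]string{"src", "dst"},
	)
	failedHeaders = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "cosmos_relayer_failed_headers", Help: "headers that could not be relayed"},
		[]string{"src", "dst"},
	)
)

// updateWithRetries retries a single update-client call and increments the
// failed-headers counter only once, after the last retry has failed; a header
// that is relayed after a few failed attempts is not counted as failed.
func updateWithRetries(srcID, dstID string, attempts uint, update func() error) error {
	err := retry.Do(
		update,
		retry.Attempts(attempts),
		retry.Delay(time.Second),
		retry.DelayType(retry.BackOffDelay),
	)
	if err != nil {
		failedHeaders.WithLabelValues(srcID, dstID).Inc() // all attempts exhausted
		return err
	}
	relayedHeaders.WithLabelValues(srcID, dstID).Inc() // relayed, possibly after retries
	return nil
}

func main() {
	// Simulate an endpoint that recovers on the third attempt.
	calls := 0
	update := func() error {
		calls++
		if calls < 3 {
			return errors.New("endpoint temporarily unavailable")
		}
		return nil
	}
	if err := updateWithRetries("babylon", "injective", 5, update); err != nil {
		log.Printf("giving up: %v", err)
	}
}
```

The key point is that `failedHeaders` is touched exactly once, and only after the final attempt has failed; a header that is relayed on a later attempt is never counted as failed.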
bbnrelayer/client.go (Outdated)
```diff
@@ -146,6 +153,8 @@ func (r *Relayer) KeepUpdatingClient(
 		// the endpoint of dst chain is temporarily unavailable
 		// TODO: distinguish unrecoverable errors
 	}
+
+	r.metrics.RelayedHeadersCounter.WithLabelValues(src.ChainID(), dst.ChainID()).Inc()
```
This is still reachable if `err != nil`, right?
Nice catch. Missed an else statement here 😅 fixed
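For reference, a minimal runnable sketch of the if/else fix; the real counter is replaced by a stand-in closure, so the names here are illustrative rather than the actual relayer code:

```go
package main

import (
	"errors"
	"log"
)

// recordOutcome sketches the fixed control flow: the relayed-headers counter
// must only be incremented when err == nil. relayedInc stands in for
// RelayedHeadersCounter.WithLabelValues(...).Inc().
func recordOutcome(err error, relayedInc func()) {
	if err != nil {
		// e.g. the endpoint of the dst chain is temporarily unavailable
		log.Printf("failed to update client: %v", err)
	} else {
		// only count the header as relayed when the update actually succeeded
		relayedInc()
	}
}

func main() {
	relayed := 0
	recordOutcome(errors.New("endpoint unreachable"), func() { relayed++ })
	recordOutcome(nil, func() { relayed++ })
	log.Printf("relayed headers counted: %d", relayed) // prints 1, not 2
}
```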
```diff
@@ -96,7 +96,7 @@ func keepUpdatingClientsCmd() *cobra.Command {
 	}
 
 	cmd.Flags().Duration("interval", time.Minute*10, "the interval between two update-client attempts")
-	cmd.Flags().Uint("retry", 20, "number of retry attempts for requests")
+	cmd.Flags().Uint("retry", 5, "number of retry attempts for requests")
```
I do not entirely understand how the number of retries affects this bug. What's the difference between having 5 and 20 when it comes to reproducing it?
Actually I don't fully understand this either. When I tested with an unreachable endpoint locally, if the value is small the program panics after some retries, which is expected; but if the value is big (I set 100) the program just gets stuck for a very long time. Meanwhile, from the log the backoff looks linear rather than exponential.
From the attached logfile, the backoff seems to be exponential.
Right, if that's the case then I believe exponential backoff is the problem: with 20 retries the wait time of the last retry can be 2^20 seconds...
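For a sense of scale, here is a small sketch that assumes a doubling backoff with a 1-second base delay (the actual base delay and any cap used by the relayer are not shown in this thread):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumption: a doubling backoff with a 1-second base delay, so the wait
	// before retry n is roughly base * 2^(n-1).
	base := time.Second
	var total time.Duration
	for n := 1; n <= 20; n++ {
		wait := base << (n - 1)
		total += wait
		if n == 5 || n == 20 {
			fmt.Printf("wait before retry %2d: %v (cumulative %v)\n", n, wait, total)
		}
	}
	// With 5 retries the relayer gives up within about half a minute; with 20,
	// the last wait alone is on the order of days, so the process looks stuck
	// instead of reporting a failed header.
}
```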
Actually, in the current implementation the counter increments only when all retries fail. Specifically, here.
Fixes BM-552

Context: https://github.com/babylonchain/infra/pull/73 and some offline discussions.

This PR fixes the issue that the `cosmos_relayer_failed_headers` metric never increases. It seems to be because the default number of retries in the code is too big, and during the retry process no log is produced. This PR reduces the default value to 5 and produces a log when each retry attempt fails. It also fixes a minor bug where the `RelayedHeadersCounter` increases even if the header is not relayed successfully.

Tested locally: the `cosmos_relayer_failed_headers` metric increases correctly for the Injective node that is unreachable, and each retry attempt now produces a log.
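As an aside, a minimal sketch of how per-attempt logging can be attached to a retry loop; it assumes the avast/retry-go library, and the error message, attempt count, and logger here are placeholders rather than the actual relayer code:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/avast/retry-go/v4"
)

func main() {
	attempts := uint(5) // the new default introduced by this PR

	err := retry.Do(
		func() error {
			// stand-in for an update-client RPC against an unreachable endpoint
			return fmt.Errorf("dial tcp: connection refused")
		},
		retry.Attempts(attempts),
		retry.Delay(time.Second),
		retry.DelayType(retry.BackOffDelay),
		// log every failed attempt so that a stalled retry loop is visible
		retry.OnRetry(func(n uint, err error) {
			log.Printf("retry attempt %d/%d failed: %v", n+1, attempts, err)
		}),
	)
	if err != nil {
		// this is the point where cosmos_relayer_failed_headers would be incremented
		log.Printf("all %d attempts failed: %v", attempts, err)
	}
}
```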