fix: fix retries and metrics #20

Merged 3 commits into dev on Mar 1, 2023
Conversation

@SebastianElvis (Member) commented on Mar 1, 2023:

Fixes BM-552
Context: https://github.com/babylonchain/infra/pull/73 and some offline discussions.

This PR fixes the issue that the cosmos_relayer_failed_headers metric never increases. The cause seems to be that the default number of retries in the code is too big, and no log is produced during the retry process. This PR reduces the default value to 5 and produces a log when each retry attempt fails. It also fixes a minor bug where RelayedHeadersCounter increases even when the header is not relayed successfully.

Tested locally: I can see the cosmos_relayer_failed_headers metric increasing correctly for the Injective node that is unreachable.

Each retry attempt now also produces a log entry.

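The overall shape of the change can be sketched as a small, self-contained helper; the names here (retryWithLog, attempts, initialDelay) are illustrative and not the repository's actual API. Every failed attempt is logged, and the error surfaces to the caller only after the last attempt fails, which is what lets the failed-headers metric fire:

package relayer

import (
	"log"
	"time"
)

// retryWithLog is an illustrative helper, not the repository's actual API.
// It runs fn up to attempts times, logs every failed attempt, and returns
// the last error only after all attempts have failed, so a caller can
// increment a failed-headers counter once per exhausted retry loop.
func retryWithLog(attempts uint, initialDelay time.Duration, fn func() error) error {
	var err error
	delay := initialDelay
	for i := uint(1); i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		log.Printf("retry attempt %d/%d failed: %v", i, attempts, err)
		if i < attempts {
			time.Sleep(delay)
			delay *= 2 // exponential backoff between attempts
		}
	}
	return err
}

A caller would then increment RelayedHeadersCounter only on a nil return and the failed-headers counter otherwise, which is the behaviour discussed in the review below.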

@filippos47 (Contributor) left a comment:

Thanks for fixing this! I suggest incrementing the failed headers counter only once, after the last retry has failed.
There could be a case where an RPC node takes too long to answer a few queries, but after some retries the header gets relayed. In that scenario, incrementing the failed headers counter (perhaps even multiple times) doesn't make sense to me, as the header got relayed eventually.

@@ -146,6 +153,8 @@ func (r *Relayer) KeepUpdatingClient(
// the endpoint of dst chain is temporarily unavailable
// TODO: distinguish unrecoverable errors
}

r.metrics.RelayedHeadersCounter.WithLabelValues(src.ChainID(), dst.ChainID()).Inc()
Member commented:

This is still reachable if err != nil, right?

@SebastianElvis (Member, Author) commented on Mar 1, 2023:

Nice catch, I missed an else statement here 😅. Fixed.
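A minimal sketch of the corrected branch, assuming the failed-headers metric is exposed on r.metrics as a field named FailedHeadersCounter (that field name is a guess; only RelayedHeadersCounter appears in the diff above):

if err != nil {
	// all retries failed; the dst endpoint may be temporarily unavailable
	r.metrics.FailedHeadersCounter.WithLabelValues(src.ChainID(), dst.ChainID()).Inc()
} else {
	// count the header as relayed only when no error occurred
	r.metrics.RelayedHeadersCounter.WithLabelValues(src.ChainID(), dst.ChainID()).Inc()
}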

@@ -96,7 +96,7 @@ func keepUpdatingClientsCmd() *cobra.Command {
}

cmd.Flags().Duration("interval", time.Minute*10, "the interval between two update-client attempts")
cmd.Flags().Uint("retry", 20, "number of retry attempts for requests")
cmd.Flags().Uint("retry", 5, "number of retry attempts for requests")
Member commented:

I don't entirely understand how the number of retries affects this bug. What difference does having 5 rather than 20 make in whether the bug reproduces?

@SebastianElvis (Member, Author) replied:

Actually, I don't fully understand this either. When I tested locally with an unreachable endpoint, a small value made the program panic after the retries were exhausted, which is expected; a big value (I set 100) made the program appear stuck for a very long time. Meanwhile, from the log the backoff looked linear rather than exponential.

Contributor replied:

From the attached logfile, the backoff seems to be exponential.

@SebastianElvis (Member, Author) replied:

Right, if that is the case then I believe exponential backoff is the problem: with 20 retries the wait before the last attempt can be on the order of 2^20 seconds...
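To make the difference concrete, a back-of-the-envelope sketch assuming a 1-second initial delay that doubles after every failed attempt, with no maximum-delay cap (the real retry settings may differ):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumption: 1s initial delay that doubles after each failed attempt, no cap.
	for _, attempts := range []int{5, 20} {
		delay := time.Second
		var total time.Duration
		for i := 0; i < attempts; i++ {
			total += delay
			delay *= 2
		}
		fmt.Printf("%2d attempts: total backoff %v (last delay %v)\n", attempts, total, delay/2)
	}
}

Under these assumptions, 5 attempts wait about 31 seconds in total, while 20 attempts wait roughly 2^20 seconds, i.e. about 12 days, which matches the process appearing stuck rather than panicking.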

@SebastianElvis (Member, Author) commented:

> Thanks for fixing this! I suggest incrementing the failed headers counter only once, after the last retry has failed. There could be a case where an RPC node takes too long to answer a few queries, but after some retries the header gets relayed. In that scenario, incrementing the failed headers counter (perhaps even multiple times) doesn't make sense to me, as the header got relayed eventually.

Actually, in the current implementation the counter increments only when all retries fail. Specifically, UpdateClient returns an error only when all retries of one of its retryable function invocations have failed.

@SebastianElvis merged commit 6b7e5fe into dev on Mar 1, 2023