Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle some gas corner-cases #2537

Merged
merged 35 commits into from
Aug 29, 2022
Merged

Handle some gas corner-cases #2537

merged 35 commits into from
Aug 29, 2022

Conversation

ljoss17
Copy link
Contributor

@ljoss17 ljoss17 commented Aug 8, 2022

Closes: #2487

Description

Error 1: "Too much gas wanted"

For this issue Hermes report has been changed from "success" to "submitted" to avoid confusion.

Error 2: "out of gas in location" during CheckTx

A warning during the health-check is displayed if the gas_multiplier is lower than 1.1

Error 3: "out of gas in location" during DeliverTx

The interval between client refresh calls has been changed to be the client's trusting_period/100. In addition, the client refresh method has been updated in order to catch ChainError resulting events. The interval between refresh() calls which fail to update the client increases using the Fibonacci sequence.


PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).
  • Tagged one reviewer who will be the one responsible for shepherding this PR.

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

relayer/src/error.rs Outdated Show resolved Hide resolved
if let Some(ibc_events) = res.clone() {
match ibc_events
.into_iter()
.find(|e| e.event_type() == IbcEventType::ChainError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should document the assumption we're making.

Suggested change
.find(|e| e.event_type() == IbcEventType::ChainError)
// We make the assumption that we push a single `ChainError` variant
.find(|e| e.event_type() == IbcEventType::ChainError)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, alternatively, we could have strong type guarantees on this. CC @soareschen @romac

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like @soareschen, I think we should get rid of these ChainError events altogether and figure out a better way to bubble up such errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in #2565

self.dst_chain().id(),
ev,
));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many different execution branches and it's not clear what are the post-conditions of the refresh method. Can we briefly describe that?

Let's maybe also put a comment to describe what is this match arm for. There is also an Ok(res) and a Err(e branches.

@@ -14,6 +14,9 @@ use crate::{

use super::WorkerCmd;

const TRUSTING_PERIOD_DIVIDER: u32 = 100;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like for Tendermint clients we would refresh every ~3 hours if the trusting period is 18 days. This seems a bit too infrequest. We should retry more frequently.

@adizere
Copy link
Member

adizere commented Aug 11, 2022

The interval between client refresh calls has been changed to be the client's trusting_period/100. In addition, the client refresh method has been updated in order to catch ChainError resulting events.

Notes from discussing with Luca:

It seems that there are 3 cases:

  1. no update required at the moment -> the client worker can backoff with a large window eg several hours, a fraction of the refresh_period
  2. successful update -> similar to case 1, but the refresh window could be even larger, still a fraction of refresh_period
  3. failed to update -> the client worker should retry aggressively; if the errors keep coming up (such as it is signaled in Hermes corner-cases related to gas #2487 error (3)), then the backoff between retries can be increased (eg Fibonacci or linear increase, but not constant as it currently is).

@soareschen
Copy link
Contributor

@ljoss17 your changes are causing the integration tests to hang indefinitely. Can you merge #2567 to your branch and also fix the tests?

@ljoss17 ljoss17 marked this pull request as ready for review August 16, 2022 12:51
@ljoss17 ljoss17 added this to the v1.1 milestone Aug 18, 2022
@romac romac marked this pull request as draft August 18, 2022 15:12
// Check that the gas_multiplier is greater than or equal to 1.0
if let Some(gas_multiplier) = config.gas_multiplier {
if gas_multiplier < 1.0 {
return Err(Diagnostic::Error(Error::invalid_gas_multiplier(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take care to also remove the Error enum variant if it is no longer in use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thank you, I forgot to remove it.

Comment on lines -74 to -75
assert!(gas_multiplier >= 1.0);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep it here, just in case someone removes the checks in a refactoring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert! was replaced so that the test, in client_refresh.rs, using a gas_multiplier = 0.8 to trigger a failed refresh() runs correctly. The checks are now in the new GasMultiplier struct:
https://github.com/informalsystems/ibc-rs/blob/5c522c4d72659ceec7b81ee31ecc3dddf9460700/relayer/src/config/gas_multiplier.rs#L18-L27
I am not sure if there is a way to easily reproduce the failure while keeping a gas_multiplier >= 1.0.

fn gas_multiplier_from_config(config: &ChainConfig) -> f64 {
config.gas_multiplier.unwrap_or(DEFAULT_GAS_MULTIPLIER)
pub fn gas_multiplier_from_config(config: &ChainConfig) -> f64 {
config.gas_multiplier.unwrap_or_default().to_f64()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -258,6 +258,86 @@ pub mod memo {
}
}

pub use gas_multiplier::GasMultiplier;

pub mod gas_multiplier {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to create a separate module file, this is quite long already for an inline module.

relayer/src/config/types.rs Outdated Show resolved Hide resolved
relayer/src/config/types.rs Outdated Show resolved Hide resolved
@@ -459,6 +459,15 @@ define_error! {
e.chain_id, e.default_gas, e.max_gas)
},

ConfigValidationGasMultiplierLow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many ConfigValidationThisAndThat variants in this enum. Prefixing in Rust usually means we should refactor things in a more elegant and concise way.

I wonder if there is a ConfigValidationError enum hiding here, which would make one variant of relayer::Error and be also convertible to that with a From impl. Maybe not for this PR, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will open an issue for this

Ok(res) => {
// If elapsed < refresh_window for the client, `try_refresh()` will
// be successful with an empty vector.
if let Some(ibc_events) = res.clone() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could avoid cloning the whole collection here: if it destructures to Some(ibc_events), you already own it, if it's None, you don't need the res variable to return Ok(None). Can also use convenience methods here like Option::map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so I removed the clone() in commit 47a49b9, but I am not sure I understand how Option::map could be used here. Would the idea be to apply the .map() to res in order to avoid the Ok(None) when res is None?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see if res.map(|ibc_events| { ... }) could be more elegant here.

@seanchen1991 seanchen1991 marked this pull request as ready for review August 22, 2022 15:45
relayer/src/config/gas_multiplier.rs Outdated Show resolved Hide resolved
@@ -787,17 +787,17 @@ impl<DstChain: ChainHandle, SrcChain: ChainHandle> ForeignClient<DstChain, SrcCh
Ok(res) => {
// If elapsed < refresh_window for the client, `try_refresh()` will
// be successful with an empty vector.
if let Some(ibc_events) = res.clone() {
if let Some(ibc_events) = &res {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better, but I think you can just destructure res since this scope owns it and it's not used anywhere else.

Suggested change
if let Some(ibc_events) = &res {
if let Some(ibc_events) = res {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the issue I am having is that res is used at this line.

https://github.com/informalsystems/ibc-rs/blob/47a49b95c32b5cbf481d67d897b8ddba9fc51e6a/relayer/src/foreign_client.rs#L804

In order to have a more elegant refresh() method, a part of it was extracted in a function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can rewrite refresh as follows to avoid such nesting:

    pub fn refresh(&mut self) -> Result<Option<Vec<IbcEvent>>, ForeignClientError> {
        match self.try_refresh() {
            // If elapsed < refresh_window for the client, `try_refresh()` will
            // be successful with an empty vector.
            Ok(Some(ibc_events)) => Self::verify_chain_error(ibc_events, self.dst_chain().id()),

            // Return the successful empty vector.
            Ok(None) => Ok(None),

            // Return the error
            Err(e) => Err(e),
        }
    }

or by using the ? operator on the result of try_refresh:

    pub fn refresh(&mut self) -> Result<Option<Vec<IbcEvent>>, ForeignClientError> {
        // If elapsed < refresh_window for the client, `try_refresh()` will
        // be successful with an empty vector.
        match self.try_refresh()? {
            Some(events) => Self::verify_chain_error(events, self.dst_chain().id()),
            None => Ok(None),
        }
    }

or even like this, but I find the latter harder to read:

    pub fn refresh(&mut self) -> Result<Option<Vec<IbcEvent>>, ForeignClientError> {
        // If elapsed < refresh_window for the client, `try_refresh()` will
        // be successful with an empty vector.
        self.try_refresh().and_then(|opt_events| {
            opt_events
                .map(|events| Self::verify_chain_error(events, self.dst_chain().id()))
                .transpose()
                .map(Option::flatten)
        })
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative is to have verify_chain_error (renamed below to check_no_errors) return a Result<(), _>:

    pub fn refresh(&mut self) -> Result<Option<Vec<IbcEvent>>, ForeignClientError> {
        fn check_no_errors(
            ibc_events: &[IbcEvent],
            dst_chain_id: ChainId,
        ) -> Result<(), ForeignClientError> {
            // The assumption is that only one IbcEventType::ChainError will be
            // in the resulting Vec<IbcEvent> if an error occurred.
            let chain_error = ibc_events
                .iter()
                .find(|&e| e.event_type() == IbcEventType::ChainError);

            match chain_error {
                None => Ok(()),
                Some(ev) => Err(ForeignClientError::chain_error_event(
                    dst_chain_id,
                    ev.to_owned(),
                )),
            }
        }

        // If elapsed < refresh_window for the client, `try_refresh()` will
        // be successful with an empty vector.
        if let Some(events) = self.try_refresh()? {
            check_no_errors(&events, self.dst_chain().id())?;
            Ok(Some(events))
        } else {
            Ok(None)
        }
    }

Copy link
Member

@romac romac Aug 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I like the latter solution the best, as it seems a bit weird to me to have veriy_chain_errors return an Option<Vec<_>> where the Option will always be Some, ie. the signature does not really tell us what is going on, and could also have us think that the events that are returned are not the very same than the one which are passed in.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed solutions are great, and I agree that the last one is better, as the check_no_errors makes it more understandable.

relayer/src/foreign_client.rs Outdated Show resolved Hide resolved
relayer/src/foreign_client.rs Outdated Show resolved Hide resolved
@mzabaluev mzabaluev self-requested a review August 25, 2022 10:51
Copy link
Contributor

@mzabaluev mzabaluev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@romac romac merged commit c1a5fe7 into master Aug 29, 2022
@romac romac deleted the 2487_gas_fixes branch August 29, 2022 15:00
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hermes corner-cases related to gas
5 participants