
Packets not clearing #1918

Closed · RohitAudit opened this issue Feb 24, 2022 · 13 comments

Labels: I: dependencies (Internal: related to dependencies), I: logic (Internal: related to the relaying logic), I: rpc (Internal: related to (g)RPC), O: code-hygiene (Objective: cause to improve code hygiene)

RohitAudit commented Feb 24, 2022

Summary of Bug

I am using the Hermes relayer between the Gravity Bridge and Persistence chains. I bridged some tokens from Ethereum to Gravity Bridge and was trying to relay them to the Persistence chain. The relayer was stopped while roughly 6000 transactions were sent. When the relayer was started again, all packets were stuck and nothing was getting transmitted. I restarted the relayer after an hour, but the situation is still the same.
Link to the transaction

Debug Logs Printed:

2022-02-24T16:31:49.931978Z  INFO ThreadId(01) Hermes has started
2022-02-24T16:31:57.392575Z DEBUG ThreadId(1529) packet_cmd{src_chain=gravity-bridge-3 src_port=transfer src_channel=channel-24 dst_chain=core-1}:clear: packet seq. that still have commitments on gravity-bridge-3: 227, 228, 229, 230, 231, 232, 233, 234, 235, 236 (first 10 shown here; total=6029)
2022-02-24T16:31:57.392701Z DEBUG ThreadId(1529) packet_cmd{src_chain=gravity-bridge-3 src_port=transfer src_channel=channel-24 dst_chain=core-1}:clear: recv packets to send out to core-1 of the ones with commitments on gravity-bridge-3: 227, 228, 229, 230, 231, 232, 233, 234, 235, 236 (first 10 shown here; total=6029)

Version

v0.11.0 and v0.12.0

Steps to Reproduce

Start a relayer between Gravity Bridge and Persistence, accumulate more than 6000 pending transactions while the relayer is stopped, and then start the relayer again so that it tries to clear the backlog of packets.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
romac (Member) commented Feb 24, 2022

You might be able to clear all the packets on a given channel using the clear packets command added in v0.12.0: https://hermes.informal.systems/commands/relaying/clear.html
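For reference, a rough invocation of that command for the channel from the logs above; the exact argument syntax depends on the Hermes version, so double-check the output of hermes clear packets --help or the linked docs before running it:

    hermes clear packets gravity-bridge-3 transfer channel-24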

Nonetheless, it is indeed not normal that Hermes is not clearing any packets on start.

As a sanity check, do you have clear_on_start = true in the config?
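For context, packet clearing is controlled from the [mode.packets] section of Hermes's config.toml. A minimal sketch with illustrative values (the exact layout may differ slightly between v0.11 and v0.12):

    [mode.packets]
    enabled = true
    clear_interval = 100   # also attempt clearing every 100 blocks
    clear_on_start = true
    tx_confirmation = true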

RohitAudit (Author) commented Feb 24, 2022

Yes, the config is correct. I also used the clear packets command added in v0.12.0; it produced the same output and no extra logs for an hour. Should I just leave it running for a few hours to see whether it clears?

Can Hermes clear the packets in batches of a size selected by us?

RohitAudit (Author)

Update:
I left the relayer running for almost 4-5 hours. Some of the transactions went through, but relaying is really slow: it took the relayer more than 2 hours to complete the first transaction.
Some of the error logs I got:

Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.682903Z DEBUG ThreadId(17) send_tx_check{id=L6pdf312c0}:send_tx{id=gravity-bridge-3}: send_tx: using 2162713 gas, fee Fee { amount: "0ugraviton", gas_limit: 4325426 } id=gravity-bridge-3
Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.698368Z DEBUG ThreadId(17) send_tx_check{id=L6pdf312c0}:send_tx{id=gravity-bridge-3}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(74DE7A2A6B5E948EF0C3A5CDBA639A7389542EAFA28C5348CACB4527F300EC0A) }
Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.698614Z DEBUG ThreadId(17) send_tx_check{id=L6pdf312c0}:send_tx{id=gravity-bridge-3}: sending 39 messages using account sequence 188
Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.698986Z DEBUG ThreadId(17) send_tx_check{id=L6pdf312c0}:send_tx{id=gravity-bridge-3}: max fee, for use in tx simulation: Fee { amount: "0ugraviton", gas_limit: 200000000 }
Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.706055Z ERROR ThreadId(17) send_tx_check{id=L6pdf312c0}:send_tx{id=gravity-bridge-3}:estimate_gas: failed to simulate tx. propagating error to caller: gRPC call failed with status: status: Unknown, message: "failed to execute message; message index: 0: timeout packet verification failed: clientID (07-tendermint-39), height (1-4963470): consensus state not found", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "993762"} }
Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.714566Z ERROR ThreadId(1546) packet_cmd{src_chain=gravity-bridge-3 src_port=transfer src_channel=channel-24 dst_chain=core-1}: will retry: schedule execution encountered error: failed with underlying error: gRPC call failed with status: status: Unknown, message: "failed to execute message; message index: 0: timeout packet verification failed: clientID (07-tendermint-39), height (1-4963470): consensus state not found", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "993762"} } retry_index=1
Feb 25 04:45:26 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:26.935518Z DEBUG ThreadId(1546) packet_cmd{src_chain=gravity-bridge-3 src_port=transfer src_channel=channel-24 dst_chain=core-1}:clear: packet seq. that still have commitments on gravity-bridge-3: 261, 262, 263, 264, 265, 266, 267, 268, 269, 270 (first 10 shown here; total=5998)

Can you suggest a solution to this?

adizere (Member) commented Feb 25, 2022

Wow, this is an interesting case, thanks for flagging it, @RohitAudit!

left the relayer for almost 4-5 hours. Some of the transaction went through but are really slow. It took relayer more than 2 hours to do the first transaction.

This happens for two reasons:

  1. Full node queries are slow, so assembling 6K txes (depending on the full node specs) can take a long time.
  2. Hermes has no pagination/batching support, so it tries to naively pull the packet data for all 6K transactions before relaying any single one of them.

One thing we can do is to add pagination support in Hermes (ref: #1816).
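To illustrate what paginated clearing would look like at the query level: the pending packet commitments can be fetched page by page straight from the chain's gRPC endpoint instead of all at once. A hypothetical grpcurl call against a gravity-bridge-3 node (placeholder host/port, and it assumes the node exposes gRPC reflection):

    grpcurl -plaintext \
      -d '{"port_id":"transfer","channel_id":"channel-24","pagination":{"limit":"100"}}' \
      gravity-grpc.example.com:9090 \
      ibc.core.channel.v1.Query/PacketCommitments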

Even with pagination support, I'm not 100% sure about the error "consensus state not found" in your last comment. We're still looking into that. Can you tell us which SDK version gravity-bridge-3 is running?
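As a hypothetical way to see which consensus states the 07-tendermint-39 client actually has stored on gravity-bridge-3 (placeholder host/port, assumes gRPC reflection is enabled):

    grpcurl -plaintext \
      -d '{"client_id":"07-tendermint-39","pagination":{"limit":"20"}}' \
      gravity-grpc.example.com:9090 \
      ibc.core.client.v1.Query/ConsensusStates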

adizere added the O: code-hygiene, I: dependencies, I: logic, and I: rpc labels on Feb 25, 2022
adizere added this to the v0.14.0 milestone on Feb 25, 2022
RohitAudit (Author)

gravity-bridge-3 uses cosmos-sdk v0.44.5 and ibc v2.0.2.
Also, if I connect the relayer to archival nodes, will that make the transactions faster?

adizere (Member) commented Feb 25, 2022

gravity-bridge-3 uses cosmos-sdk v0.44.5 and ibc v2.0.2.
Also, if I connect the relayer to archival nodes, will that make the transactions faster?

Thanks for the quick feedback!

Regarding archival nodes: I think they may speed things up, yes (especially if the nodes also have slightly higher-compute specs), but I will ask some relayer operators whether they have ideas for potential optimizations.

faddat (Contributor) commented Feb 25, 2022

First of all, great work. Testing the network like this is the only way we improve much of anything.

Secondly, having an archive node can help. With that said, you will need very fast hard drives for that to work. Something our team does when we encounter congestion is run both the Go relayer and Hermes.

They seem to complement one another in practice, in ways that we don't fully understand yet but have definitely seen in action.

If you do that, you will want two separate wallets: one for the Go relayer and one for Hermes.

Concerning speed, my only consistent piece of advice is that you want really fast drives.

clemensgg

In my experience, querying archive nodes is less effective because of the sheer DB size involved; the best option would be a state-synced node that retains chain state only a bit longer than the packets have been stuck.

@adizere In my opinion, pagination or batching support in situations like this would greatly improve the relayer operations experience.
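For reference, that kind of node is set up via state sync in the node's own (Tendermint) config.toml, not in Hermes. A sketch with placeholder values; the trust height and hash must come from a trusted RPC node:

    [statesync]
    enable = true
    rpc_servers = "https://rpc-1.example.com:443,https://rpc-2.example.com:443"
    trust_height = 990000                        # placeholder recent height
    trust_hash = "<block hash at trust_height>"  # placeholder
    trust_period = "168h"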

faddat (Contributor) commented Feb 25, 2022

You know, you're right, and I probably should have mentioned that an archive node only really helps when you're dealing with older packets.

So one thing you probably do not want to do is state-sync right now; do it next time instead, or you might not have the needed history.

RohitAudit (Author) commented Feb 25, 2022

I tried using two Hermes relayers, but both of them pick the same transactions and one fails.
For now, I have started a Go relayer as well; let's see if it helps increase the speed.
Thanks for the quick response.

clemensgg commented Feb 25, 2022

The only thing that will significantly help this situation is dedicated endpoints for each redundant relayer (particularly on the src_chain).
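Concretely, that means pointing each relayer instance at its own node in the per-chain section of its Hermes config. A sketch with placeholder addresses; a second instance would use a different node here:

    [[chains]]
    id = 'gravity-bridge-3'
    rpc_addr = 'http://relayer1-node.example.com:26657'
    grpc_addr = 'http://relayer1-node.example.com:9090'
    websocket_addr = 'ws://relayer1-node.example.com:26657/websocket'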

faddat (Contributor) commented Feb 25, 2022

Yep, more endpoints will be better.

adizere modified the milestones: v0.14.0, v0.13.0 on Mar 22, 2022
adizere (Member) commented Mar 22, 2022

We'll be tracking support for query pagination in #1816. Closing this as hopefully addressed.

adizere closed this as completed on Mar 22, 2022