Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Packet clearing troubleshooting runbook #2612

Open
5 tasks
greg-szabo opened this issue Sep 2, 2022 · 5 comments
Open
5 tasks

Packet clearing troubleshooting runbook #2612

greg-szabo opened this issue Sep 2, 2022 · 5 comments
Labels
A: blocked Admin: blocked by another (internal/external) issue or PR

Comments

@greg-szabo
Copy link
Member

greg-szabo commented Sep 2, 2022

Proposal

I've been trying to clear some packets between Terra and Sifchain. There's one stuck and the runbook I have in my mind doesn't give me enough info to figure out what's happening. I would like more ideas (maybe even document it in a runbook) to do to identify issues while clearing packets.

Basic steps:

  1. Query list of commitments between the two chains.
  2. Run clear packet
  3. Query list of commitments again.
  4. For any stuck packet, query the packet details

Step 1: you can see there's one packet not completely cleared yet. (I re-ran the steps multiple times to see that other processes don't clear it.)

$ hermes --config manual.toml query packet commitments --chain columbus-5 --port transfer --channel channel-7
2022-09-02T23:33:20.279156Z DEBUG ThreadId(01) Options: QueryPacketCommitmentsCmd { chain_id: ChainId { id: "columbus-5", version: 5 }, port_id: PortId("transfer"), channel_id: ChannelId("channel-7") }
SUCCESS PacketSeqs {
    height: Height {
        revision: 5,
        height: 9211691,
    },
    seqs: [
        Sequence(
            2196,
        ),
    ],
}

Step 2: try to clear it:

$ hermes --config manual.toml  clear packets --chain columbus-5 --port transfer --channel channel-7
2022-09-02T23:32:41.336195Z  INFO ThreadId(01) PacketRecvCmd{src_chain=columbus-5 src_port=transfer src_channel=channel-7 dst_chain=sifchain-1}: unreceived packets found: 1
2022-09-02T23:32:41.360177Z  INFO ThreadId(01) PacketRecvCmd{src_chain=columbus-5 src_port=transfer src_channel=channel-7 dst_chain=sifchain-1}: pulled packet data for 0 events; events_total=1 events_left=0
SUCCESS []

It is indicated that it has been cleared. Older versions of Hermes used to show a list of packets involved in the step, I wish that came back so we know our exact seq.num 2196 was the found packet.

Step 3: query again:

$ hermes --config manual.toml query packet commitments --chain columbus-5 --port transfer --channel channel-7
2022-09-02T23:38:17.099377Z DEBUG ThreadId(01) Options: QueryPacketCommitmentsCmd { chain_id: ChainId { id: "columbus-5", version: 5 }, port_id: PortId("transfer"), channel_id: ChannelId("channel-7") }
SUCCESS PacketSeqs {
    height: Height {
        revision: 5,
        height: 9211741,
    },
    seqs: [
        Sequence(
            2196,
        ),
    ],
}

Step 4: let's see what we know of that packet

$ hermes --config manual.toml query packet commitment --chain columbus-5 --port transfer --channel channel-7 --sequence 2196
2022-09-02T23:39:13.644705Z DEBUG ThreadId(01) Options: QueryPacketCommitmentCmd { chain_id: ChainId { id: "columbus-5", version: 5 }, port_id: PortId("transfer"), channel_id: ChannelId("channel-7"), sequence: Sequence(2196), height: None }
SUCCESS "1C08B3E8E8A99F1C0F50E00236FAB59656B839C73320EE8A2AC41F340E0D29D6"

Welp, not very useful.
Maybe pending packets will help?

$ hermes --config manual.toml query packet pending --chain columbus-5 --port transfer --channel channel-7
2022-09-02T23:40:21.910303Z DEBUG ThreadId(01) fetched channel from source chain: IdentifiedChannelEnd { port_id: PortId("transfer"), channel_id: ChannelId("channel-7"), channel_end: ChannelEnd { state: Open, ordering: Unordered, remote: Counterparty { port_id: PortId("transfer"), channel_id: Some(ChannelId("channel-18")) }, connection_hops: [ConnectionId("connection-19")], version: Version("ics20-1") } } chain=columbus-5
SUCCESS Summary {
    src: PendingPackets {
        unreceived_packets: [
            Sequence(
                2196,
            ),
        ],
        unreceived_acks: [],
    },
    dst: PendingPackets {
        unreceived_packets: [],
        unreceived_acks: [],
    },
}

Um... no. All I can think of is that we need to time this packet out. It's been in the queue for a while.
Running acks,pendig-acks,pending-sends shows empty results. This packet is there to be committed.

As a last hail Mary, we could try to force a timeout packet, let's see:

$ hermes --config manual.toml tx packet-recv --dst-chain sifchain-1 --src-chain columbus-5 --src-port transfer --src-channel channel-7
2022-09-02T23:43:58.599079Z  INFO ThreadId(01) PacketRecvCmd{src_chain=columbus-5 src_port=transfer src_channel=channel-7 dst_chain=sifchain-1}: unreceived packets found: 1
2022-09-02T23:43:58.621206Z  INFO ThreadId(01) PacketRecvCmd{src_chain=columbus-5 src_port=transfer src_channel=channel-7 dst_chain=sifchain-1}: pulled packet data for 0 events; events_total=1 events_left=0
SUCCESS []

But still the packet is listed under commitments. (I tried sending a packet-ack because why not, but that obviously gave an empty "SUCCESS" back.)

So, how should we get to know more about the packet and the necessary steps to send it on its way?

Acceptance Criteria

Commitment queue empty.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere
Copy link
Member

adizere commented Sep 12, 2022

Seems very similar to this issue
#2453

@greg-szabo try using --packet-data-query-height H when you're doing tx packet-recv. The H value should be the height where the packet 2196 was originally created on columbus-5.

@adizere
Copy link
Member

adizere commented Sep 14, 2022

It is indicated that it has been cleared. Older versions of Hermes used to show a list of packets involved in the step, I wish that came back so we know our exact seq.num 2196 was the found packet.

I think we have two problem we can fix on the Hermes logs/return message side:

  • Make it clear that SUCCESS [] means: the command returned succesfully and there is nothing to output, i.e., no events were produced, in this case because no transaction was submitted
  • Make it clear that the command fails to pull packet data for the given unreceived packet it found. This is currently logged ambiguously as follows:

pulled packet data for 0 events; events_total=1 events_left=0

This is technically an error/corner case, not a success.

@adizere adizere added the A: blocked Admin: blocked by another (internal/external) issue or PR label Sep 14, 2022
@adizere
Copy link
Member

adizere commented Sep 20, 2022

Additional context: the underlying problem is from Tendermint, which is overwriting a tx index if it succeeded the first time, and it being fixed here tendermint/tendermint#8625

@adizere
Copy link
Member

adizere commented Oct 6, 2022

Checked with MZ and HS today from Cephalopod and the packets are still stuck. We also checked and one of the clients is expired.

  • Client "07-tendermint-19" on Columbus-5 has a trusting period of 2 weeks and latest header is ~22 hours old -- not expired.
  • Client "07-tendermint-42" on Sifchain has a trusting period of 57600s (0.67 days) and latest header is from Sept 19 -- expired.

We'll try to manually debug the stuck packets once the client is recovered.

Channel ends are:

ChannelEndsSummary {
    chain_id: ChainId {
        id: "columbus-5",
        version: 5,
    },
    client_id: ClientId(
        "07-tendermint-19",
    ),
    connection_id: ConnectionId(
        "connection-19",
    ),
    channel_id: ChannelId(
        "channel-7",
    ),
    port_id: PortId(
        "transfer",
    ),
    counterparty_chain_id: ChainId {
        id: "sifchain-1",
        version: 1,
    },
    counterparty_client_id: ClientId(
        "07-tendermint-42",
    ),
    counterparty_connection_id: ConnectionId(
        "connection-21",
    ),
    counterparty_channel_id: ChannelId(
        "channel-18",
    ),
    counterparty_port_id: PortId(
        "transfer",
    ),
}

@kaisbaccour
Copy link

I have exactly the same issue between juno uni-5 and nois-testnet-003.
13 commitments that I am not able to clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: blocked Admin: blocked by another (internal/external) issue or PR
Projects
None yet
Development

No branches or pull requests

3 participants