This repository has been archived by the owner on Nov 14, 2023. It is now read-only.

RPC nodes unreliable? chain-cosmos-sdk: Error: post failed: Post "http://...:26657/": EOF #30

Closed
dckc opened this issue Aug 24, 2021 · 13 comments
Labels
testnet-problem issues / problems surfaced during testnet

Comments

@dckc
Member

dckc commented Aug 24, 2021

Describe the bug

We're seeing a lot of EOF / RESET errors from RPC nodes.

Additional context

possibly a dup of #26

see also: NEW: Conduct performance analysis

@dckc
Member Author

dckc commented Aug 24, 2021

@dtribble points out there's a single lock across RPC queries, JS execution, etc.

This is the PR that created the ABCI locking situation cosmos/cosmos-sdk#8549
cosmos/cosmos-sdk#8591

osmosis-labs/osmosis#414 is the osmosis issue that references all the other ones.
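
To illustrate why one shared lock can hurt, here is a toy Python model. It is not Tendermint or cosmos-sdk code; `Node`, `execute_block`, and `query` are hypothetical names, and the timeout is an assumption. A query that cannot get the lock while a block is executing gives up, which is roughly how a stalled RPC query might surface to a client as an EOF.

```python
import threading


class Node:
    """Toy node: one lock shared by block execution and RPC queries."""

    def __init__(self):
        self._lock = threading.Lock()
        self.height = 0

    def execute_block(self, started, finish):
        # Block execution holds the lock for its whole duration.
        with self._lock:
            started.set()
            finish.wait()
            self.height += 1

    def query(self, timeout):
        # A query needs the same lock; give up if it is not free in time.
        if not self._lock.acquire(timeout=timeout):
            return None  # stands in for a dropped connection
        try:
            return self.height
        finally:
            self._lock.release()


def demo():
    node = Node()
    started, finish = threading.Event(), threading.Event()
    worker = threading.Thread(target=node.execute_block, args=(started, finish))
    worker.start()
    started.wait()                     # the block is now executing
    during = node.query(timeout=0.05)  # lock is held, so the query fails
    finish.set()
    worker.join()
    after = node.query(timeout=0.05)   # lock is free, so the query succeeds
    return during, after
```

Calling `demo()` shows the query failing while the block runs and succeeding once it finishes.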

@yelllowsin

I am getting this same issue, running updated RPC agorictest-17.3:

2021-08-24T17:52:50.688Z chain-cosmos-sdk: Error: post failed: Post "http://161.97.168.77:26657/": EOF

and randomly the error changes to this:

2021-08-24T17:52:57.663Z chain-cosmos-sdk: Error: rpc error: code = NotFound desc = rpc error: code = NotFound desc = egress not found: key not found

My agoric address has both BLD and RUN tokens:

https://testnet.explorer.agoric.net/account/agoric1xsh9py32qyk3lvtavxsgpjh7jry6c7xh6g2lqm

@dckc
Member Author

dckc commented Aug 24, 2021

your client retries and continues to make progress, right, @yelllowsin ?

@yelllowsin

your client retries and continues to make progress, right, @yelllowsin ?

It keeps retrying, yes, but it fails every time with the errors I mentioned. Here is a longer log:

log1.log

@dckc
Member Author

dckc commented Aug 25, 2021

Agoric clients (ag-solo, ...) are designed to retry when submitting transactions fails. As long as your client logs continue to show

2021-08-24T16:37:44.867Z chain-cosmos-sdk: delivering to chain (trips=62) ...
...
2021-08-24T16:38:34.926Z chain-cosmos-sdk: Error: post failed: Post "http://167.172.36.124:26657/": EOF
Usage:
...
2021-08-24T16:39:26.404Z chain-cosmos-sdk: delivering to chain (trips=63) ...

every minute or so, then there's a good chance that what you're doing will eventually succeed.

If your client logs go silent for more than a couple minutes, you are probably seeing some other issue.
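
The retry behaviour described above can be sketched roughly as follows. This is a simplified Python illustration, not the actual ag-solo code: `deliver_with_retry`, `make_flaky`, and the `max_trips` and `delay` parameters are hypothetical, and a `ConnectionError` stands in for the EOF from the RPC node.

```python
import time


def deliver_with_retry(send, max_trips=5, delay=0.0, sleep=time.sleep):
    """Call send() until it succeeds or max_trips attempts have failed.

    Returns (trip_number, result) for the first successful attempt.
    """
    if max_trips < 1:
        raise ValueError("max_trips must be at least 1")
    last_error = None
    for trip in range(1, max_trips + 1):
        try:
            return trip, send()
        except ConnectionError as err:  # stand-in for 'post failed: ... EOF'
            last_error = err
            sleep(delay)                # wait before the next trip
    raise last_error


def make_flaky(failures):
    """Build a send() that fails `failures` times and then returns 'ok'."""
    state = {"left": failures}

    def send():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("EOF")
        return "ok"

    return send
```

With `make_flaky(2)`, the third trip succeeds; if every trip fails, the last error is raised to the caller.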

@dckc
Member Author

dckc commented Aug 25, 2021

@michaelfig writes at 10:40pm Chicago time last night:

current state of the RPC nodes: using 10 load-balanced RPC nodes doesn't appear to recover the performance of the ag-solo. It's still slow, and all of those nodes had a hard time keeping up with the chain. It looks like the ABCI lock is really killing us. Regardless of whether the locking issue is resolved, moving to solo IBC (or something like it that uses events better and avoids so many queries to the RPC nodes) will be necessary to reduce each connection's overhead.

Our current solo-to-chain listens for "new block" events (on a WebSocket to an RPC server), and for each new block, queries the ag-solo's on-chain mailbox. That means we have n queries per block given n running ag-solos. IIUC, IBC channels instead publish all packet data as individual RPC server events (pushed to WS clients), and then rely on the freshness of the receiving chain's light client to ensure the packet's data has been committed by the sending chain's voting set.

I don't yet have a complete understanding of how the solo IBC is supposed to work, but it would make sense if we somehow ran an IBC relayer as part of the ag-solo, sending and receiving traffic of the chain directly into the ag-solo (replacing chain-cosmos-sdk.js).
So the reduction in queries is because the IBC events are caused by actual traffic rather than polling once every block "just in case" some traffic was sent.
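
A back-of-the-envelope comparison of the two approaches, as a small Python sketch. The function names and the example numbers are illustrative assumptions, not measurements from the testnet.

```python
def polling_queries(num_solos, num_blocks):
    """Mailbox queries when every ag-solo polls once per new block."""
    return num_solos * num_blocks


def event_deliveries(messages_per_block):
    """Deliveries when events are pushed only for actual traffic."""
    return sum(messages_per_block)


# Assumed example: 100 ag-solos over 10 blocks, with only 3 real messages.
polled = polling_queries(100, 10)
pushed = event_deliveries([0, 2, 0, 0, 1, 0, 0, 0, 0, 0])
```

In this assumed example, polling costs 1000 RPC queries while the event-driven approach handles 3 deliveries, because its cost scales with traffic rather than with the number of clients times the number of blocks.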

@ASergijenko

ASergijenko commented Aug 25, 2021

We have the same issue: it always tries to connect to the remote server, but with no luck.

root@template:~/testnet-load-generator# agoric start testnet 8000 https://testnet.agoric.net/network-config
agoric: start: /root/agoric-sdk/packages/solo/src/entrypoint.js --webport=8000 setup --netconfig=https://testnet.agoric.net/network-config
downloading netconfig from https://testnet.agoric.net/network-config
Already have an entry for 80280a37099594d8be7323f30a7c757535bd014c9f5198303b3f866d6656e41a; not replacing
2021-08-25T17:07:47.298Z web: Listening on 127.0.0.1:8000
2021-08-25T17:08:04.186Z chain-cosmos-sdk: Error: post failed: Post "http://178.128.51.171:26657/": EOF
Usage:
ag-chain-cosmos query swingset egress [account] [flags]

Flags:
--height int Use a specific height to query state at (this can error if the node is pruning state)
-h, --help help for egress
--node string <host>:<port> to Tendermint RPC interface for this chain (default "tcp://localhost:26657")
-o, --output string Output format (text|json) (default "text")

Global Flags:
--chain-id string The network chain ID
--home string directory for config and data (default "/root/.ag-cosmos-helper")
--log_format string The logging format (json|plain) (default "plain")
--log_level string The logging level (trace|debug|info|warn|error|fatal|panic) (default "info")
--trace print out full stack trace on errors
2021-08-25T17:08:04.186Z chain-cosmos-sdk:

agorictest-17 chain does not yet know of address agoric14gqjcpv0l7dx407944zzgn087vl3f4tl82jg42

Send:

!faucet client agoric14gqjcpv0l7dx407944zzgn087vl3f4tl82jg42

to the appropriate faucet channel on Discord (https://agoric.com/discord)

@dckc dckc added the testnet-problem issues / problems surfaced during testnet label Aug 25, 2021
@unordered-set

I even tried connecting it to the local RPC node, but no luck.

@dckc
Member Author

dckc commented Aug 26, 2021

@unordered-set could you give more details? What command did you use to connect to the local RPC node? What response did you get? How is your RPC node set up?

@unordered-set

@dckc Thanks for helping, Dan!

UPD: It turned out that the port to connect to was 8001, not 8000, so I just had to change my SSH tunnel, and it worked. Sorry.

The node still tries to also connect to remote host:

Not sure it was worth doing, but here is what I did:

  1. Created a custom network-config file and placed it where the container can see it:
cat ~/.agoric/network-config
{
  "chainName": "agorictest-17",
  "gci": "80280a37099594d8be7323f30a7c757535bd014c9f5198303b3f866d6656e41a",
  "peers": [
    "[email protected]:26656",
    "[email protected]:26656",
    "[email protected]:26656",
    "[email protected]:26656"
  ],
  "rpcAddrs": [
    "127.0.0.1:26657"
  ],
  "seeds": [
    "[email protected]:26656",
    "[email protected]:26656"
  ]
}
  2. Made sure that I'm running docker-compose in network=host mode and that my local RPC is accessible:
# docker exec -it agoricsdk_ag-solo_1 /bin/bash
(inside docker) root@Ubuntu-1804-bionic-64-minimal:/data/solo# curl -v 127.0.0.1:26657
* Expire in 0 ms for 6 (transfer 0x5570d522af90)
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5570d522af90)
* Connected to 127.0.0.1 (127.0.0.1) port 26657 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1:26657
> User-Agent: curl/7.64.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/html
< X-Server-Time: 1629939716
< Date: Thu, 26 Aug 2021 01:01:56 GMT
< Transfer-Encoding: chunked
<
<html><body><br>Available endpoints:<b...
  3. So I think the RPC config was changed and the endpoint is accessible, but I am still not able to connect to my node:
curl -v 127.0.0.1:8000
* Rebuilt URL to: 127.0.0.1:8000/
*   Trying 127.0.0.1...
* TCP_NODELAY set
* connect to 127.0.0.1 port 8000 failed: Connection refused
* Failed to connect to 127.0.0.1 port 8000: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 127.0.0.1 port 8000: Connection refused

and also in logs:

 2021-08-26T01:06:38.848Z chain-cosmos-sdk: Error: post failed: Post "http://164.90.247.20:26657": EOF

but because it works, not sure if I should be bothered.
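
Since the log above still shows a remote host, one way to double-check a custom network-config is to parse its `rpcAddrs` and confirm that only local endpoints are listed. This is a minimal Python sketch; `rpc_endpoints` and `only_local` are hypothetical helpers, and it assumes every entry has the form `host:port`. It only inspects the file contents and says nothing about whether ag-solo actually picked up the file.

```python
import json


def rpc_endpoints(config_text):
    """Return (host, port) pairs from the rpcAddrs list of a network-config."""
    config = json.loads(config_text)
    endpoints = []
    for addr in config.get("rpcAddrs", []):
        # Assumes 'host:port'; split on the last colon.
        host, _, port = addr.rpartition(":")
        endpoints.append((host, int(port)))
    return endpoints


def only_local(endpoints):
    """True if every endpoint points at the local machine."""
    return all(host in ("127.0.0.1", "localhost") for host, _ in endpoints)


# Abbreviated example using the values from the config shown above.
example = '{"chainName": "agorictest-17", "rpcAddrs": ["127.0.0.1:26657"]}'
```

For the config shown above, `rpc_endpoints(example)` yields a single local endpoint and `only_local` returns `True`, so the remote address in the log would have to come from somewhere other than this file.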

@dckc
Member Author

dckc commented Sep 1, 2021

Agoric/agoric-sdk#3763 is a plan to improve the ABCI locking situation.

@michaelfig
Member

Agoric/agoric-sdk#3763 (comment) points to my analysis of Tendermint locking, and a perf test of a simple change to allow parallelism again.

@dckc dckc added this to the beta-globulin milestone Sep 7, 2021
@michaelfig
Member

I've verified that this failure is a consequence of a stalled query due to the RPC locking. The Agoric/agoric-sdk#3805 PR fixes this.

Marking as closed for now.


5 participants