
Excessive Bandwidth Consumption Post v36.0 Node Upgrade #231

Closed
88plug opened this issue Jun 21, 2024 · 10 comments
Labels: P0, repo/node (Akash node repo issues)

88plug commented Jun 21, 2024

Description:
Since upgrading to Akash node release v36.0, nodes have been consuming an unusually high amount of bandwidth, far exceeding the previous usage. This has resulted in "out of bandwidth" notifications across multiple nodes in various datacenters, as well as noticeable lag on residential networks. No changes were made to the default deployment code.

To Reproduce:

  1. Deploy an Akash node using the default deployment code.
  2. Monitor the bandwidth usage of the node.
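
A minimal way to monitor the node's throughput while reproducing (a sketch only; it assumes vnstat is installed and that eth0 is the node's network interface):

# Live per-second traffic on the node's interface (interface name is an assumption).
vnstat -l -i eth0

# Daily totals, useful for spotting the jump after the v36.0 upgrade.
vnstat -d -i eth0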

Expected Behavior:
The node should have a sustained bandwidth usage of approximately 5,100,000 BPS (5.1 Mbps) for incoming and 6,000,000 BPS (6 Mbps) for outgoing traffic.

Traffic Analysis:
Upon reviewing the attached screenshots:

  • Pre-Upgrade Bandwidth Usage:
    • Incoming bandwidth: Approximately 5,100,000 BPS (5.1 Mbps).
    • Outgoing bandwidth: Approximately 6,000,000 BPS (6 Mbps).
  • Post-Upgrade Bandwidth Usage: There is a significant spike in bandwidth consumption.
    • Outgoing bandwidth: Approximately 600,000,000 BPS (600 Mbps).
    • Incoming bandwidth: Approximately 50,000,000 BPS (50 Mbps).

Attempted Fixes:
I have attempted to limit the P2P connections and adjust the send_rate and recv_rate parameters in the node's config.toml. Despite these efforts, the issue persists.
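
For reference, this is roughly the kind of change attempted — a sketch only, with illustrative values and a default ~/.akash home assumed:

# Tighten the p2p rate limits and peer caps in config.toml (values are examples, not the exact ones used).
sed -i 's/^send_rate *=.*/send_rate = 1024000/' ~/.akash/config/config.toml
sed -i 's/^recv_rate *=.*/recv_rate = 1024000/' ~/.akash/config/config.toml
sed -i 's/^max_num_inbound_peers *=.*/max_num_inbound_peers = 20/' ~/.akash/config/config.toml
sed -i 's/^max_num_outbound_peers *=.*/max_num_outbound_peers = 10/' ~/.akash/config/config.toml

# Restart the node so the new limits take effect (service name as used elsewhere in this thread).
sudo systemctl restart akash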

Request:
Please investigate this issue further and release a fix that stops the irregular bandwidth consumption.

Recommendation:
Anyone running an Akash node should check their bandwidth consumption and traffic to ensure they are not affected by this issue. Please create a point release for v36.0 that stops the excessive bandwidth consumption.

Monthly Traffic View:

[screenshot: monthly traffic]

Before Upgrade Daily:

[screenshot: daily traffic before upgrade]

After Upgrade Daily:

[screenshot: daily traffic after upgrade]

Additional Context:
This issue is critical as it affects the performance and reliability of the nodes across various datacenters and residential networks. Immediate attention and resolution are required.

chainzero added the repo/node (Akash node repo issues) and P0 labels and removed the awaiting-triage label on Jun 21, 2024
chainzero (Collaborator) commented:

@88plug - could you please confirm:

1). Are these nodes being built via the Akash Helm Charts? Asking because the Helm Charts set minimum_gas_prices: 0.025uakt, and we want to ensure this setting is in place on the affected nodes.

2). During node startup, are there any log entries regarding 0 gas prices?
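
For affected operators, a quick way to check both — a sketch assuming a default ~/.akash home and the akash.service systemd unit used elsewhere in this thread (note the key is typically spelled minimum-gas-prices in app.toml):

# 1) Confirm the minimum gas price actually in effect in app.toml (pattern matches either spelling).
grep -i "minimum.gas.prices" ~/.akash/config/app.toml

# 2) Scan recent start-up logs for gas-related entries.
sudo journalctl -u akash.service --since "1 hour ago" | grep -i gas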

chainzero (Collaborator) commented:

Review from an additional node operator impacted by the increased P2P traffic:

  • Nodes built via both the CLI and the Helm Chart are experiencing heightened traffic

  • The CLI node build has the minimum_gas_prices: 0.025uakt setting in app.toml

  • Helm Chart default values were not changed and thus should have minimum_gas_prices: 0.025uakt

  • The bandwidth is NOT increasing further over time, i.e. the P2P bandwidth rose considerably a few days ago and has been steady at that level since

  • No evidence of 0 gas fees in the node logs, but the logs are littered with "failed to add vote" errors such as:

Jun 21 19:46:38 mainnet-node start-node.sh[32110]: ERR failed to process message err="error adding vote" height=16846721 module=consensus msg_type=*consensus.VoteMessage peer=2a3ba81a7ddb00016af1593f925aed390c4bcca9 round=0
Jun 21 19:46:38 mainnet-node start-node.sh[32110]: INF failed attempting to add vote err="expected 16846720/1/2, but got 16846720/0/2: unexpected step" module=consensus ...

c29r3 commented Jun 22, 2024

Review from an additional node operator impacted by the increased P2P traffic:

  • Nodes built via both the CLI and the Helm Chart are experiencing heightened traffic
  • The CLI node build has the minimum_gas_prices: 0.025uakt setting in app.toml
  • Helm Chart default values were not changed and thus should have minimum_gas_prices: 0.025uakt
  • The bandwidth is NOT increasing further over time, i.e. the P2P bandwidth rose considerably a few days ago and has been steady at that level since
  • No evidence of 0 gas fees in the node logs, but the logs are littered with "failed to add vote" errors such as:
Jun 21 19:46:38 mainnet-node start-node.sh[32110]: ERR failed to process message err="error adding vote" height=16846721 module=consensus msg_type=*consensus.VoteMessage peer=2a3ba81a7ddb00016af1593f925aed390c4bcca9 round=0
Jun 21 19:46:38 mainnet-node start-node.sh[32110]: INF failed attempting to add vote err="expected 16846720/1/2, but got 16846720/0/2: unexpected step" module=consensus ...

It seems that this issue is observed in other networks as well, for example the Sentinel network:
[screenshot]
https://x.com/zeroservices_eu/status/1784553362316288174

I'm not sure exactly how this problem arises, but it seems that it spreads through specific peers (full node/RPC):
[screenshots]

grep "00a39ac3ec012ffa3116a162c17f49df484d0298" .akash/config/config.toml

image

grep -A 2 -B 2 "00a39ac3ec012ffa3116a162c17f49df484d0298" .akash/config/addrbook.json
image

I'm not sure why this P2P address appears in the address book 123 times 😳

grep "00a39ac3ec012ffa3116a162c17f49df484d0298" .akash/config/addrbook.json | wc -l
123
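
If anyone else wants to check their own address book for the same pattern, a small sketch (it assumes jq is installed and the standard Tendermint addrbook layout with an addrs[].addr.id field) that lists the most-duplicated peer IDs:

# Top 5 most-duplicated peer IDs in the address book.
jq '[.addrs[].addr.id] | group_by(.) | map({id: .[0], count: length}) | sort_by(-.count) | .[:5]' \
  .akash/config/addrbook.json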

Attachment: addrbook.json

troian (Member) commented Jun 22, 2024

@c29r3 can you backup your addrbook and try one from polkachu
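
For anyone following along, a minimal sketch of that swap (the Polkachu URL is an assumption based on their usual snapshot layout; verify it before use):

# Stop the node and back up the current address book.
sudo systemctl stop akash
cp ~/.akash/config/addrbook.json ~/.akash/config/addrbook.json.bak

# Fetch a fresh address book (URL is an assumption; confirm before relying on it).
curl -fL -o ~/.akash/config/addrbook.json https://snapshots.polkachu.com/addrbook/akash/addrbook.json

sudo systemctl start akash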

c29r3 commented Jun 22, 2024

@c29r3 can you backup your addrbook and try one from polkachu

Done, but err="error adding vote" still exists

Here is the traffic for the last 48 hours
[screenshot]

c29r3 commented Jun 22, 2024

I enabled debug logging (--log_level debug) and saved the last 20 minutes of logs from my RPC node:

sudo journalctl -u akash.service --no-hostname --since "20 minutes ago" | grep -v p2p > akash_20min_log.txt

https://snapshots.c29r3.xyz/akash/akash_20min_log_debug.zip

88plug (Author) commented Jun 24, 2024

Fixes excessive bandwidth #285

I did a battery of tests over the weekend and was able to resolve the issue.

The issue appears to be that the p2p seed_mode is set to true for the node in the Helm charts.

The Cosmos default is pex = true and seed_mode = false.

I have updated the Helm charts and tested with seed_mode disabled, and the excessive bandwidth issue is resolved.
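
For operators running a bare (non-Helm) node, the equivalent check and change would look roughly like this — a sketch assuming a default ~/.akash home and the akash systemd unit:

# Confirm the current p2p settings; seed_mode should be false for a regular node.
grep -E '^(seed_mode|pex) *=' ~/.akash/config/config.toml

# Disable seed mode (pex stays at its default of true), then restart the node.
sed -i 's/^seed_mode *=.*/seed_mode = false/' ~/.akash/config/config.toml
sudo systemctl restart akash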

For reference, in my testing I also found that "error adding vote" still appears with the 0.025uakt fee set, so that may indicate some other issue, but it was not related to the bandwidth.

chainzero (Collaborator) commented Jun 24, 2024

The issue was caused by IBC relayers allowing zero/very-low-gas TXs onto the network and into the mempool. While Akash RPC/validator nodes are universally configured to reject zero-gas TXs, a number of IBC relayers were not configured to reject them.

The issue was resolved by:

1). Specific validators intentionally setting their minimum gas requirement to zero so that these TXs could be written to the chain, thereby clearing them from the validator mempools.

2). Working with the current IBC relayers to ensure they have minimum gas settings in place.

Network P2P traffic is now normalized.
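
For relayer operators who want to double-check their own side, a minimal sketch assuming the Hermes relayer and its default config location (other relayers keep this setting elsewhere):

# Show the gas price configured per chain in a Hermes relayer config
# (path and key name are assumptions for Hermes; adjust for your relayer).
grep -n "gas_price" ~/.hermes/config.toml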

troian (Member) commented Jun 27, 2024

@Krewedk0 It's not quite correct.

  1. All validators had, and still have, minimum gas fees > 0.
  2. There were a few relayers using the default configuration from the chain registry, which had minimum gas fees set to 0.
  3. Due to the IBC v4 design, transactions coming via IBC with 0 or very small gas fees could enter the mempool and get stuck there forever, because:
  • all validators have gas fees set to the correct level, and
  • tx recheck in Cosmos SDK v0.45.x does not work correctly: even though the transactions had expired, they still stayed in the mempool.
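
A quick way for an operator to see whether such stuck transactions are sitting in a node's mempool — a sketch using the standard Tendermint RPC endpoints on the default local RPC port:

# Number of unconfirmed transactions currently in the mempool.
curl -s localhost:26657/num_unconfirmed_txs

# Inspect up to 20 of them to look for stuck zero/low-fee IBC transactions.
curl -s "localhost:26657/unconfirmed_txs?limit=20"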

Krewedk0 commented Jun 28, 2024

@troian Deleted my last comment so as not to give bad people good ideas. But I ran some tests last night, and you can actually do very nasty stuff with the setup I mentioned.
Also, Chandra Station and 16psyche still have 0 min gas prices set on their validator nodes.
