Reduce max staked streams count to avoid fragmentations #32771

lijunwangs
Contributor

Problem

We are seeing sporadic performance degradation in the bare-metal local-cluster bench-tps runs. Metrics indicate poor and uneven QUIC stream performance, both in staked_received_chunks on the server side for forwarded packets and in num_packets on the client side. They also show spikes in stream read timeouts on the server for forwarded packets. This differs from a regular unstaked node using thin-client with QUIC. The likely reason is that too many concurrent streams contend for the limited receive_window bandwidth set on the connection.

Experiments: with the max count reduced to 1024 the issue is still reproducible, while setting it to 512 shows stable performance over many runs (11+ runs). Results:

https://buildkite.com/solana-labs/solana-local-cluster/builds/738#_
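
To make the contention argument concrete, here is a minimal sketch that splits a connection-level receive window evenly across concurrent streams. The helper `per_stream_share_bytes` and the 1,000,000-byte window are hypothetical, chosen only for illustration; they are not values from this PR, and real QUIC flow control (per-stream limits, retransmits, scheduling) is more nuanced than an even split.

```rust
/// Hypothetical helper (not part of the Solana codebase): the average number
/// of receive-window bytes each stream gets if the connection-level window is
/// shared evenly among all concurrently open streams.
fn per_stream_share_bytes(receive_window_bytes: u64, concurrent_streams: u64) -> u64 {
    assert!(concurrent_streams > 0, "need at least one stream");
    receive_window_bytes / concurrent_streams
}

fn main() {
    // Assumed window size, for illustration only.
    let receive_window_bytes: u64 = 1_000_000;

    // 1024 and 512 are the two stream limits tried in the experiments above.
    for streams in [1024u64, 512] {
        println!(
            "{} concurrent streams -> about {} bytes of window per stream",
            streams,
            per_stream_share_bytes(receive_window_bytes, streams)
        );
    }
}
```

Under this simplified model, halving the stream limit doubles the share of the window available to each stream, which is consistent with fewer read timeouts at the lower limit.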

Summary of Changes

Reduce max staked concurrent streams.

Fixes #32179

@codecov
codecov bot commented Aug 9, 2023

Codecov Report

Merging #32771 (dc4831e) into master (e700dde) will increase coverage by 0.0%.
Report is 1 commits behind head on master.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##           master   #32771   +/-   ##
=======================================
  Coverage    82.0%    82.0%           
=======================================
  Files         785      785           
  Lines      212075   212075           
=======================================
+ Hits       173945   173958   +13     
+ Misses      38130    38117   -13     

Contributor

@apfitzge apfitzge left a comment

It appears this solves the issue I'd been seeing. Very unlikely we'd have 11+ runs in a row w/o seeing a bad run if the issue was not mitigated.

I wonder if decreasing the maximum number of streams will decrease our maximum throughput? Do we have any benches related to that?
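
The "11+ runs in a row" argument can be sketched numerically. Assuming each run goes bad independently with some probability (the rates below are hypothetical; the actual pre-fix failure rate is not stated in this thread), the chance of 11 consecutive clean runs is (1 - p)^11:

```rust
/// Probability of seeing `runs` consecutive good runs when each run
/// independently goes bad with probability `p_bad`.
fn prob_all_good(p_bad: f64, runs: u32) -> f64 {
    (1.0 - p_bad).powi(runs as i32)
}

fn main() {
    // Hypothetical bad-run rates, for illustration only.
    for p_bad in [0.1, 0.2, 0.3] {
        println!(
            "bad-run rate {:.0}% -> P(11 clean runs) = {:.3}",
            p_bad * 100.0,
            prob_all_good(p_bad, 11)
        );
    }
}
```

For example, with an assumed 20% bad-run rate, 11 clean runs in a row would happen by chance only about 8.6% of the time.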

@lijunwangs
Contributor Author

> It appears this solves the issue I'd been seeing. Very unlikely we'd have 11+ runs in a row w/o seeing a bad run if the issue was not mitigated.
>
> I wonder if decreasing the maximum number of streams will decrease our maximum throughput? Do we have any benches related to that?

Sorry for the late response; I was sidetracked by some other issues.

I have some results from bench-tps using rpc-client, which triggers the send-transaction-service to send packets as a staked node and therefore exercises the QUIC_MAX_STAKED_CONCURRENT_STREAMS parameter. I have not found material differences with or without the change in my local cluster testing.

Run configuration:

lijun@lijun-dev:~/sol2/solana$ ./cargo run --release --bin solana-bench-tps -- -u http://35.233.177.221:8899 --identity /home/lijun/.config/solana/id.json --tx_count 1000 --thread-batch-sleep-ms 0 -t 20 --duration 30 -n 35.233.177.221:8001 --read-client-keys /home/lijun/gce-keypairs.yaml --use-rpc-client

With change:

Highest TPS: 22553.56 sampling period 1s max transactions: 358857 clients: 1 drop rate: 0.16

[2023-08-15T06:54:28.184476833Z INFO solana_bench_tps::bench] Average TPS: 11562.491

Highest TPS: 24177.71 sampling period 1s max transactions: 363554 clients: 1 drop rate: 0.11

[2023-08-15T06:55:52.398316168Z INFO solana_bench_tps::bench] Average TPS: 11714.341

Highest TPS: 23858.41 sampling period 1s max transactions: 316322 clients: 1 drop rate: 0.26

[2023-08-15T06:56:49.020849338Z INFO solana_bench_tps::bench] Average TPS: 10501.246

Highest TPS: 24459.74 sampling period 1s max transactions: 337255 clients: 1 drop rate: 0.22

[2023-08-15T06:58:42.274724913Z INFO solana_bench_tps::bench] Average TPS: 10866.384

Highest TPS: 17244.26 sampling period 1s max transactions: 354344 clients: 1 drop rate: 0.00

[2023-08-15T07:02:54.095735232Z INFO solana_bench_tps::bench] Average TPS: 11417.363

Without change (rpc-client):

[2023-08-15T07:11:16.036991107Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 23715.32 | 323173
[2023-08-15T07:11:16.037000744Z INFO solana_bench_tps::bench]
Average max TPS: 23715.32, 0 nodes had 0 TPS
[2023-08-15T07:11:16.037005856Z INFO solana_bench_tps::bench]
Highest TPS: 23715.32 sampling period 1s max transactions: 323173 clients: 1 drop rate: 0.22
[2023-08-15T07:11:16.037011812Z INFO solana_bench_tps::bench] Average TPS: 10751.665

Highest TPS: 23993.92 sampling period 1s max transactions: 308974 clients: 1 drop rate: 0.27

[2023-08-15T07:13:42.854387818Z INFO solana_bench_tps::bench] Average TPS: 10242.592

Highest TPS: 27594.81 sampling period 1s max transactions: 370603 clients: 1 drop rate: 0.14

[2023-08-15T07:14:52.595901036Z INFO solana_bench_tps::bench] Average TPS: 11942.176

Highest TPS: 23327.57 sampling period 1s max transactions: 350066 clients: 1 drop rate: 0.16

[2023-08-15T07:17:19.556543721Z INFO solana_bench_tps::bench] Average TPS: 11271.985

Highest TPS: 21523.34 sampling period 1s max transactions: 307855 clients: 1 drop rate: 0.24

[2023-08-15T07:18:36.675589529Z INFO solana_bench_tps::bench] Average TPS: 10152.27
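
As a rough way to compare the two sets of runs, the sketch below averages the "Average TPS" figures reported above for each configuration. The helper `mean` and the constant names are introduced here for illustration; with only five runs per configuration, any difference should be read as indicative rather than statistically conclusive.

```rust
/// Arithmetic mean of a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    assert!(!xs.is_empty(), "need at least one sample");
    xs.iter().sum::<f64>() / xs.len() as f64
}

// "Average TPS" values copied from the five runs with the change.
const WITH_CHANGE: [f64; 5] = [11562.491, 11714.341, 10501.246, 10866.384, 11417.363];

// "Average TPS" values copied from the five runs without the change.
const WITHOUT_CHANGE: [f64; 5] = [10751.665, 10242.592, 11942.176, 11271.985, 10152.27];

fn main() {
    let with = mean(&WITH_CHANGE);
    let without = mean(&WITHOUT_CHANGE);
    println!("mean Average TPS with change:    {:.1}", with);
    println!("mean Average TPS without change: {:.1}", without);
    println!(
        "relative difference: {:+.1}%",
        (with - without) / without * 100.0
    );
}
```

The means come out at roughly 11212 TPS with the change and 10872 TPS without it, a difference of about 3% that sits within the run-to-run spread, which matches the observation that there is no material difference.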

@lijunwangs lijunwangs force-pushed the investigate_bench_tps_perf_degradation branch from e7298ea to dc4831e Compare August 15, 2023 16:34
@lijunwangs lijunwangs merged commit b44c9bc into solana-labs:master Aug 15, 2023
13 checks passed