
feat(perf): continuously measure on single conn (iperf-style) #276

Merged: 32 commits merged from perf-exit-slow-start into master on Oct 25, 2023

Conversation

@mxinden (Member) commented Aug 24, 2023:

Our current throughput tests open a connection, open a stream, up- or download 100 MB, and close the connection. 100 MB is not enough on the given path (60 ms, ~5 Gbit/s) to exit the congestion controller's slow start. See #261 for details.

Instead of downloading 100 MB multiple times, each on a new connection, establish a single connection and continuously measure the throughput for a fixed duration (20 s).

Closes #261

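For illustration, here is a minimal sketch of the iperf-style approach described above: establish one connection and one stream, keep writing for a fixed duration, and derive throughput from the bytes moved in that window. The function names, buffer size, and the use of io.Discard as a stand-in for the stream are assumptions for this sketch, not the PR's actual code.

```go
package main

import (
	"fmt"
	"io"
	"time"
)

// uploadFor writes to a single stream for the given duration and returns the
// total number of bytes written. Throughput is then bytes/duration, measured
// on one long-lived connection instead of many short 100 MB transfers.
func uploadFor(w io.Writer, d time.Duration) (uint64, error) {
	buf := make([]byte, 64*1024)
	var total uint64
	start := time.Now()
	for time.Since(start) < d {
		n, err := w.Write(buf)
		if err != nil {
			return total, err
		}
		total += uint64(n)
	}
	return total, nil
}

func main() {
	// In the real benchmark this would be a stream on a freshly established
	// libp2p / QUIC / TCP+TLS connection; io.Discard stands in here.
	d := 20 * time.Second
	total, err := uploadFor(io.Discard, d)
	if err != nil {
		fmt.Println("upload failed:", err)
		return
	}
	fmt.Printf("uploaded %d bytes in %s (%.2f Gbit/s)\n",
		total, d, float64(total)*8/d.Seconds()/1e9)
}
```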
@marten-seemann (Contributor) left a comment:


Should we do 3 iterations at 20 seconds? Slow start won't take more than 1-2 s, so this should give us plenty of time to converge, and it would show situations where the congestion controller runs into a state that it takes a long time to recover from (sudden cross-traffic).

Comment on lines 63 to 75
    // TODO
    jsonB, err := json.Marshal(Result{
        TimeSeconds: time.Since(r.LastReportTime).Seconds(),
        UploadBytes: uint(r.lastReportRead),
        Type:        "intermediary",
    })
    if err != nil {
        log.Fatalf("failed to marshal perf result: %s", err)
    }
    fmt.Println(string(jsonB))

    r.LastReportTime = time.Now()
    r.lastReportRead = 0

Only do a single call to time.Now(), so we don't lose any bytes sent between the two calls:

Suggested change, before:

    // TODO
    jsonB, err := json.Marshal(Result{
        TimeSeconds: time.Since(r.LastReportTime).Seconds(),
        UploadBytes: uint(r.lastReportRead),
        Type:        "intermediary",
    })
    if err != nil {
        log.Fatalf("failed to marshal perf result: %s", err)
    }
    fmt.Println(string(jsonB))
    r.LastReportTime = time.Now()
    r.lastReportRead = 0

After:

    now := time.Now()
    // TODO
    jsonB, err := json.Marshal(Result{
        TimeSeconds: now.Sub(r.LastReportTime).Seconds(),
        UploadBytes: uint(r.lastReportRead),
        Type:        "intermediary",
    })
    if err != nil {
        log.Fatalf("failed to marshal perf result: %s", err)
    }
    fmt.Println(string(jsonB))
    r.LastReportTime = now
    r.lastReportRead = 0
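For context, here is a minimal, self-contained sketch of a reporter that applies the single time.Now() pattern from the suggestion; the reporter struct, its field names, the JSON tags, and the driver in main are illustrative assumptions, not the PR's exact code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"
)

// Result mirrors the per-interval JSON report from the snippet above
// (JSON tags are assumed here).
type Result struct {
	TimeSeconds float64 `json:"timeSeconds"`
	UploadBytes uint    `json:"uploadBytes"`
	Type        string  `json:"type"`
}

// reporter tracks bytes transferred since the last intermediary report.
type reporter struct {
	LastReportTime time.Time
	lastReportRead uint
}

// report emits one intermediary result. time.Now() is read exactly once, so
// bytes written between measuring the elapsed time and resetting the counter
// cannot be attributed to the wrong interval.
func (r *reporter) report() {
	now := time.Now()
	jsonB, err := json.Marshal(Result{
		TimeSeconds: now.Sub(r.LastReportTime).Seconds(),
		UploadBytes: r.lastReportRead,
		Type:        "intermediary",
	})
	if err != nil {
		log.Fatalf("failed to marshal perf result: %s", err)
	}
	fmt.Println(string(jsonB))
	r.LastReportTime = now
	r.lastReportRead = 0
}

func main() {
	r := &reporter{LastReportTime: time.Now()}
	for i := 0; i < 3; i++ {
		time.Sleep(time.Second)
		r.lastReportRead += 1 << 20 // pretend 1 MiB was transferred
		r.report()
	}
}
```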

@marten-seemann (Contributor) commented:

> Preliminary results show fruitful for https but not rust-libp2p/quic.

Are these results available anywhere?

@mxinden (Member, Author) commented Aug 25, 2023:

> > Preliminary results show fruitful for https but not rust-libp2p/quic.
>
> Are these results available anywhere?

Not yet. Still in a very work-in-progress state.

@mxinden (Member, Author) commented Aug 25, 2023:

Reaching 4.5 Gbit/s with https and >5 Gbit/s with rust-libp2p. Still testing, though this is looking promising.

@mxinden (Member, Author) commented Sep 4, 2023:

The iperf throughput mismatch was due to Nagle's algorithm. It is now disabled via -N; see the previous commit. (The https implementation already has this behavior by default, see golang/go#57530.)

[attached plot]

https://observablehq.com/d/682dcea9fe2505c4?branch=perf-exit-slow-start#branch

I still need to investigate more for the other measurements (*-libp2p and quic-go). Will try with a fixed MTU of 1500 next.

//CC @marten-seemann
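For reference, disabling Nagle's algorithm on a TCP connection in Go looks like the sketch below (the equivalent of iperf3's -N flag); note that Go's net package already enables TCP_NODELAY on new TCP connections by default, which is why the https implementation needed no change. The dialed address is a placeholder.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	conn, err := net.Dial("tcp", "example.com:443") // placeholder address
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Disable Nagle's algorithm (set TCP_NODELAY), the equivalent of iperf3 -N.
	// Go already defaults to no-delay, so this is a no-op unless it was
	// explicitly turned off earlier.
	if tcpConn, ok := conn.(*net.TCPConn); ok {
		if err := tcpConn.SetNoDelay(true); err != nil {
			panic(err)
		}
	}
	fmt.Println("Nagle's algorithm disabled on connection to", conn.RemoteAddr())
}
```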

@marten-seemann (Contributor) commented Sep 4, 2023:

> iperf throughput mismatch was due to Nagle's algorithm

Interesting! It's great to see iperf and HTTPS achieving roughly similar results (at least in the limit). This means that our setup is getting more trustworthy!

Looking at the graphs, why are some measurements drawn as boxes and some as points? Why do some have error bars and others don't? The spread seems pretty high, do we need more iterations?

I wouldn't be surprised if quic-go maxed out somewhere around 2 Gbps. At some point, your transfer becomes CPU-limited, depending on the number of kernel offloads that your QUIC stack uses (and that's not the thing we want to benchmark here). That said, I just updated quic-go/perf to quic-go v0.38.1 (quic-go/perf#16, I'll merge the PR once GHA is not broken anymore...), which uses GSO by default. Might be worth rebasing your branch to see if this changes anything.

In go-libp2p, Yamux uses a 16 MB receive window, which should limit us to roughly 2 Gbps (minus some muxer overhead). It's interesting to see that we're achieving roughly half of that. Could be a coincidence, or point to a bug in our flow control autotuning. I'd be happy to debug this using the current setup (assuming I can still run it manually as I could with the version on master), please let me know.

QUIC uses a 10 MB window, which limits the bandwidth to 1.25 Gbps. That means we're not quite at the optimum, but pretty close. Would it be helpful for you if we prioritize resolving libp2p/go-libp2p#2290? Alternatively, we could also just have a go-libp2p branch that bumps that value, so we can see if that's actually the root of the problem.
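As a sanity check on these window numbers: a flow-control-limited transfer can move at most one receive window per round trip, so the ceiling is simply window / RTT. A small sketch of that arithmetic, using the 65 ms RTT and the window sizes mentioned above:

```go
package main

import "fmt"

// windowLimit returns the maximum throughput in Gbit/s when the sender is
// limited to one receive window per round trip.
func windowLimit(windowBytes, rttSeconds float64) float64 {
	return windowBytes * 8 / rttSeconds / 1e9
}

func main() {
	const rtt = 0.065 // ~65 ms path RTT
	fmt.Printf("Yamux, 16 MB window: %.2f Gbit/s\n", windowLimit(16*1024*1024, rtt)) // roughly 2.1
	fmt.Printf("QUIC,  10 MB window: %.2f Gbit/s\n", windowLimit(10*1024*1024, rtt)) // roughly 1.3
}
```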

> Will try with a fixed MTU of 1500 next.

Does AWS allow larger MTUs on their backbone? That would indeed give TCP an unfair advantage over at least quic-go. Have you verified that using tcpdump / Wireshark?

@marten-seemann (Contributor) commented:

Here are some interesting results from running the HTTPS test and analyzing the tcpdump. The congestion controller used for this test is Cubic. First interesting result: importing and processing an 8 GB pcap into Wireshark takes a pretty long time, O(30 min) ;)

Here's the RTT distribution:

[screenshot: RTT distribution]

There are definitely some queues building up in the network, but it's not too terrible.

Here's the sequence plot (ignore the wrapping of the packet number, obviously), showing the time when packet loss occurred:
[screenshot: sequence plot showing loss events]

And here's the throughput:
[screenshot: throughput over time]

Obviously, we're very far from reaching a steady state.

Here's some back-of-the-envelope math to calculate the recovery time (i.e. the time it takes to ramp the congestion window back up to its original size after a loss), assuming a BDP of roughly 40 MB (at 5 Gbit/s and 65 ms RTT); see the sketch after this list:

  • On Reno, a packet loss halves the cwnd, and every round trip without a loss event increases it by one MSS. Thus the recovery time is 20 MB / 1400 bytes ≈ 14,000 RTTs, which is about 15 minutes (!).
  • On Cubic, it's harder to estimate. The L4S Prague paper claims that at 100 Mbit/s the recovery time is 250 RTTs, and that it doubles for every 8x increase in bandwidth. That's roughly 1,000 RTTs, which is about 1 minute.

In the sequence plot above, we see packet loss happening 10x as frequently as this calculation suggests. This might be due to the shallower buffer, but I don't know precisely how the recovery time scales with the buffer size.
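To make the back-of-the-envelope numbers above easy to reproduce, here is a small sketch of the same arithmetic (65 ms RTT, ~1400 byte MSS, and the ~40 MB BDP from above; the Cubic figure just applies the quoted "doubles per 8x bandwidth" rule of thumb, so treat both as rough estimates):

```go
package main

import "fmt"

func main() {
	const (
		rtt = 0.065              // seconds, path RTT
		mss = 1400.0             // bytes, approximate MSS
		bdp = 40 * 1024 * 1024.0 // ~40 MB BDP at 5 Gbit/s and 65 ms
	)

	// Reno: a loss halves the cwnd, which then grows by one MSS per RTT, so
	// recovering the lost half of the BDP takes (bdp/2)/mss round trips.
	renoRTTs := (bdp / 2) / mss
	fmt.Printf("Reno:  %.0f RTTs ≈ %.1f minutes\n", renoRTTs, renoRTTs*rtt/60)

	// Cubic, rule of thumb quoted above: ~250 RTTs at 100 Mbit/s, doubling for
	// every 8x increase in bandwidth. 100 Mbit/s -> 5 Gbit/s is a 50x increase,
	// i.e. roughly two 8x steps, so roughly 250 * 4 RTTs.
	cubicRTTs := 250.0 * 4
	fmt.Printf("Cubic: %.0f RTTs ≈ %.1f minutes\n", cubicRTTs, cubicRTTs*rtt/60)
}
```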


What does this mean for our perf setup? At the BDP that we chose for our test, we're running into limitations imposed by the congestion controllers:

  • With a recovery time of 15 minutes, Reno has no chance against Cubic whatsoever, yet this is what RFC 9002 recommends for QUIC implementations and what major players have deployed in their QUIC stacks.
  • Even Cubic's recovery time is pretty long. At a sampling frequency of 1 s, we will pick up the saw-tooth pattern inherent to the congestion controller, and we will inevitably see a wide spread of measurement results.

@mxinden (Member, Author) commented Sep 15, 2023:

> > Will try with a fixed MTU of 1500 next.
>
> Does AWS allow larger MTUs on their backbone? That would indeed give TCP an unfair advantage over at least quic-go. Have you verified that using tcpdump / Wireshark?

Turns out, it does not:

[ec2-user@ip-]$ ping -M do -s 1472 -c 4 x.x.x.x
PING  () 1472(1500) bytes of data.
1480 bytes from : icmp_seq=1 ttl=109 time=63.4 ms
1480 bytes from : icmp_seq=2 ttl=109 time=63.4 ms
1480 bytes from : icmp_seq=3 ttl=109 time=63.4 ms
1480 bytes from : icmp_seq=4 ttl=109 time=63.4 ms

---  ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 63.384/63.408/63.441/0.024 ms
[ec2-user@ip-]$ ping -M do -s 1500 -c 4 x.x.x.x
PING  () 1500(1528) bytes of data.
From  icmp_seq=1 Frag needed and DF set (mtu = 1500)
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500

---  ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3086ms

perf/README.md (review thread resolved)
@mxinden (Member, Author) commented Sep 20, 2023:

> Looking at the graphs, why are some measurements drawn as boxes and some as points?

The box visualizes Q1 to Q3, with Q2 (the median) drawn as a line within the box. The lines extending from the box are the whiskers, representing Q0 (minimum) and Q4 (maximum). The dots represent outliers.

https://en.wikipedia.org/wiki/Box_plot has a good explanation for each of these.

> Why do some have error bars and others don't?

I am not sure what you are referring to with "error bars", @marten-seemann.

> The spread seems pretty high, do we need more iterations?

I decreased each measurement duration to 20 seconds and increased the iterations per implementation and transport to 10. I triggered a new CI run to update our benchmark-results.json.

https://github.com/libp2p/test-plans/actions/runs/6250828757/job/16970551159

@mxinden (Member, Author) commented Sep 20, 2023:

> I wouldn't be surprised if quic-go maxed out somewhere around 2 Gbps. At some point, your transfer becomes CPU-limited, depending on the number of kernel offloads that your QUIC stack uses (and that's not the thing we want to benchmark here). That said, I just updated quic-go/perf to quic-go v0.38.1 (quic-go/perf#16, I'll merge the PR once GHA is not broken anymore...), which uses GSO by default. Might be worth rebasing your branch to see if this changes anything.

👍 Note that I merged current master into quic-go/perf#17 and updated the reference here. Thus with the next update to benchmark-results.json we will see the impact of GSO.

@mxinden (Member, Author) commented Sep 20, 2023:

> In go-libp2p, Yamux uses a 16 MB receive window, which should limit us to roughly 2 Gbps (minus some muxer overhead). It's interesting to see that we're achieving roughly half of that. Could be a coincidence, or point to a bug in our flow control autotuning. I'd be happy to debug this using the current setup (assuming I can still run it manually as I could with the version on master), please let me know.

Indeed surprising. You can still run it manually. Please go ahead. Thank you @marten-seemann.

@github-actions (bot) commented Sep 20, 2023.

@mxinden (Member, Author) commented Sep 21, 2023:

I have updated the forked dashboard to the latest data format:

https://observablehq.com/d/682dcea9fe2505c4?branch=27d07a6f47c2bc1a9c9d9a9f6626b536248284f5

@mxinden (Member, Author) left a comment:


@marten-seemann @sukunrt can either of you give the perf/impl/https and perf/impl/go-libp2p changes a review?

Comment on lines 38 to 32

Before:

    {
      id: "v0.46",
      implementation: "js-libp2p",
      transportStacks: ["tcp"]
    }

After:

    // {
    //   id: "v0.46",
    //   implementation: "js-libp2p",
    //   transportStacks: ["tcp"]
    // }
@mxinden (Member, Author):

Addressed in libp2p/js-libp2p#2067.

perf/runner/src/versions.ts (outdated review thread, resolved)
@marten-seemann (Contributor) left a comment:


I reviewed the go-libp2p implementation.

Outdated review threads (resolved):
  • perf/impl/go-libp2p/v0.29/main.go
  • perf/impl/go-libp2p/v0.29/perf.go (3 threads)
@mxinden marked this pull request as ready for review on October 19, 2023 at 09:38.
@github-actions (bot) commented.

@mxinden (Member, Author) commented Oct 23, 2023:

Unless there are any objections, I plan to merge here once libp2p/rust-libp2p#4382 is merged.

@mxinden (Member, Author) commented Oct 25, 2023:

The last commit removes js-libp2p. Once libp2p/js-libp2p#2067 is merged, we can re-introduce it here.

@mxinden merged commit 0a8dbab into master on Oct 25, 2023, and deleted the perf-exit-slow-start branch on October 25, 2023 at 11:24.
Successfully merging this pull request may close these issues.

perf: throughput test (TCP, QUIC, libp2p, but not iperf) never exits slow start