feat(udp): use sendmmsg and recvmmsg #1741
Conversation
Force-pushed from 1995639 to c06253f.
Benchmark results: Performance differences relative to 3e53709.
Client/server transfer results: Transfer of 134217728 bytes over loopback.
Codecov Report: Attention: Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## main #1741 +/- ##
==========================================
- Coverage 93.26% 93.22% -0.04%
==========================================
Files 110 110
Lines 35669 35680 +11
==========================================
- Hits 33266 33262 -4
- Misses 2403 2418 +15
☔ View full report in Codecov by Sentry.
Force-pushed from 08b4f32 to a2f2d82.
Status update: Thus far I am having a hard time finding clear signals that this is a performance improvement. Not saying it isn't, just missing the proof. The duration of client/server transfers on the benchmark machine varies significantly (30s - 45s) across CI runs. Same for I/O syscall samples in perf traces. Maybe worth hooking into something like criterion to get stable metrics and filter out outliers. I will investigate further. Suggestions welcome.
Remember that pacing interferes here. You might want to log how large the batches are that you are actually doing on TX/RX. I think @KershawChang had a feature to turn off pacing, which should increase those batches. As discussed elsewhere, we likely have other bugs/bottlenecks in the code that limit the impact this change will have, too.
Thank you for the input.
Will do.
It is actually disabled by default, though by mistake. Thanks for the hint! See #1753.
Force-pushed from a8a1b80 to 4427469.
Read up to `BATCH_SIZE = 32` datagrams with a single `recvmmsg` syscall. Previously `neqo_bin::udp::Socket::recv` would use `recvmmsg`, but provide only a single buffer to write into, effectively using `recvmsg` instead of `recvmmsg`. With this commit `Socket::recv` provides `BATCH_SIZE` buffers on each `recvmmsg` syscall, thus reading more than one datagram at a time when available.
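For illustration, a minimal sketch (Linux-only, via the raw `libc` crate bindings) of reading up to a batch of datagrams with one `recvmmsg(2)` call. This is not the `neqo_bin::udp::Socket` implementation; the function name, constants, error handling and the absence of cmsg/GRO handling are simplifications for this sketch.

```rust
use std::net::UdpSocket;
use std::os::fd::AsRawFd;

const BATCH_SIZE: usize = 32;
const MAX_DATAGRAM: usize = 65535;

/// Sketch: read up to BATCH_SIZE datagrams from `socket` with one recvmmsg(2) call.
fn recv_batch(socket: &UdpSocket) -> std::io::Result<Vec<Vec<u8>>> {
    // One receive buffer and one iovec per slot in the batch.
    let mut bufs = vec![[0u8; MAX_DATAGRAM]; BATCH_SIZE];
    let mut iovecs: Vec<libc::iovec> = bufs
        .iter_mut()
        .map(|buf| libc::iovec {
            iov_base: buf.as_mut_ptr() as *mut libc::c_void,
            iov_len: buf.len(),
        })
        .collect();
    // One mmsghdr per slot; the kernel fills in msg_len for each received datagram.
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // A single syscall receives up to BATCH_SIZE datagrams.
    let received = unsafe {
        libc::recvmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            BATCH_SIZE as libc::c_uint,
            0,
            std::ptr::null_mut(),
        )
    };
    if received < 0 {
        return Err(std::io::Error::last_os_error());
    }

    // Copy out only the bytes the kernel reported for each datagram.
    Ok(msgs[..received as usize]
        .iter()
        .zip(&bufs)
        .map(|(msg, buf)| buf[..msg.msg_len as usize].to_vec())
        .collect())
}
```

The point is that one syscall can return up to `BATCH_SIZE` datagrams, amortizing the syscall overhead across the whole batch.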
Small update: Solely using I hacked together a
🎉 40% improvement in throughput. First (?) time breaking 1 Gbit/s on CI.
Currently investigating why performance regresses here. It does not seem to be tied to the
`<SendMessage as SendStream>::send_data` attempts to send a slice of data down into the QUIC layer, more specifically `neqo_transport::Connection::stream_send_atomic`. While it attempts to send any existing buffered data at the http3 layer first, it does not itself fill the http3 layer buffer; it only sends data if the lower QUIC layer has capacity, i.e. only if it can send the data down to the QUIC layer right away. https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-http3/src/send_message.rs#L168-L221

`<SendMessage as SendStream>::send_data` is called via `Http3ServerHandler::send_data`. The wrapper first marks the stream as `stream_has_pending_data`, marks itself as `needs_processing` and then calls down into `<SendMessage as SendStream>::send_data`. https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-http3/src/connection_server.rs#L51-L74

Thus the wrapper always marks the stream as `stream_has_pending_data`, even though `<SendMessage as SendStream>::send_data` never writes into the http3 layer buffer, so the stream might not actually have pending data.

Why is this problematic?

1. Say the buffer of the `BufferedStream` of `SendMessage` is empty.
2. Say the user attempts to write data via `Http3ServerHandler::send_data`. Despite not filling the http3 layer buffer, the stream is marked as `stream_has_pending_data`.
3. Say the user next calls `Http3Server::process`, which calls `Http3Server::process_http3`, which calls `Http3ServerHandler::process_http3`, which calls `Http3Connection::process_sending`, which calls `Http3Connection::send_non_control_streams`. `Http3Connection::send_non_control_streams` attempts to flush the http3 layer buffers of all streams marked via `stream_has_pending_data`, e.g. the stream from step (2). It therefore calls `<SendMessage as SendStream>::send` (note `send`, not the previous `send_data`). This function attempts to flush the stream's http3 layer buffer, and in the case where that buffer is empty it enqueues a `DataWritable` event for the user. Given that the buffer of our stream is empty (see (1)), such a `DataWritable` event is always emitted. https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-http3/src/send_message.rs#L236-L264

The user, on receiving the `DataWritable` event, attempts to write again via `Http3ServerHandler::send_data`, taking us back to step (2) and thus closing the busy loop.

How to fix? This commit adds an additional check to the `has_pending_data` function to ensure the stream indeed has pending data. This breaks the above busy loop.
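To illustrate the busy loop and the fix, here is a deliberately simplified, hypothetical model (the real `SendMessage`/`BufferedStream` types in neqo-http3 differ in detail): pending data is only reported when the http3 layer buffer is actually non-empty, so an empty stream is neither re-flushed nor a source of spurious `DataWritable` events.

```rust
// Simplified, hypothetical model of the busy loop and its fix; the real types
// live in neqo-http3 and differ in detail.
struct BufferedStream {
    buf: Vec<u8>,
}

struct SendMessage {
    stream: BufferedStream,
    // Set unconditionally by the Http3ServerHandler::send_data wrapper.
    marked_as_pending: bool,
}

impl SendMessage {
    /// Before the fix: trust the marker alone, even when send_data wrote
    /// nothing into the http3-layer buffer. An empty but "pending" stream is
    /// flushed, emits DataWritable, is written to again (still nothing
    /// buffered), and the loop repeats.
    fn has_pending_data_before_fix(&self) -> bool {
        self.marked_as_pending
    }

    /// After the fix: additionally require a non-empty buffer, so an empty
    /// stream is not revisited and no spurious DataWritable event is emitted.
    fn has_pending_data_after_fix(&self) -> bool {
        self.marked_as_pending && !self.stream.buf.is_empty()
    }
}
```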
This reverts commit 5d314b6.
`neqo_transport::Connection` offers 3 process methods:

- `process`
- `process_output`
- `process_input`

where `process` is a wrapper around `process_input` and `process_output`, calling both in sequence. https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L1099-L1107

Previously `neqo-client` used `process` only, thus continuously interleaving output and input. Say `neqo-client` has multiple datagrams buffered through a GRO read; it could potentially have to do a write in between each `process` call, as each call to `process` with an input datagram might return an output datagram to be written.

With this commit `neqo-client` uses `process_output` and `process_input` directly, thus reducing interleaving on batch reads (GRO and, in the future, recvmmsg) and on future batch writes (GSO and sendmmsg).

Extracted from mozilla#1741.
`neqo_transport::Connection` offers 4 process methods:

- `process`
- `process_output`
- `process_input`
- `process_multiple_input`

where `process` is a wrapper around `process_input` and `process_output`, calling both in sequence (https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L1099-L1107), and `process_input` delegates to `process_multiple_input` (https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L979-L1000).

Previously `neqo-client` used `process` only, thus continuously interleaving output and input. Say `neqo-client` has multiple datagrams buffered through a GRO read; it could potentially have to do a write in between each `process` call, as each call to `process` with an input datagram might return an output datagram to be written.

With this commit `neqo-client` uses `process_output` and `process_multiple_input` directly, thus reducing interleaving on batch reads (GRO and, in the future, recvmmsg) and on future batch writes (GSO and sendmmsg). By using `process_multiple_input` instead of `process` or `process_input`, auxiliary logic like `self.cleanup_closed_streams` only has to run once per input datagram batch rather than once per input datagram.

Extracted from mozilla#1741.
* refactor(client): use process_output and process_multiple_input

  `neqo_transport::Connection` offers 4 process methods: `process`, `process_output`, `process_input` and `process_multiple_input`, where `process` is a wrapper around `process_input` and `process_output`, calling both in sequence (https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L1099-L1107), and `process_input` delegates to `process_multiple_input` (https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L979-L1000). Previously `neqo-client` used `process` only, thus continuously interleaving output and input. Say `neqo-client` has multiple datagrams buffered through a GRO read; it could potentially have to do a write in between each `process` call, as each call to `process` with an input datagram might return an output datagram to be written. With this commit `neqo-client` uses `process_output` and `process_multiple_input` directly, thus reducing interleaving on batch reads (GRO and, in the future, recvmmsg) and on future batch writes (GSO and sendmmsg). By using `process_multiple_input` instead of `process` or `process_input`, auxiliary logic like `self.cleanup_closed_streams` only has to run once per input datagram batch rather than once per input datagram. Extracted from #1741.

* process_output before handle

* process_output after each input batch
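A schematic sketch of the resulting client loop, using hypothetical stub types in place of `neqo_transport::Connection`, `Datagram` and `Output` (the real signatures differ in detail): feed a whole input batch with one `process_multiple_input` call, then drain `process_output` until it has nothing left to send.

```rust
use std::time::Instant;

// Stub types standing in for the neqo_transport equivalents.
struct Datagram(Vec<u8>);

enum Output {
    Datagram(Datagram),
    Callback(std::time::Duration),
    None,
}

struct Connection;

impl Connection {
    // Stand-ins for the real process_output / process_multiple_input.
    fn process_output(&mut self, _now: Instant) -> Output {
        Output::None
    }
    fn process_multiple_input<'a, I: IntoIterator<Item = &'a Datagram>>(
        &mut self,
        _dgrams: I,
        _now: Instant,
    ) {
    }
}

/// Handle one batch of datagrams (e.g. a GRO or recvmmsg read): feed the whole
/// batch with a single input call, then drain the output side. Per-batch
/// bookkeeping runs once per batch rather than once per datagram, and no write
/// is interleaved between the individual reads of a batch.
fn handle_batch(conn: &mut Connection, batch: &[Datagram], now: Instant) -> Vec<Datagram> {
    conn.process_multiple_input(batch.iter(), now);

    let mut to_send = Vec::new();
    loop {
        match conn.process_output(now) {
            Output::Datagram(d) => to_send.push(d),
            // Nothing to send right now; a timer or new input triggers the next call.
            Output::Callback(_) | Output::None => break,
        }
    }
    to_send
}
```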
Always use up the send space on the QUIC layer to ensure a `ConnectionEvent::SendStreamWritable` event is received once more send space becomes available. See mozilla#1819 for details. This commit implements option 2. Fixes mozilla#1819.
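The underlying rule can be illustrated with a small hypothetical model (not the neqo-transport API): a writable wake-up is only emitted for a writer that previously used up the available send space, so leaving send space unused risks never being woken.

```rust
// Hypothetical model of the "use up all send space" rule; not the
// neqo-transport API.
struct QuicSendStream {
    send_space: usize,     // bytes the QUIC layer can currently accept
    space_exhausted: bool, // set once a write was cut short by flow control
}

impl QuicSendStream {
    /// Accepts at most `send_space` bytes, like a send on the QUIC layer.
    fn send(&mut self, data: &[u8]) -> usize {
        let n = data.len().min(self.send_space);
        self.send_space -= n;
        if n < data.len() {
            // Only a writer that actually hit the limit is remembered, so only
            // such a writer is later woken with a writable event.
            self.space_exhausted = true;
        }
        n
    }

    /// Called when flow-control credit arrives. The wake-up fires only if the
    /// send space had been used up before; a writer that left space unused
    /// never sees it and can stall.
    fn add_send_space(&mut self, bytes: usize) -> Option<&'static str> {
        self.send_space += bytes;
        if self.space_exhausted {
            self.space_exhausted = false;
            Some("SendStreamWritable")
        } else {
            None
        }
    }
}
```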
@mxinden is this OBE now?
Sorry for the delay here. I plan to extract parts of this pull request for #1693. That said, no need for a draft pull request, I can just cherry-pick off of my branch.
Write and read up to `BATCH_SIZE` datagrams with a single `sendmmsg` and `recvmmsg` syscall.

Previously `neqo_bin::udp::Socket::send` would use `sendmmsg`, but provide only a single buffer, effectively using `sendmsg` instead of `sendmmsg`. Same with `Socket::recv`.

With this commit `Socket::send` provides `BATCH_SIZE` buffers on each `sendmmsg` syscall, thus writing more than one datagram at a time when available. Same with `Socket::recv`.

Part of #1693.
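For the send side, a mirror-image sketch of the receive example above (again Linux-only, via the raw `libc` bindings, with illustrative names and no GSO/cmsg or partial-batch handling): hand the kernel up to a batch of datagrams with one `sendmmsg(2)` call.

```rust
use std::net::UdpSocket;
use std::os::fd::AsRawFd;

const BATCH_SIZE: usize = 32;

/// Sketch: send up to BATCH_SIZE datagrams on a connected UDP socket with one
/// sendmmsg(2) call. Returns how many datagrams the kernel accepted.
fn send_batch(socket: &UdpSocket, datagrams: &[Vec<u8>]) -> std::io::Result<usize> {
    let batch = &datagrams[..datagrams.len().min(BATCH_SIZE)];

    // One iovec and one mmsghdr per datagram in the batch.
    let mut iovecs: Vec<libc::iovec> = batch
        .iter()
        .map(|d| libc::iovec {
            iov_base: d.as_ptr() as *mut libc::c_void,
            iov_len: d.len(),
        })
        .collect();
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // A single syscall hands the whole batch to the kernel.
    let sent = unsafe {
        libc::sendmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            msgs.len() as libc::c_uint,
            0,
        )
    };
    if sent < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(sent as usize)
}
```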