feat(udp): use sendmmsg and recvmmsg #1741
Conversation
Force-pushed from 1995639 to c06253f.
Benchmark results: Performance differences relative to 3e53709.
Client/server transfer results: Transfer of 134217728 bytes over loopback.
Codecov Report: Attention: Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## main #1741 +/- ##
==========================================
- Coverage 93.26% 93.22% -0.04%
==========================================
Files 110 110
Lines 35669 35680 +11
==========================================
- Hits 33266 33262 -4
- Misses 2403 2418 +15
☔ View full report in Codecov by Sentry.
Force-pushed from 08b4f32 to a2f2d82.
Status update: Thus far I am having a hard time finding clear signals that this is a performance improvement. Not saying it isn't, just missing the proof. The duration of client/server transfers on the benchmark machine varies significantly (30s - 45s) across CI runs. Same for I/O syscall samples in perf traces. Maybe worth hooking into something like criterion to get stable metrics and filter out outliers. I will investigate further. Suggestions welcome.
Remember that pacing interferes here. You might want to log how large the batches are that you are actually doing on TX/RX. I think @KershawChang had a feature to turn off pacing, which should increase those batches. As discussed elsewhere, we likely have other bugs/bottlenecks in the code that limit the impact this change will have, too.
Thank you for the input.
Will do.
It is actually disabled by default, though by mistake. Thanks for the hint! See #1753.
Force-pushed from a8a1b80 to 4427469.
Read up to `BATCH_SIZE = 32` datagrams with a single `recvmmsg` syscall. Previously `neqo_bin::udp::Socket::recv` would use `recvmmsg`, but provide only a single buffer to write into, effectively using `recvmsg` instead of `recvmmsg`. With this commit `Socket::recv` provides `BATCH_SIZE` buffers on each `recvmmsg` syscall, thus reading more than one datagram at a time when available.
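For illustration, a minimal sketch (Linux-only, via the raw `libc` crate bindings) of reading up to a batch of datagrams with one `recvmmsg(2)` call. This is not the `neqo_bin::udp::Socket` implementation; the function name, constants, error handling and the absence of cmsg/GRO handling are simplifications for this sketch.

```rust
use std::net::UdpSocket;
use std::os::fd::AsRawFd;

const BATCH_SIZE: usize = 32;
const MAX_DATAGRAM: usize = 65535;

/// Sketch: read up to BATCH_SIZE datagrams from `socket` with one recvmmsg(2) call.
fn recv_batch(socket: &UdpSocket) -> std::io::Result<Vec<Vec<u8>>> {
    // One receive buffer and one iovec per slot in the batch.
    let mut bufs = vec![[0u8; MAX_DATAGRAM]; BATCH_SIZE];
    let mut iovecs: Vec<libc::iovec> = bufs
        .iter_mut()
        .map(|buf| libc::iovec {
            iov_base: buf.as_mut_ptr() as *mut libc::c_void,
            iov_len: buf.len(),
        })
        .collect();
    // One mmsghdr per slot; the kernel fills in msg_len for each received datagram.
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // A single syscall receives up to BATCH_SIZE datagrams.
    let received = unsafe {
        libc::recvmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            BATCH_SIZE as libc::c_uint,
            0,
            std::ptr::null_mut(),
        )
    };
    if received < 0 {
        return Err(std::io::Error::last_os_error());
    }

    // Copy out only the bytes the kernel reported for each datagram.
    Ok(msgs[..received as usize]
        .iter()
        .zip(&bufs)
        .map(|(msg, buf)| buf[..msg.msg_len as usize].to_vec())
        .collect())
}
```

The point is that one syscall can return up to `BATCH_SIZE` datagrams, amortizing the syscall overhead across the whole batch.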
Small update: Solely using I hacked together a
🎉 40% improvement in throughput. First (?) time breaking 1 Gbit/s on CI.
Currently investigating why performance regresses here. It does not seem to be tied to the
`<SendMessage as SendStream>::send_data` attempts to send a slice of data down into the QUIC layer, more specifically `neqo_transport::Connection::stream_send_atomic`. While it attempts to send any existing buffered data at the http3 layer first, it does not itself fill the http3 layer buffer; it only sends data if the lower QUIC layer has capacity, i.e. only if it can send the data down to the QUIC layer right away. https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-http3/src/send_message.rs#L168-L221

`<SendMessage as SendStream>::send_data` is called via `Http3ServerHandler::send_data`. The wrapper first marks the stream as `stream_has_pending_data`, marks itself as `needs_processing` and then calls down into `<SendMessage as SendStream>::send_data`. https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-http3/src/connection_server.rs#L51-L74

Thus the wrapper always marks the stream as `stream_has_pending_data`, even though `<SendMessage as SendStream>::send_data` never writes into the http3 layer buffer, so the stream might not actually have pending data.

Why is this problematic?

1. Say the buffer of the `BufferedStream` of `SendMessage` is empty.
2. Say the user attempts to write data via `Http3ServerHandler::send_data`. Despite not filling the http3 layer buffer, the stream is marked as `stream_has_pending_data`.
3. Say the user next calls `Http3Server::process`, which calls `Http3Server::process_http3`, which calls `Http3ServerHandler::process_http3`, which calls `Http3Connection::process_sending`, which calls `Http3Connection::send_non_control_streams`. `Http3Connection::send_non_control_streams` attempts to flush the http3 layer buffers of all streams marked via `stream_has_pending_data`, e.g. the stream from step (2). It therefore calls `<SendMessage as SendStream>::send` (note `send`, not the previous `send_data`). This function attempts to flush the stream's http3 layer buffer, and in the case where that buffer is empty it enqueues a `DataWritable` event for the user. Given that the buffer of our stream is empty (see (1)), such a `DataWritable` event is always emitted. https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-http3/src/send_message.rs#L236-L264

The user, on receiving the `DataWritable` event, attempts to write again via `Http3ServerHandler::send_data`, taking us back to step (2) and thus closing the busy loop.

How to fix? This commit adds an additional check to the `has_pending_data` function to ensure the stream indeed has pending data. This breaks the above busy loop.
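To illustrate the busy loop and the fix, here is a deliberately simplified, hypothetical model (the real `SendMessage`/`BufferedStream` types in neqo-http3 differ in detail): pending data is only reported when the http3 layer buffer is actually non-empty, so an empty stream is neither re-flushed nor a source of spurious `DataWritable` events.

```rust
// Simplified, hypothetical model of the busy loop and its fix; the real types
// live in neqo-http3 and differ in detail.
struct BufferedStream {
    buf: Vec<u8>,
}

struct SendMessage {
    stream: BufferedStream,
    // Set unconditionally by the Http3ServerHandler::send_data wrapper.
    marked_as_pending: bool,
}

impl SendMessage {
    /// Before the fix: trust the marker alone, even when send_data wrote
    /// nothing into the http3-layer buffer. An empty but "pending" stream is
    /// flushed, emits DataWritable, is written to again (still nothing
    /// buffered), and the loop repeats.
    fn has_pending_data_before_fix(&self) -> bool {
        self.marked_as_pending
    }

    /// After the fix: additionally require a non-empty buffer, so an empty
    /// stream is not revisited and no spurious DataWritable event is emitted.
    fn has_pending_data_after_fix(&self) -> bool {
        self.marked_as_pending && !self.stream.buf.is_empty()
    }
}
```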
This reverts commit 5d314b6.
`neqo_transport::Connection` offers 3 process methods:

- `process`
- `process_output`
- `process_input`

where `process` is a wrapper around `process_input` and `process_output`, calling both in sequence. https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L1099-L1107

Previously `neqo-client` used `process` only, thus continuously interleaving output and input. Say `neqo-client` has multiple datagrams buffered through a GRO read; it could potentially have to do a write in between each `process` call, as each call to `process` with an input datagram might return an output datagram to be written.

With this commit `neqo-client` uses `process_output` and `process_input` directly, thus reducing interleaving on batch reads (GRO and, in the future, recvmmsg) and on future batch writes (GSO and sendmmsg).

Extracted from mozilla#1741.
`neqo_transport::Connection` offers 4 process methods:

- `process`
- `process_output`
- `process_input`
- `process_multiple_input`

where `process` is a wrapper around `process_input` and `process_output`, calling both in sequence (https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L1099-L1107), and `process_input` delegates to `process_multiple_input` (https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L979-L1000).

Previously `neqo-client` used `process` only, thus continuously interleaving output and input. Say `neqo-client` has multiple datagrams buffered through a GRO read; it could potentially have to do a write in between each `process` call, as each call to `process` with an input datagram might return an output datagram to be written.

With this commit `neqo-client` uses `process_output` and `process_multiple_input` directly, thus reducing interleaving on batch reads (GRO and, in the future, recvmmsg) and on future batch writes (GSO and sendmmsg). By using `process_multiple_input` instead of `process` or `process_input`, auxiliary logic like `self.cleanup_closed_streams` only has to run once per input datagram batch rather than once per input datagram.

Extracted from mozilla#1741.
* refactor(client): use process_output and process_multiple_input

  `neqo_transport::Connection` offers 4 process methods: `process`, `process_output`, `process_input` and `process_multiple_input`, where `process` is a wrapper around `process_input` and `process_output`, calling both in sequence (https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L1099-L1107), and `process_input` delegates to `process_multiple_input` (https://github.com/mozilla/neqo/blob/5dfe106669ccb695187511305c21b8e8a8775e91/neqo-transport/src/connection/mod.rs#L979-L1000). Previously `neqo-client` used `process` only, thus continuously interleaving output and input. Say `neqo-client` has multiple datagrams buffered through a GRO read; it could potentially have to do a write in between each `process` call, as each call to `process` with an input datagram might return an output datagram to be written. With this commit `neqo-client` uses `process_output` and `process_multiple_input` directly, thus reducing interleaving on batch reads (GRO and, in the future, recvmmsg) and on future batch writes (GSO and sendmmsg). By using `process_multiple_input` instead of `process` or `process_input`, auxiliary logic like `self.cleanup_closed_streams` only has to run once per input datagram batch rather than once per input datagram. Extracted from #1741.

* process_output before handle

* process_output after each input batch
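A schematic sketch of the resulting client loop, using hypothetical stub types in place of `neqo_transport::Connection`, `Datagram` and `Output` (the real signatures differ in detail): feed a whole input batch with one `process_multiple_input` call, then drain `process_output` until it has nothing left to send.

```rust
use std::time::Instant;

// Stub types standing in for the neqo_transport equivalents.
struct Datagram(Vec<u8>);

enum Output {
    Datagram(Datagram),
    Callback(std::time::Duration),
    None,
}

struct Connection;

impl Connection {
    // Stand-ins for the real process_output / process_multiple_input.
    fn process_output(&mut self, _now: Instant) -> Output {
        Output::None
    }
    fn process_multiple_input<'a, I: IntoIterator<Item = &'a Datagram>>(
        &mut self,
        _dgrams: I,
        _now: Instant,
    ) {
    }
}

/// Handle one batch of datagrams (e.g. a GRO or recvmmsg read): feed the whole
/// batch with a single input call, then drain the output side. Per-batch
/// bookkeeping runs once per batch rather than once per datagram, and no write
/// is interleaved between the individual reads of a batch.
fn handle_batch(conn: &mut Connection, batch: &[Datagram], now: Instant) -> Vec<Datagram> {
    conn.process_multiple_input(batch.iter(), now);

    let mut to_send = Vec::new();
    loop {
        match conn.process_output(now) {
            Output::Datagram(d) => to_send.push(d),
            // Nothing to send right now; a timer or new input triggers the next call.
            Output::Callback(_) | Output::None => break,
        }
    }
    to_send
}
```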
Always use up the send space on the QUIC layer to ensure a `ConnectionEvent::SendStreamWritable` event is received once more send space becomes available. See mozilla#1819 for details. This commit implements option 2. Fixes mozilla#1819.
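The underlying rule can be illustrated with a small hypothetical model (not the neqo-transport API): a writable wake-up is only emitted for a writer that previously used up the available send space, so leaving send space unused risks never being woken.

```rust
// Hypothetical model of the "use up all send space" rule; not the
// neqo-transport API.
struct QuicSendStream {
    send_space: usize,     // bytes the QUIC layer can currently accept
    space_exhausted: bool, // set once a write was cut short by flow control
}

impl QuicSendStream {
    /// Accepts at most `send_space` bytes, like a send on the QUIC layer.
    fn send(&mut self, data: &[u8]) -> usize {
        let n = data.len().min(self.send_space);
        self.send_space -= n;
        if n < data.len() {
            // Only a writer that actually hit the limit is remembered, so only
            // such a writer is later woken with a writable event.
            self.space_exhausted = true;
        }
        n
    }

    /// Called when flow-control credit arrives. The wake-up fires only if the
    /// send space had been used up before; a writer that left space unused
    /// never sees it and can stall.
    fn add_send_space(&mut self, bytes: usize) -> Option<&'static str> {
        self.send_space += bytes;
        if self.space_exhausted {
            self.space_exhausted = false;
            Some("SendStreamWritable")
        } else {
            None
        }
    }
}
```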
@mxinden is this OBE now?
Sorry for the delay here. I plan to extract parts of this pull request for #1693. That said, no need for a draft pull request, I can just cherry-pick off of my branch.
Write and read up to `BATCH_SIZE` datagrams with a single `sendmmsg` and `recvmmsg` syscall.

Previously `neqo_bin::udp::Socket::send` would use `sendmmsg`, but provide only a single buffer, effectively using `sendmsg` instead of `sendmmsg`. Same with `Socket::recv`.

With this commit `Socket::send` provides `BATCH_SIZE` buffers on each `sendmmsg` syscall, thus writing more than one datagram at a time when available. Same with `Socket::recv`.

Part of #1693.
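For the send side, a mirror-image sketch of the receive example above (again Linux-only, via the raw `libc` bindings, with illustrative names and no GSO/cmsg or partial-batch handling): hand the kernel up to a batch of datagrams with one `sendmmsg(2)` call.

```rust
use std::net::UdpSocket;
use std::os::fd::AsRawFd;

const BATCH_SIZE: usize = 32;

/// Sketch: send up to BATCH_SIZE datagrams on a connected UDP socket with one
/// sendmmsg(2) call. Returns how many datagrams the kernel accepted.
fn send_batch(socket: &UdpSocket, datagrams: &[Vec<u8>]) -> std::io::Result<usize> {
    let batch = &datagrams[..datagrams.len().min(BATCH_SIZE)];

    // One iovec and one mmsghdr per datagram in the batch.
    let mut iovecs: Vec<libc::iovec> = batch
        .iter()
        .map(|d| libc::iovec {
            iov_base: d.as_ptr() as *mut libc::c_void,
            iov_len: d.len(),
        })
        .collect();
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // A single syscall hands the whole batch to the kernel.
    let sent = unsafe {
        libc::sendmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            msgs.len() as libc::c_uint,
            0,
        )
    };
    if sent < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(sent as usize)
}
```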