Implement cancellation on disconnect, & parallelize Helper aggregation computations. #3119

branlwyd · 2024-05-10T21:53:07Z

This will be very helpful when requests time out, since a timeout is presented to Janus as a client disconnect. Without cancellation, we would continue processing such a request until it completed.

Previously, we used a rayon threadpool for aggregation computations, but the computations within the handling of a single aggregation job were still serialized. Now, we use a parallel iterator for handling each report share in turn, allowing the VDAF evaluations to be performed in parallel. The Helper methods are also updated such that the parallel computations respect cancellation. Specifically, the receiver will be dropped on cancellation, causing the producer threads' sender to return a SendError when they attempt to send, which will cause try_for_each_with to stop processing, which will cause the producer thread to terminate.
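The cancellation chain described above (receiver dropped, send fails, iteration stops, producer exits) can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the code from this PR: `std::sync::mpsc` and `Iterator::try_for_each` stand in for the actual channel and rayon's `try_for_each_with`, the doubling stands in for a VDAF evaluation, and the name `run_with_cancellation` is hypothetical.

```rust
use std::sync::mpsc::{self, SendError};
use std::thread;

/// Hypothetical helper: runs a producer thread that "processes" `total` items
/// and sends each result over a channel. The consumer receives at most
/// `receive_before_cancel` results and then drops the receiver, which models
/// cancellation. Returns (number of successful sends, received results).
fn run_with_cancellation(total: u64, receive_before_cancel: usize) -> (usize, Vec<u64>) {
    // A rendezvous channel (capacity 0): a send only succeeds once the
    // receiver has actually taken the value.
    let (sender, receiver) = mpsc::sync_channel::<u64>(0);

    let producer = thread::spawn(move || {
        let mut sent = 0usize;
        // Stand-in for rayon's `try_for_each_with`: the first `Err` returned
        // by the closure stops the iteration.
        let _ = (0..total).try_for_each(|item| {
            let output = item * 2; // stand-in for a VDAF evaluation
            sender.send(output)?; // fails with SendError once the receiver is gone
            sent += 1;
            Ok::<(), SendError<u64>>(())
        });
        sent
    });

    let received: Vec<u64> = receiver.iter().take(receive_before_cancel).collect();
    drop(receiver); // "cancellation": the producer's next send returns an error

    let sent = producer.join().expect("producer thread panicked");
    (sent, received)
}
```

Calling `run_with_cancellation(100, 3)` leaves the producer with only three successful sends; the remaining 97 items are never processed, which is the point of dropping the receiver.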

Also, remove the (unstated!) requirement in aggregation_job_writer that report aggregations be provided in order. This eases parallelization of the aggregation job continuation logic.
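One way to drop an ordering requirement like this is for the writer to sort its input itself, so callers (including parallel ones that finish out of order) may pass report aggregations in any order. The following is a rough sketch only; the `ReportAggregation` struct, its `ord` and `report_id` fields, and `into_canonical_order` are simplified, hypothetical stand-ins rather than the real Janus types.

```rust
/// Hypothetical, simplified stand-in for a report aggregation.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ReportAggregation {
    /// Assumed position of this report aggregation within its aggregation job.
    ord: u64,
    /// Assumed identifier, here just a number for illustration.
    report_id: u32,
}

/// Puts report aggregations into canonical order, so that the caller does not
/// have to. Costs one O(n log n) sort per call.
fn into_canonical_order(mut report_aggregations: Vec<ReportAggregation>) -> Vec<ReportAggregation> {
    report_aggregations.sort_unstable_by_key(|ra| ra.ord);
    report_aggregations
}
```

The sort is the price of the relaxed contract: it is extra serial work on every write, in exchange for callers not having to reassemble results in order.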

branlwyd · 2024-05-10T21:56:46Z

A couple of things to pay attention to in review:

* This is mostly "free" parallelism, apart from rayon's inherent synchronization costs, but a couple of places aren't free. First, in aggregation job continuation we now serially build up a new vector associating prepare steps to report aggregations; I think this is required, since otherwise we'd need to mutate the report aggregation iterator from multiple callbacks in parallel. Second, I added a sort over report aggregations to aggregation_job_writer, to remove the requirement that report aggregations be in order. If we can drop either of these, we'd see some efficiency gains.
* I'm not totally sure the tracing spans are being handled properly; please pay close attention here.

branlwyd · 2024-05-10T22:42:25Z

Part of #3117.

divergentdave

I haven't read the changes closely yet, but I have some early feedback based on testing.

As for tracing spans, we will want to create a span, make it a child of the current span, and pass it into the closure currently used in ParallelIterator::map(), similar to what was done before. We should enter the span at the top of the closure, which ensures that events logged inside the closure have an enclosing span. Note that the span documentation says "it is entirely valid for multiple threads to enter the same span concurrently", so it's okay to share one child span. We can pass it as the initialization argument of map_with(), so it gets cloned once per worker thread.
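The "one shared handle, cloned once per worker and used at the top of each closure" shape can be sketched without any external crates. This is purely a std-only analogy: `SpanHandle` is a hypothetical stand-in for a `tracing::Span`, scoped threads stand in for rayon workers, and the explicit `clone()` per worker mimics what passing an init value to `map_with()` is described as doing above. It does not use or reproduce the `tracing` or rayon APIs.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a span: a cheaply cloneable handle that tags
/// every recorded event with the span's name.
#[derive(Clone)]
struct SpanHandle {
    name: Arc<str>,
    events: Arc<Mutex<Vec<String>>>,
}

impl SpanHandle {
    fn record(&self, event: &str) {
        let line = format!("{}: {}", self.name, event);
        self.events.lock().expect("event log poisoned").push(line);
    }
}

/// Processes `items` on up to `workers` threads. Each worker gets its own
/// clone of the one shared span handle and records events under it.
/// Returns the recorded events, sorted for a stable result.
fn process_in_parallel(items: &[u32], workers: usize) -> Vec<String> {
    let span = SpanHandle {
        name: Arc::from("aggregate"),
        events: Arc::new(Mutex::new(Vec::new())),
    };
    let workers = workers.max(1);
    let chunk_size = ((items.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        for chunk in items.chunks(chunk_size) {
            let span = span.clone(); // one clone per worker
            scope.spawn(move || {
                for item in chunk {
                    // Every event is recorded under the shared span.
                    span.record(&format!("item {item}"));
                }
            });
        }
    });

    let mut events = span.events.lock().expect("event log poisoned").clone();
    events.sort();
    events
}
```

Because all clones point at the same underlying span, events from every worker end up attributed to a single enclosing span regardless of which thread handled which item.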

branlwyd · 2024-05-13T23:33:12Z

I merged #3131 into this, as it addresses all current comments. I will update Leader aggregation methods in a follow-on PR, to parallelize the computations & respect cancellation.

Stacked on #3119.
Part of #3033.
Part of #3035.

Previously, we used a rayon threadpool for aggregation computations, but the computations within the handling of a single aggregation job were still serialized. Now, we use a parallel iterator for handling each report share in turn, allowing the VDAF evaluations to be performed in parallel. Also, remove the (unstated!) requirement in aggregation_job_writer that report aggregations be provided in order. This eases parallelization of the aggregation job continuation logic.

This will be very helpful when requests time out, since a timeout is presented to Janus as a client disconnect. Without cancellation, we would continue processing such a request until it completed. Update Helper aggregation methods, which use rayon to parallelize processing, to respect cancellation. Specifically, the `receiver` will be dropped, causing the producer's `sender` to return a SendError, which will cause `try_for_each_with` to stop processing.

Also,

* Receive only 10 (arbitrarily chosen) messages per call to `recv_many`. This will give more await points to cancel on during VDAF computations.
* Rename a variable.
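The effect of capping each receive at 10 messages can be illustrated with a synchronous, std-only sketch. Everything here is an assumption for illustration: `recv_batch` is a hypothetical stand-in that imitates a "receive up to `limit` messages" call on top of `std::sync::mpsc`, and `batch_sizes` simply reports how many messages each call returned. In the async code, the boundary between two such calls is an await point where a cancelled request can stop.

```rust
use std::sync::mpsc::{self, Receiver};

/// The (arbitrarily chosen) cap on messages taken per receive call.
const BATCH_LIMIT: usize = 10;

/// Hypothetical stand-in for a batched receive: blocks for the first message,
/// then takes further messages that are already queued, up to `limit` in
/// total. Returns the number of messages appended to `buffer`; 0 means the
/// channel is closed and empty (or `limit` was 0).
fn recv_batch<T>(receiver: &Receiver<T>, buffer: &mut Vec<T>, limit: usize) -> usize {
    if limit == 0 {
        return 0;
    }
    let first = match receiver.recv() {
        Ok(value) => value,
        Err(_) => return 0, // all senders dropped and nothing left to read
    };
    buffer.push(first);
    let mut count = 1;
    while count < limit {
        match receiver.try_recv() {
            Ok(value) => {
                buffer.push(value);
                count += 1;
            }
            Err(_) => break, // nothing queued right now, or channel closed
        }
    }
    count
}

/// Queues `message_count` messages, then drains them in capped batches and
/// returns the size of each batch.
fn batch_sizes(message_count: usize) -> Vec<usize> {
    let (sender, receiver) = mpsc::channel();
    for i in 0..message_count {
        sender.send(i).expect("receiver is alive");
    }
    drop(sender);

    let mut sizes = Vec::new();
    let mut buffer = Vec::new();
    loop {
        buffer.clear();
        let received = recv_batch(&receiver, &mut buffer, BATCH_LIMIT);
        if received == 0 {
            break;
        }
        // Each trip around this loop corresponds to one opportunity to
        // observe cancellation in the async version.
        sizes.push(received);
    }
    sizes
}
```

With 25 queued messages the loop runs three times (10, 10, then 5), so a smaller limit means more frequent opportunities to stop, at the cost of more receive calls.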

branlwyd · 2024-05-15T21:18:46Z

(Rebased on latest main.)

This implements for the Leader what #3119 did for the Helper. As with the Helper, a rayon threadpool is used to parallelize aggregation computations. These computations respect cancellation (though I'm not sure whether anything short of process death will currently cancel the aggregation_job_driver's computations).

branlwyd requested a review from a team as a code owner May 10, 2024 21:53

divergentdave reviewed May 13, 2024

divergentdave mentioned this pull request May 13, 2024

Slow copies on Tokio threads #3127

Closed

branlwyd mentioned this pull request May 13, 2024

Implement cancellation on disconnect. #3131

Merged

branlwyd changed the title ~~Helper: parallelize aggregation computations.~~ Implement cancellation on disconnect, & parallelize Helper aggregation computations. May 13, 2024

branlwyd requested a review from divergentdave May 14, 2024 21:39

inahga approved these changes May 15, 2024

branlwyd and others added 3 commits May 15, 2024 14:03

Tracing spans.

d756a2f

branlwyd force-pushed the bran/parallel-report-processing branch from 0262b90 to d756a2f Compare May 15, 2024 21:18

divergentdave reviewed May 15, 2024

Review.

bd354f2

branlwyd requested a review from divergentdave May 15, 2024 22:48

divergentdave approved these changes May 15, 2024

branlwyd enabled auto-merge (squash) May 15, 2024 22:59

branlwyd mentioned this pull request May 15, 2024

Cancel on disconnect #2644

Closed

branlwyd merged commit 7cc5b00 into main May 15, 2024
8 checks passed

branlwyd deleted the bran/parallel-report-processing branch May 15, 2024 23:12

branlwyd mentioned this pull request May 15, 2024

Cancel on disconnect #3033

Closed

branlwyd mentioned this pull request May 16, 2024

Parallelize Leader aggregation computations. #3138

Merged
