Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement cancellation on disconnect, & parallelize Helper aggregation computations. #3119

Merged
merged 4 commits into from
May 15, 2024

Conversation

branlwyd
Copy link
Member

@branlwyd branlwyd commented May 10, 2024

This will be very helpful when we have requests which timeout, which is presented to Janus as a client disconnect. Otherwise, we would continue processing the request until it is complete.

Previously, we used a rayon threadpool for aggregation computations, but the computations within the handling of a single aggregation job were still serialized. Now, we use a parallel iterator for handling each report share in turn, allowing the VDAF evaluations to be performed in parallel. The Helper methods are also updated such that the parallel computations respect cancellation. Specifically, the receiver will be dropped on cancellation, causing the producer threads' sender to return a SendError when they attempt to send, which will cause try_for_each_with to stop processing, which will cause the producer thread to terminate.

Also, remove the (unstated!) requirement in aggregation_job_writer than report aggregations be provided in order. This eases parallelization of the aggregation job continuation logic.

@branlwyd branlwyd requested a review from a team as a code owner May 10, 2024 21:53
@branlwyd
Copy link
Member Author

A couple of things to pay attention to in review:

  • This is mostly "free" parallelism outside of whatever rayon's inherent synchronization costs are, but there are a couple of places that aren't free. First, in aggregation job continuation we now serially build up a new vector associating prepare steps to report aggregations; I think this is required since otherwise we'd need to mutate the report aggregation iterator from multiple callbacks in parallel. Second, I added a sort over report aggregations to aggregation_job_writer, to remove the requirement that report aggregations be in order. If we can drop either of these, we'd see some efficiency gains.
  • I'm not totally sure traces/spans are being performed properly, please pay close attention here.

@branlwyd
Copy link
Member Author

branlwyd commented May 10, 2024

Part of #3117.

Copy link
Contributor

@divergentdave divergentdave left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't read the changes closely yet, but I have some early feedback based on testing.

As for tracing spans, we will want to create a span, make it a child of the current span, and pass it to the closure currently used in ParallelIterator::map(). (similar to before) There, we should enter the span at the top of the closure. This will ensure our events logged inside the closure have an enclosing span. Note that the span documentation says "it is entirely valid for multiple threads to enter the same span concurrently", so it's okay to share one child span. We can pass it as the initialization argument of map_with(), so it gets cloned once per worker thread.

aggregator/src/aggregator.rs Outdated Show resolved Hide resolved
aggregator/src/aggregator.rs Outdated Show resolved Hide resolved
@branlwyd branlwyd changed the title Helper: parallelize aggregation computations. Implement cancellation on disconnect, & parallelize Helper aggregation computations. May 13, 2024
@branlwyd
Copy link
Member Author

I merged #3131 into this, as it addresses all current comments. I will update Leader aggregation methods in a follow-on PR, to parallelize the computations & respect cancellation.

Stacked on #3119.
Part of #3033.
Part of #3035.

branlwyd and others added 3 commits May 15, 2024 14:03
Previously, we used a rayon threadpool for aggregation computations, but
the computations within the handling of a single aggregation job were
still serialized. Now, we use a parallel iterator for handling each
report share in turn, allowing the VDAF evaluations to be performed in
parallel.

Also, remove the (unstated!) requirement in aggregation_job_writer than
report aggregations be provided in order. This eases parallelization of
the aggregation job continuation logic.
This will be very helpful when we have requests which timeout, which is
presented to Janus as a client disconnect. Otherwise, we would continue
processing the request until it is complete.

Update Helper aggregation methods, which use rayon to parallelize
processing, to respect cancellation. Specifically, the `receiver` will
be dropped, causing the producer's `sender` to return a SendError, which
will cause `try_for_each_with` to stop processing.
Also,
 * Receive only 10 (arbitrarily chosen) messages per call to
   `recv_many`. This will give more await points to cancel on during
   VDAF computations.
 * Rename a variable.
@branlwyd branlwyd force-pushed the bran/parallel-report-processing branch from 0262b90 to d756a2f Compare May 15, 2024 21:18
@branlwyd
Copy link
Member Author

(Rebased on latest main.)

aggregator/src/aggregator.rs Outdated Show resolved Hide resolved
aggregator/src/aggregator.rs Show resolved Hide resolved
aggregator/src/aggregator.rs Outdated Show resolved Hide resolved
aggregator/src/aggregator/aggregation_job_continue.rs Outdated Show resolved Hide resolved
@branlwyd branlwyd enabled auto-merge (squash) May 15, 2024 22:59
@branlwyd branlwyd mentioned this pull request May 15, 2024
@branlwyd branlwyd merged commit 7cc5b00 into main May 15, 2024
8 checks passed
@branlwyd branlwyd deleted the bran/parallel-report-processing branch May 15, 2024 23:12
@branlwyd branlwyd mentioned this pull request May 15, 2024
branlwyd added a commit that referenced this pull request May 16, 2024
This implements for the Leader what #3119 did for the Helper. Like with
the Helper, a rayon threadpool is used to parallelize aggregation
computations. These computations respect cancellation (though I'm not
sure if anything short of process death will currently cancel the
aggregation_job_driver's computations).
branlwyd added a commit that referenced this pull request May 20, 2024
This implements for the Leader what #3119 did for the Helper. Like with
the Helper, a rayon threadpool is used to parallelize aggregation
computations. These computations respect cancellation (though I'm not
sure if anything short of process death will currently cancel the
aggregation_job_driver's computations).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants