
Multithreaded crypto #170

Merged: 18 commits, Mar 23, 2021

Conversation

@poljar (Contributor) commented Mar 9, 2021

This PR parallelizes the main heavy crypto paths that we are currently able to parallelize:

  • Parallel room key encryption
  • Parallel key query response handling
  • Improve the get_missing_sessions() method
  • Abstraction over an executor where we can spawn futures.
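
As a rough illustration of the first bullet: encrypting a room key for each device is independent, CPU-bound work that can be fanned out across cores. The sketch below uses only the standard library with hypothetical `Device` and `encrypt_for_device` stand-ins; the real code uses the SDK's own types and spawns futures on an executor rather than raw threads.

```rust
use std::thread;

// Hypothetical stand-in for the real matrix-sdk-crypto device type.
#[derive(Debug, Clone)]
pub struct Device {
    pub curve25519_key: String,
}

// Placeholder for the per-device Olm encryption step, which is
// CPU bound and independent for every device.
pub fn encrypt_for_device(device: &Device, room_key: &str) -> String {
    format!("ciphertext({room_key} -> {})", device.curve25519_key)
}

// Encrypt a room key for many devices in parallel by splitting the
// device list into one chunk per available core.
pub fn encrypt_room_key_parallel(devices: &[Device], room_key: &str) -> Vec<String> {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk_size = ((devices.len() + cores - 1) / cores).max(1);

    thread::scope(|s| {
        let handles: Vec<_> = devices
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|d| encrypt_for_device(d, room_key))
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Joining in spawn order keeps the output aligned with the input.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```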

Notably, the key claiming and Olm session creation path (which does the triple Diffie-Hellman calculation) is missing here. The key claiming path can't be efficiently parallelized yet, since creating a session needs to mutably borrow the libolm Account.

The method that is used in the C library can be found here. It takes a mutable reference to the Account, but realistically it doesn't need to, considering that it only needs to read the Curve25519 identity key from the Account, as described here.

I'm trying to benchmark the concrete improvements we're going to get; the image below shows how much this PR improves room key encryption.

key_sharing_perf_improvement-0

The next image shows the improvement for key queries. Please note that the key query response the benchmark uses is heavy on devices, light on users, and very light on cross-signing keys. Parallelizing the handling of cross-signing keys thus hasn't been done yet and likely won't land as part of this PR.

key_query_perf_improvement-0

@poljar poljar marked this pull request as draft March 9, 2021 18:31
@poljar (Contributor, Author) commented Mar 10, 2021

Some more realistic measurements follow.

The measurements were done using complement, which creates a room with a configured number of users, each having a single E2EE-capable device.

The measurement shows the time it takes to create outbound Olm sessions, encrypt a room key for each Olm session, and finally send out all the to-device requests carrying the encrypted room key. This is done by noting when a keys claim request is sent out and when the last to-device request is sent out; the duration between those two events is our recorded time.

The x86_64 measurements were done using an 8-core Ryzen 7 4750U, while the aarch64 ones used an 8-core Snapdragon 665. Please note that the old measurement for Element-Android was not done using the rust-sdk, so only the green line (the rust-sdk on x86_64) is an apples-to-apples comparison.

The measurement from before

key_share_perf_measurement_old

And now after applying this PR
key_share_perf_measurement

Testing this also revealed that the current slow path for such large groups seems to be this method, which is called right before sending out a keys claim request. We should fix this while we're here, making it feasible to normally use encrypted rooms containing 2k members, at least on a modern x86_64 CPU.

…d store

This removes a massive performance issue since getting sessions is part
of every message send cycle.

Namely, every time we check that all our devices inside a certain room
have established 1-to-1 Olm sessions, we load the account from the store.

This means we go through an account unpickling phase which contains AES
decryption every time we load sessions.

Instead, cache the account info that we're going to attach to sessions
when we initially save or load the account.
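
A stdlib-only sketch of that caching idea, with a hypothetical `AccountInfo` and `Store` (illustrative names, not the SDK's actual types): the info is cached once when the account is saved or loaded, so later session loads skip the unpickle-and-decrypt round trip.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical shape of the cached data that gets attached to sessions.
#[derive(Debug, Clone, PartialEq)]
pub struct AccountInfo {
    pub user_id: String,
    pub device_id: String,
    pub identity_key: String,
}

// Store sketch: instead of unpickling (and AES-decrypting) the whole
// account every time sessions are loaded, cache the AccountInfo once.
pub struct Store {
    account_info: Mutex<Option<Arc<AccountInfo>>>,
}

impl Store {
    pub fn new() -> Self {
        Store { account_info: Mutex::new(None) }
    }

    // Called once, when the account is initially saved or loaded.
    pub fn cache_account_info(&self, info: AccountInfo) {
        *self.account_info.lock().unwrap() = Some(Arc::new(info));
    }

    // Session loads grab the cached copy, skipping the decryption step.
    pub fn cached_account_info(&self) -> Option<Arc<AccountInfo>> {
        self.account_info.lock().unwrap().clone()
    }
}
```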
@poljar (Contributor, Author) commented Mar 11, 2021

The issue with the get_missing_sessions() method was luckily an embarrassing issue in the sled store, and not some fundamental flaw in the logic. Performance is now looking good. The next image shows how the method behaves for a group of members with 2000 devices.

missing_sessions_collection-0

Such improvement 😅

@poljar (Contributor, Author) commented Mar 12, 2021

One remaining mystery was solved: sharing a room key with 150 members (collecting the devices and Olm sessions, encrypting the room key, saving the changed sessions to the db) takes much longer when using the sled store than the memory-only store.

The following image shows the discrepancy:
room_key_sharing_bench

So ~2ms vs 6.6ms.

The flamegraph reveals what's going on
key_sharing_sled_flamegraph

The sled store has to pickle all ~150 sessions that encrypted the room key, since an encryption step advances one of the ratchets inside the Session; the memory store skips this step completely. The sled store also has to perform two AES encryption steps, one for the room key and one for the Session. This could probably be improved with a better AES implementation, but we'll leave that for another time.

The measurements here show the worst case for this method call, since the AES encryption path is only hit if a room key has to be shared; just checking whether it needs to be shared should perform the same with both stores.

We were merging the to-device messages using the extend() method, but
our data has the shape of BTreeMap<UserId, BTreeMap<_, _>>; extending
such a map means that the inner BTreeMap gets dropped if both maps
contain the same UserId.

We need to extend the inner maps, those are guaranteed to contain unique
device ids.
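
The bug and the fix can be demonstrated with plain BTreeMaps; `merge_buggy` and `merge_fixed` below are illustrative names, not the SDK's actual functions.

```rust
use std::collections::BTreeMap;

// to-device messages shaped as user id -> device id -> message content
type Messages = BTreeMap<String, BTreeMap<String, String>>;

// Buggy merge: extend() on the outer map replaces the entire inner
// map whenever both sides contain the same user id, dropping messages.
pub fn merge_buggy(mut a: Messages, b: Messages) -> Messages {
    a.extend(b);
    a
}

// Fixed merge: extend the inner maps instead; their device ids are
// guaranteed to be unique, so nothing gets dropped.
pub fn merge_fixed(mut a: Messages, b: Messages) -> Messages {
    for (user, devices) in b {
        a.entry(user).or_default().extend(devices);
    }
    a
}
```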
@poljar (Contributor, Author) commented Mar 17, 2021

To create an executor abstraction we'll probably use the async_executors crate, which supports WASM.
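
For illustration only, such an abstraction might have roughly this shape: a trait the crypto code can target, with per-platform implementations behind it. This stdlib-only sketch is not the async_executors API; its `NativeExecutor` simply drives a boxed future to completion on the current thread.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Toy executor abstraction: crypto code talks to this trait, and each
// platform (native, WASM) plugs in its own implementation.
pub trait Executor {
    fn block_on<T>(&self, fut: Pin<Box<dyn Future<Output = T> + Send>>) -> T;
}

// Waker that unparks the thread driving the future.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

pub struct NativeExecutor;

impl Executor for NativeExecutor {
    fn block_on<T>(&self, mut fut: Pin<Box<dyn Future<Output = T> + Send>>) -> T {
        let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
        let mut cx = Context::from_waker(&waker);
        // Poll until ready, parking the thread while waiting for a wake-up.
        loop {
            match fut.as_mut().poll(&mut cx) {
                Poll::Ready(value) => return value,
                Poll::Pending => thread::park(),
            }
        }
    }
}
```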

@poljar poljar mentioned this pull request Mar 22, 2021