
Multithreaded crypto #170

Merged: 18 commits, Mar 23, 2021

Conversation

@poljar (Contributor) commented Mar 9, 2021

This PR parallelizes the main heavy crypto paths that we are currently able to parallelize:

  • Parallel room key encryption
  • Parallel key query response handling
  • Improve the get_missing_sessions() method
  • Abstraction over an executor where we can spawn futures.
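
As a rough illustration of the first bullet: encrypting a room key for each device is independent, CPU-bound work that can be fanned out across cores. The sketch below uses only the standard library with hypothetical `Device` and `encrypt_for_device` stand-ins; the real code uses the SDK's own types and spawns futures on an executor rather than raw threads.

```rust
use std::thread;

// Hypothetical stand-in for the real matrix-sdk-crypto device type.
#[derive(Debug, Clone)]
pub struct Device {
    pub curve25519_key: String,
}

// Placeholder for the per-device Olm encryption step, which is
// CPU bound and independent for every device.
pub fn encrypt_for_device(device: &Device, room_key: &str) -> String {
    format!("ciphertext({room_key} -> {})", device.curve25519_key)
}

// Encrypt a room key for many devices in parallel by splitting the
// device list into one chunk per available core.
pub fn encrypt_room_key_parallel(devices: &[Device], room_key: &str) -> Vec<String> {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk_size = ((devices.len() + cores - 1) / cores).max(1);

    thread::scope(|s| {
        let handles: Vec<_> = devices
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|d| encrypt_for_device(d, room_key))
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Joining in spawn order keeps the output aligned with the input.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```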

Notably, the key claiming and Olm session creation path (which does the triple Diffie-Hellman calculation) is missing here. The key claiming path can't be efficiently parallelized yet, since creating a session needs to mutably borrow the libolm Account.

The method that is used in the C library can be found here. It takes a mutable reference to the Account, but realistically it doesn't need to, considering that it only needs to read the Curve25519 identity key from the Account, as described here.

I'm trying to benchmark the concrete improvements we're going to get; the image below shows how much this PR improves room key encryption.

key_sharing_perf_improvement-0

The next image shows the improvement for key queries. Please note that the key query response the benchmark uses is heavy on devices, light on users, and very light on cross-signing keys. Parallelizing the handling of cross-signing keys thus hasn't been done yet and likely won't land as part of this PR.

key_query_perf_improvement-0

@poljar poljar marked this pull request as draft March 9, 2021 18:31
@poljar (Contributor, Author) commented Mar 10, 2021

Some more realistic measurements follow.

The measurements were done using complement, which creates a room with a configured number of users, each having a single E2EE-capable device.

The measurement shows the time it takes to create outbound Olm sessions, encrypt a room key for each Olm session, and finally send out all the to-device requests carrying the encrypted room key. This is done by noting when a keys claim request is sent out and when the last to-device request is sent out; the duration between those two events is our recorded time.

The x86_64 measurements were done using an 8-core Ryzen 7 4750U, while the aarch64 ones used an 8-core Snapdragon 665. Please note that the old measurement for Element-Android was not done using the rust-sdk, so only the green line (the rust-sdk on x86_64) is an apples-to-apples comparison.

The measurement from before

key_share_perf_measurement_old

And now after applying this PR
key_share_perf_measurement

Testing this also revealed that the current slow path for such large groups seems to be this method, which is called right before sending out a keys claim request. We should fix this while we're here, making it feasible to normally use encrypted rooms containing 2k members, at least on a modern x86_64 CPU.

…d store

This removes a massive performance issue since getting sessions is part
of every message send cycle.

Namely, every time we check that all our devices inside a certain room
have established 1-to-1 Olm sessions, we load the account from the store.

This means we go through an account unpickling phase which contains AES
decryption every time we load sessions.

Instead, cache the account info that we're going to attach to sessions
when we initially save or load the account.
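
A stdlib-only sketch of that caching idea, with a hypothetical `AccountInfo` and `Store` (illustrative names, not the SDK's actual types): the info is cached once when the account is saved or loaded, so later session loads skip the unpickle-and-decrypt round trip.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical shape of the cached data that gets attached to sessions.
#[derive(Debug, Clone, PartialEq)]
pub struct AccountInfo {
    pub user_id: String,
    pub device_id: String,
    pub identity_key: String,
}

// Store sketch: instead of unpickling (and AES-decrypting) the whole
// account every time sessions are loaded, cache the AccountInfo once.
pub struct Store {
    account_info: Mutex<Option<Arc<AccountInfo>>>,
}

impl Store {
    pub fn new() -> Self {
        Store { account_info: Mutex::new(None) }
    }

    // Called once, when the account is initially saved or loaded.
    pub fn cache_account_info(&self, info: AccountInfo) {
        *self.account_info.lock().unwrap() = Some(Arc::new(info));
    }

    // Session loads grab the cached copy, skipping the decryption step.
    pub fn cached_account_info(&self) -> Option<Arc<AccountInfo>> {
        self.account_info.lock().unwrap().clone()
    }
}
```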
@poljar (Contributor, Author) commented Mar 11, 2021

The issue with the get_missing_sessions() method was luckily an embarrassing issue in the sled store, and not some fundamental flaw in the logic. Performance is now looking good. The next image shows how the method behaves for a group of members with 2000 devices.

missing_sessions_collection-0

Such improvement 😅

@poljar (Contributor, Author) commented Mar 12, 2021

One remaining mystery was solved: sharing a room key with 150 members (collecting the devices and Olm sessions, encrypting the room key, saving the changed sessions to the db) takes much longer when using the sled store than the memory-only store.

The following image shows the discrepancy:
room_key_sharing_bench

So ~2ms vs 6.6ms.

The flamegraph reveals what's going on
key_sharing_sled_flamegraph

The sled store has to pickle all ~150 sessions that encrypted the room key, since an encryption step advances one of the ratchets inside the Session; the memory store skips this step completely. The sled store also has to perform two AES encryption steps, one for the room key and one for the Session. This could probably be improved with a better AES implementation, but we'll leave that for another time.

The measurements here show the worst case for this method call, since the AES encryption path is only hit if a room key has to be shared; just checking whether it needs to be shared should perform the same with both stores.

We were merging the to-device messages using the extend() method, but
our data has the shape of BTreeMap<UserId, BTreeMap<_, _>>; extending
such a map means that the inner BTreeMap gets dropped if both maps
contain the same UserId.

We need to extend the inner maps, those are guaranteed to contain unique
device ids.
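
The bug and the fix can be demonstrated with plain BTreeMaps; `merge_buggy` and `merge_fixed` below are illustrative names, not the SDK's actual functions.

```rust
use std::collections::BTreeMap;

// to-device messages shaped as user id -> device id -> message content
type Messages = BTreeMap<String, BTreeMap<String, String>>;

// Buggy merge: extend() on the outer map replaces the entire inner
// map whenever both sides contain the same user id, dropping messages.
pub fn merge_buggy(mut a: Messages, b: Messages) -> Messages {
    a.extend(b);
    a
}

// Fixed merge: extend the inner maps instead; their device ids are
// guaranteed to be unique, so nothing gets dropped.
pub fn merge_fixed(mut a: Messages, b: Messages) -> Messages {
    for (user, devices) in b {
        a.entry(user).or_default().extend(devices);
    }
    a
}
```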
@poljar (Contributor, Author) commented Mar 17, 2021

To create an executor abstraction we'll probably use the async_executors crate, which supports WASM.
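
For illustration only, such an abstraction might have roughly this shape: a trait the crypto code can target, with per-platform implementations behind it. This stdlib-only sketch is not the async_executors API; its `NativeExecutor` simply drives a boxed future to completion on the current thread.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Toy executor abstraction: crypto code talks to this trait, and each
// platform (native, WASM) plugs in its own implementation.
pub trait Executor {
    fn block_on<T>(&self, fut: Pin<Box<dyn Future<Output = T> + Send>>) -> T;
}

// Waker that unparks the thread driving the future.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

pub struct NativeExecutor;

impl Executor for NativeExecutor {
    fn block_on<T>(&self, mut fut: Pin<Box<dyn Future<Output = T> + Send>>) -> T {
        let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
        let mut cx = Context::from_waker(&waker);
        // Poll until ready, parking the thread while waiting for a wake-up.
        loop {
            match fut.as_mut().poll(&mut cx) {
                Poll::Ready(value) => return value,
                Poll::Pending => thread::park(),
            }
        }
    }
}
```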

@poljar poljar mentioned this pull request Mar 22, 2021