
Encrypted appservices: how to? #10653

Open

turt2live opened this issue Aug 18, 2021 · 15 comments
Labels
A-Application-Service Related to AS support · T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. · Z-Time-Tracked Element employees should track their time spent on this issue/PR.

Comments

@turt2live
Member

turt2live commented Aug 18, 2021

The end goal of this adventure is to have an appservice which can participate in encrypted rooms. While this doesn't necessarily provide end-to-end encryption, it does mean that encrypted messages can be decrypted by application services. The driving use case is largely private contexts (DMs / "ping me internally"), which in practice means a high-traffic, single-user, highly reliable bot using the appservice API (sync won't keep up). Other use cases, like bridges, are worth considering even if they aren't driving this work, because someone will try it despite warnings.

There are 3-4 major pieces needed to get this support working:

  • Sending to-device messages to appservices for their interested users. This has security context to consider.
  • Sending device list changes to appservices.
  • Sending One Time Key (OTK) counts to appservices for each of their interested users.
  • Sending fallback key usage to appservices for each of their interested users (we could probably go without this, but would rather not).

Other considerations, like setting up key backups, are already largely solved due to the masquerading support on the endpoints. Account data changes might need to make it down to appservices, but this can be considered future work for the purposes of this conversation.

Sending device messages (to-device)

This is effectively sending ephemeral events to appservices, which Synapse already supports for read receipts, typing, and presence. Currently the streams appear to operate off a from and to token system, grabbing a batch of events between those tokens with some filter criteria; however, for appservices the filter criteria are a bit more nuanced. Specifically, the appservice can easily have millions of users under its namespace, which means millions of inboxes to check. We could add an array of user IDs to the DB fetch function for inboxes to check, however this can, again, be millions of entries long: this would be bad for the DB server.

Appservices already have to care about implicit versus explicit user reservations and will likely have to do additional filtering based upon that, so the proposal I have is largely that the appservice handler grab all device messages for all inboxes between the stream tokens, then filter that down to interested users in code. The function would build two piles: implicit interest and explicit interest. Explicit interest would cause the function to delete those messages from the inbox as they'll shortly be "delivered". Implicit interest means those messages won't be deleted, but will still be sent to the appservice. This is where the security context comes into play: this provides a way for appservices to theoretically intercept device messages without the user knowing. However, the server admin will have had to approve the appservice implicitly by installing it, so it might be fine. This might need more discussion.
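For concreteness, a minimal sketch of that in-code filtering, assuming a hypothetical message shape and ignoring Synapse's actual storage layer:

```python
import re
from typing import Dict, List, Tuple

# Hypothetical message shape: each to-device message carries its recipient.
ToDeviceMessage = Dict[str, object]


def split_by_interest(
    messages: List[ToDeviceMessage],
    exclusive_patterns: List[str],
    implicit_patterns: List[str],
) -> Tuple[List[ToDeviceMessage], List[ToDeviceMessage]]:
    """Split fetched to-device messages into explicit and implicit piles.

    Explicit matches (exclusive namespaces) can be deleted from the inbox
    once delivered; implicit matches are delivered but left in place so a
    real user's client can still /sync them down.
    """
    exclusive = [re.compile(p) for p in exclusive_patterns]
    implicit = [re.compile(p) for p in implicit_patterns]

    explicit_pile: List[ToDeviceMessage] = []
    implicit_pile: List[ToDeviceMessage] = []
    for msg in messages:
        user_id = str(msg["to_user_id"])  # hypothetical field name
        if any(p.fullmatch(user_id) for p in exclusive):
            explicit_pile.append(msg)
        elif any(p.fullmatch(user_id) for p in implicit):
            implicit_pile.append(msg)
        # Messages for users outside the appservice's namespaces are dropped.
    return explicit_pile, implicit_pile
```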

An issue with deleting the device messages is reliability: where the code appears to build the transaction and where that transaction is sent to the appservice is disjointed. This isn't important for things like presence, typing, or read receipts (to a degree), however for device messages if the server were to be restarted while an undelivered transaction is in the queue then the messages will be lost, leading to UISIs (in this use case).

If Synapse's streams send events and not just stream positions over replication then the appservice handler might be able to just queue the device messages into the transaction straight from there, thus not having to bother with any stream ID nonsense. However, this assumes that replication is reliable and that the events are in fact sent over replication.

Sending device list changes

The current proposal for this is to send device list changes through at the top level of the transaction: matrix-org/matrix-spec-proposals#3202

Filtering changes is relatively easy as the logic should already exist for presence. Determining when a device list has left the appservice's point of view is a bit more of a challenge, I think, as it'd mean tying the appservice handling to membership events and running a fairly expensive loop of checking all the rooms the user was in prior to the leave and matching those rooms against appservice interest.

For the purposes of determining interest, this sort of thing would usually be checked against whether the appservice was interested in the room or not; however, given the intended use case of device lists, it seems fair to instead require that the appservice has a relevant member in the room to trigger interest. This changes the check to whether the appservice no longer shares a room with the user after the leave event, which might be even more expensive but involve less spam?
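As a rough sketch of that check (the room-membership lookup is a stand-in, not a real Synapse store method):

```python
from typing import Callable, Iterable, Set


def device_list_left_for_as(
    remaining_joined_rooms: Iterable[str],
    as_users_in_room: Callable[[str], Set[str]],
) -> bool:
    """After a user leaves a room, decide whether the appservice should be
    told the user has "left" its view of device lists, i.e. the appservice
    no longer shares any room with them.

    `as_users_in_room` stands in for a storage lookup returning the
    appservice-interesting users joined to a given room.
    """
    for room_id in remaining_joined_rooms:
        if as_users_in_room(room_id):
            # Still at least one shared room: keep tracking the device list.
            return False
    return True
```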

This also has the potential to be the same problem as device messages: reliability during downtime of the appservice is diminished.

OTKs & fallback keys

This is an unsolved problem as of writing. The MSC has a thread saying that this information shouldn't be sent in a similar fashion to /sync (which always presents it for the user) due to performance concerns: if it always included the information, there'd be millions of objects in the JSON to worry about. Instead, the suggestion is to include the field when the counts/fallback key usage changes. I have no idea if this is even possible to detect in Synapse, but the cheap solution would be to include the fields if the relevant user is receiving a device message.
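Illustratively, the cheap solution would produce a transaction shaped roughly like the dict below, with counts attached only because the user received a to-device message. The exact field names and nesting are whatever MSC3202 ends up specifying (likely with unstable prefixes), so treat this purely as a sketch:

```python
# Illustrative transaction body only; field names are not authoritative.
transaction = {
    "events": [],  # PDUs as usual
    "to_device": [
        {
            "type": "m.room_key_request",
            "sender": "@alice:example.org",
            "to_user_id": "@bot:example.org",  # hypothetical field name
            "to_device_id": "BOTDEVICE",       # hypothetical field name
            "content": {},
        }
    ],
    # Included only because @bot:example.org received a message above.
    "device_one_time_keys_count": {
        "@bot:example.org": {"BOTDEVICE": {"signed_curve25519": 20}},
    },
}
```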

Possible alternatives

We could invest even more heavily in /sync, however that doesn't help the encrypted bridges use case. A million sync streams sounds worse than an expensive appservice loop, but neither is going to perform well anyways.

@turt2live turt2live added the Z-Time-Tracked Element employees should track their time spent on this issue/PR. label Aug 18, 2021
@erikjohnston
Member

Implicit interest means those messages won't be deleted, but will still be sent to the appservice.

Do we have a use case for this? Reading the to-device messages doesn't let you decrypt them without knowing the secrets generated by one of the clients.

@turt2live
Member Author

Not at present, but there is potential for it to be used in some of the customer sectors. Device messages don't have to be encrypted, and custom events are likely to be involved.

@turt2live
Member Author

Just an update on this: I think as a first cut I can bludgeon my way through enough of the project to get something testable and measure performance from that. Then from there we can see just how bad it actually is, similar to the conclusions found in #8903

@anoadragon453
Member

The function would build two piles: implicit interest and explicit interest.

I'm not quite clear on what messages should go into each pile. Is the explicit one for to-device messages towards users that are in the appservice's namespace, whereas the implicit one is just for users that share rooms?

@turt2live
Member Author

It'd be tied to the exclusive: true flag in the appservice's registration: exclusive is usually used by bridges/appservices which control every aspect of the accounts, while implicit (exclusive: false) tends to be used for appservices which just need to make requests as other users every so often.

An example of an implicit appservice is the communities v2 proxy we ended up writing to prototype an early version of Spaces: it needed to impersonate the caller, so had a @.*:domain implicit namespace that it used very carefully.
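For illustration, the relevant namespaces portion of the two kinds of registration might look like this (the real registration file is YAML; shown here as Python literals, with made-up regexes):

```python
# Exclusive namespace: the appservice owns these users outright (typical bridge).
bridge_namespaces = {
    "users": [{"exclusive": True, "regex": r"@_examplebridge_.*:example\.org"}],
}

# Implicit (non-exclusive) namespace: the appservice may act as these users but
# does not own them, e.g. the communities v2 proxy masquerading as the caller.
proxy_namespaces = {
    "users": [{"exclusive": False, "regex": r"@.*:example\.org"}],
}
```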

@erikjohnston
Member

@turt2live have you got what you need from the backend team to be getting on with?

@anoadragon453
Member

@erikjohnston Note that I've now taken over the Synapse side of this work.

@anoadragon453 anoadragon453 self-assigned this Aug 26, 2021
@anoadragon453
Member

@turt2live One optimisation we could make is deciding which AS should receive a particular to-device message at the point it is created/received, and storing those mappings in a table in the database. This would be in contrast to calculating it all after the fact.
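A minimal sketch of what such a mapping could look like; the table and column names here are hypothetical, not anything that exists in Synapse:

```python
# Hypothetical schema: when a to-device message is stored, also record which
# appservice(s) should receive it, so delivery becomes a simple indexed lookup.
CREATE_AS_TO_DEVICE_MAPPING = """
CREATE TABLE IF NOT EXISTS appservice_to_device_messages (
    appservice_id TEXT NOT NULL,
    message_id    BIGINT NOT NULL,  -- references the to-device inbox row
    user_id       TEXT NOT NULL,    -- recipient, kept for cleanup/debugging
    PRIMARY KEY (appservice_id, message_id)
);
"""
```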

That would be more efficient to pull out, but it does have limitations for some use cases. One I can think of is:

  • @alice:example.com is an existing account with a few devices.
  • An application service is added which includes @alice:example.com in its user namespace regex. It also claims exclusive: true.
  • The application service won't receive historical to-device messages to @alice:example.com's devices, as they came in before the AS was set to cover this user.

Is deciding AS destinations when to-device messages come in a problem anywhere, or would that generally be "fine"?

@turt2live
Member Author

I think it'd be fine. We also have the appservice flag on the user account to know if the user account was created by the appservice, which should be a good metric for deciding if the appservice has authority over the device messages.

@anoadragon453 anoadragon453 added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Sep 9, 2021
@anoadragon453
Member

@turt2live @reivilibre and I had a conversation about how sending one-time key counts and fallback key information to application services would work.

For a bit of background, devices claim one-time keys from other devices when starting a new encryption session with them. These one-time keys are uploaded by a device to the homeserver. Typically for clients, the count of how many one-time keys you have left (per algorithm type) is returned in the response from /sync. When this number starts to get low, the device will spend a bit of effort generating and uploading new one-time keys to replenish them.

Fallback keys are pretty much what they say on the tin. In the case of running out of one-time keys on the server, a fallback key will be returned instead of a one-time key when another device tries to claim it. These fallback keys are initially generated by the device, and can be useful when the device goes offline for a long time while other devices encrypt messages to it. When the device comes back online, they'll be told via /sync that the fallback key was used, and that it should upload a new one (as well as new one-time keys).
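For reference, this is roughly how a client sees both pieces of information in a /sync response today (shown as a Python dict; the fallback field shipped with an unstable prefix while MSC2732 was in progress):

```python
sync_response_excerpt = {
    # How many one-time keys of each algorithm this device still has uploaded.
    "device_one_time_keys_count": {"signed_curve25519": 49},
    # Algorithms for which an *unused* fallback key exists; if an algorithm
    # disappears from this list, the fallback key was used and should be rotated.
    "device_unused_fallback_key_types": ["signed_curve25519"],
}
```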


The problem is that the current design 1. always returns one-time key counts and 2. always returns whether your fallback key has been used, in the response to each sync request. This would be very noisy for the many, many potential users an appservice could have interest in. To help mitigate this, the three of us came up with a design to help limit the amount of necessary traffic, while still conveying the same information.

For each transaction sent to an application service, the AS users that are recipients of any EDU or PDU in the transaction should have their one-time key counts included for all of their devices. The same goes for fallback keys - however fallback key information is only included when a fallback key has been used (a client's device doesn't need to take action otherwise).

A user is defined as a "recipient" of a PDU/EDU if that message would normally be received by them down /sync. For example, a user would be a recipient of a room message if the user is in the room it was sent in. The same goes for typing events. For events not bound to a room, like to-device messages, the user is a recipient if the to-device message was intended for one of their devices.
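A sketch of how the recipient set could be computed when building a transaction, under the definition above (the room-membership lookup and the to-device field name are stand-ins, not Synapse internals):

```python
from typing import Callable, Iterable, Set


def users_needing_key_info(
    pdus: Iterable[dict],
    to_device_messages: Iterable[dict],
    as_users_in_room: Callable[[str], Set[str]],
) -> Set[str]:
    """Collect the AS users that are recipients of anything in this
    transaction, i.e. the users whose OTK counts (and, where used, fallback
    key info) should be attached."""
    recipients: Set[str] = set()
    for event in pdus:
        recipients |= as_users_in_room(event["room_id"])
    for msg in to_device_messages:
        recipients.add(msg["to_user_id"])  # hypothetical field name
    return recipients
```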

Note that limiting the sending of these counts, rather than sending them for all users on every AS transaction, is currently considered an implementation optimisation, rather than something that should be baked into the spec - thus why I'm discussing it here rather than on MSC3202.

The application service is expected to use /keys/upload to replenish these keys. One-time key counts for users are expected to drop piecemeal, and thus bulk replenishing of keys doesn't seem like a frequent use case.
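For example, replenishing could look something like the sketch below, masquerading as the AS user via the standard user_id query parameter on /keys/upload (asserting a specific device is the device-masquerading part of MSC3202 and is omitted here); the homeserver URL and token are placeholders:

```python
import requests

HOMESERVER = "https://example.org"   # placeholder homeserver base URL
AS_TOKEN = "secret-as-token"         # the as_token from the AS registration


def replenish_otks(user_id: str, one_time_keys: dict) -> dict:
    """Upload fresh one-time keys on behalf of an AS user."""
    resp = requests.post(
        f"{HOMESERVER}/_matrix/client/v3/keys/upload",
        params={"user_id": user_id},
        headers={"Authorization": f"Bearer {AS_TOKEN}"},
        json={"one_time_keys": one_time_keys},
        timeout=10,
    )
    resp.raise_for_status()
    # The response reports the remaining counts per algorithm, e.g.
    # {"one_time_key_counts": {"signed_curve25519": 50}}.
    return resp.json()["one_time_key_counts"]
```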

One outstanding question: should one-time key and fallback key information for sender_localpart users always be included?

@turt2live
Member Author

Always including the sender_localpart's counts seems reasonable to me, given it is most likely to run through the keys in a typical use case.

@reivilibre
Contributor

That matches up with what I understood; thanks for writing it up!

Note that limiting the sending of these counts, rather than sending them for all users on every AS transaction, is currently considered an implementation optimisation, rather than something that should be baked into the spec - thus why I'm discussing it here rather than on MSC3202.

I believe the spec only says that they need be sent when they change; as such, sending them whenever a message is inbound for the user is actually chattier than the spec requires.
For future reference, I'll note that there's an edge case here: when keys are claimed but then not actually used, the OTK counts have 'changed' but this approach will not send that change to the AS (this is maybe a small spec violation). This is a trade-off we're willing to accept because to do otherwise would probably need a lot of things to be dug up (and notably, we can still do that digging up later; the spec would allow that). The impact should be small because key claims are rate-limited, and if the AS is busy enough then someone will eventually send a message destined for the user in question. Always including sender_localpart will minimise the impact further.

@turt2live
Member Author

I've updated MSC3202 with the above to explicitly write down how implementations might want to take this approach, and why it's not too bad.

@erikjohnston
Member

It's worth noting a few points for clarity, sorry if this is regurgitating stuff already said.

The OTK pool of a user can be exhausted without the user ever receiving an event or to-device message (though this would usually be due to a malicious actor). I think it's probably fine as we a) have fallback keys and b) the OTK counts would get sent the next time the AS user got a message.

I wonder if we should only include counts for AS users receiving a to-device message? That is the only real reason for an OTK to get consumed. Though that means it'll take longer for the AS to see updated counts if a malicious actor has drained the OTK pool.

Has any thought gone into how to handle auto-provisioning AS users? Right now a real user can start a DM with an AS user and the AS will auto-provision it if it doesn't already exist. That will no longer work if the AS now has to upload keys before an AS user can receive a message. I'm wondering if a different solution here would be to proxy requests for OTKs to the AS itself? That way the AS can auto-provision things, and we can skip all the logic for sending OTK counts.

@turt2live
Member Author

Auto-provisioning should be fine? The request being blocked for user creation can wait a little bit longer while the appservice uploads a handful of keys (instead of all 50 or whatever it ends up wanting to generate). The remainder of the keys can be uploaded in the background.

I wonder if we should only include counts for AS users receiving a to-device message?

It's possible for the counts to diminish without a device message, so the theory was that being involved in an encrypted conversation is the next best point to consider including counts.
