-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
We don't retry E2EE to-device messages at all. #12851
Comments
Does the order of the to_device messages need maintaining? |
Sort of. Weird things may happen if you try to send an encrypted to_device message to some device, receive a to_device message from that same device, send another to_device message to that device, and then retry sending the first message. (If you don't have the "receive a to_device message from that same device" step, then everything should be fine.) But it would probably be fine if we just retry sending immediately one or two times. |
The issue I'm thinking is that if one request fails there is a decent probability an immediate retry will too, which might mean we need to queue and issue in-order retrying with a back-off or something and persisting so that we don't forget when the app gets restarted |
What should happen in this case |
Hmm. The problem with that case is that if you receive a message from a device, and then send a new message, then the ratchet would be advanced in such a way that any old messages that you had intended to send will be undecryptable. But if you retry the old message (and it succeeds) before sending the new message, then all should be fine. So perhaps one way to solve this would be to put all the to-be-sent to-device messages in a queue, and don't send the next one until the current one has been sent (or we've given up)? |
At least vodozemac would happily handle this case since it retains a number of old chains so that it can decrypt messages encrypted with the old chain but arriving late. So things should still work out even if this scenario happens. |
I think all this was trying to say is:
There is no need to drop todevice messages on the floor. Separately, in future we'll need the ability to cancel to-device messages which are no longer relevant - both from being stored on the server, and being queued up to be sent to the server in the first place. For instance, MSC3401 needs this for killing off irrelevant stale VoIP events. So, if the application knows a message should be unqueued/deleted, it can do so. But the stack itself should just do reliable delivery by default. This causes a large number of my most obnoxious UISIs fwiw - c.f. the backlinks above. It can also undermine MSC3401 pretty badly. |
Having looked at this for a bit, an initial plan would be to just make the to-device API match the event sending API and make it retry automatically for a period of time, then reject with a mechanism to initiate a retry manually. However, if we really make it match then this probably wouldn't help a great deal since regular messages fail after about 30 seconds, after which the user manually retries sending them. The to-device messages would therefore either need to retry for (much) longer automatically, or we'd need to automatically retry the to-device message queue when retrying a regular message. At the moment I'm a bit torn on whether to try & retry them maintaining API compatibility or whether to make a new API altogether (the current one is very basic and not very pretty). The main problem is that the current API is a bit fire-and-forget, so we need to avoid to-device messages building up in a queue if we started retrying them without changing the API. |
Can we double-check that we also have reliable to-device messages on mobile. |
Have created equivalent bugs on iOS & Android. |
* Add txn_id support to sliding sync ([\matrix-org#2567](matrix-org#2567)). * Emit an event when the client receives TURN servers ([\matrix-org#2529](matrix-org#2529)). * Add support for stable prefixes for MSC2285 ([\matrix-org#2524](matrix-org#2524)). * Remove stream-replacement ([\matrix-org#2551](matrix-org#2551)). * Add support for sending user-defined encrypted to-device messages ([\matrix-org#2528](matrix-org#2528)). * Retry to-device messages ([\matrix-org#2549](matrix-org#2549)). Fixes element-hq/element-web#12851. * Sliding sync: add missing filters from latest MSC ([\matrix-org#2555](matrix-org#2555)). * Use stable prefixes for MSC3827 ([\matrix-org#2537](matrix-org#2537)). * Fix: Handle parsing of a beacon info event without asset ([\matrix-org#2591](matrix-org#2591)). Fixes element-hq/element-web#23078. * Fix finding event read up to if stable private read receipts is missing ([\matrix-org#2585](matrix-org#2585)). Fixes element-hq/element-web#23027. * Fixed a sliding sync issue where history could be interpreted as live events. ([\matrix-org#2583](matrix-org#2583)). * Don't load the sync accumulator if there's already a sync persist in flight ([\matrix-org#2569](matrix-org#2569)).
Co-authored-by: github-merge-queue <[email protected]> Co-authored-by: Michael Telatynski <[email protected]>
Looking at this series of utter failures to send E2E messages, it doesn't look healthy:
specifically, the 500s never get retried. @uhoreg thinks this may in turn desync and wedge the olm sessions in question.
The text was updated successfully, but these errors were encountered: