This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Reproduceable UISIs during federation failures. #5441

Closed
ara4n opened this issue Jun 12, 2019 · 10 comments
Labels
z-bug (Deprecated Label) z-p2 (Deprecated Label)

Comments

@ara4n
Member

ara4n commented Jun 12, 2019

I've been memory profiling arasphere.net today, and so have been taking the HS down for a few hours at a time. Afterwards, I reliably have UISIs for E2E messages transmitted in rooms whilst the server was offline. In one instance (matrix.org->arasphere.net) they recovered 5-10 mins after the server recovered. In the others (msgs from vector.modular.im and t2l.io) the UISIs never recovered. This feels very reproducible.

The devices on arasphere.net have not changed, so this is not the same as #5095.

I'm giving it its own bug rather than losing it in element-hq/element-web/issues/2996

@ara4n
Member Author

ara4n commented Jun 12, 2019

This might be element-hq/element-web#3754?

@ara4n
Member Author

ara4n commented Jun 24, 2019

I've seen this every time matrix.org has gone down recently (first for the upcloud outage on hestia, and then for today's cloudflare outage)

@JorikSchellekens
Contributor

Did you see this during Friday's outage as well?

@ara4n
Member Author

ara4n commented Jul 9, 2019

erm, can't remember. i wasn't looking for it, given i was mid-pitch, and the outage was 'only' an hour.

@JorikSchellekens
Contributor

JorikSchellekens commented Jul 10, 2019

This seems to be a device list cache problem. It stems from the fact that explicit queries for a remote's device lists do not update Synapse's device_list cache.

  • Assume Alice and Bob share no rooms.
  • When Alice joins Bob in a room Bob will first have to verify Alice's device.
    (The bug is symmetric from this clean start: the same failure occurs for both Alice and Bob if the other's homeserver goes offline. However, if either's homeserver sent a todevice device_list_update after first contact, that user will not be afflicted by UISIs.)
  • In order to do so, Bob's Riot will explicitly query the /_matrix/client/r0/keys/query endpoint in order to get Alice's device keys and begin the verification process.
  • Now Bob's Riot will use these keys to start a new megolm session. Note that the explicit query did not put Alice's keys in Bob's homeserver's device cache. Conversation may continue as normal with Riot using the browser's cached device keys.
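
For reference, the explicit query in the steps above is a POST to /_matrix/client/r0/keys/query. Below is a minimal sketch of the request body as I understand it from the client-server spec (an empty device list for a user asks for all of that user's devices). The helper name and the user ID are made up for illustration.

```python
import json


def build_keys_query_body(user_ids, timeout_ms=10000):
    """Build the JSON body for POST /_matrix/client/r0/keys/query.

    Hypothetical helper: maps each user ID to an empty list, which asks
    the homeserver for every device belonging to that user.
    """
    return {
        "timeout": timeout_ms,
        "device_keys": {user_id: [] for user_id in user_ids},
    }


body = build_keys_query_body(["@alice:remote.example"])
print(json.dumps(body, indent=2))
```

The response carries a `device_keys` map and a separate `failures` map for homeservers that could not be reached.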

The failure case is as follows:

  • Alice's homeserver becomes unreachable. This is not a problem in itself because Bob's browser still has Alice's keys.
  • While Alice's homeserver is down Bob takes a look at Alice's profile.
  • Bob's Riot explicitly queries /_matrix/client/r0/keys/query for Alice's keys, which is a cache miss on Bob's homeserver and a network error on the remote query, so the query returns an empty device list.
  • Bob's Riot assumes Alice's devices have been deleted and removes them.
  • A new megolm session is started in order to invalidate those devices.
  • Alice and their homeserver come back online... UISI!
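
The failure case above can be sketched as a toy model. This is not Synapse or Riot code: every class, method, and key name here is hypothetical, and the only behaviour modelled is "explicit queries never fill the server-side cache, and a network error looks like an empty device list".

```python
class RemoteUnreachable(Exception):
    """Raised when the remote homeserver cannot be contacted."""


class ToyHomeserver:
    """Toy stand-in for Bob's homeserver."""

    def __init__(self, remote_devices):
        self.remote_devices = remote_devices  # what Alice's server would return
        self.remote_up = True
        self.device_cache = {}  # only ever filled by pushed device list updates

    def _federation_query(self, user_id):
        if not self.remote_up:
            raise RemoteUnreachable(user_id)
        return dict(self.remote_devices.get(user_id, {}))

    def keys_query(self, user_id):
        if user_id in self.device_cache:
            return self.device_cache[user_id]
        try:
            # The result is returned but never written to device_cache.
            return self._federation_query(user_id)
        except RemoteUnreachable:
            return {}  # the client just sees "no devices"


class ToyClient:
    """Toy stand-in for Bob's Riot."""

    def __init__(self, homeserver):
        self.homeserver = homeserver
        self.known_devices = {}

    def refresh_devices(self, user_id):
        # Whatever the homeserver returns replaces the local copy.
        self.known_devices[user_id] = self.homeserver.keys_query(user_id)
        return self.known_devices[user_id]


alice = "@alice:remote.example"
hs = ToyHomeserver({alice: {"ALICEDEVICE": "ed25519-key-placeholder"}})
bob = ToyClient(hs)

before = bob.refresh_devices(alice)  # first contact: fetched, not cached server-side
hs.remote_up = False                 # Alice's homeserver goes down
after = bob.refresh_devices(alice)   # Bob opens Alice's profile

print(before)  # {'ALICEDEVICE': 'ed25519-key-placeholder'}
print(after)   # {}
```

Once `after` is empty, the toy client has forgotten Alice's device, which corresponds to the point where a new megolm session is started without it.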

tldr / Steps to reproduce reliably:

  • Run the demo, start an encrypted chat between homeservers and verify all the devices between users.
  • Send a few messages to verify e2e works.
  • Stop a homeserver in the chat.
  • Explicitly look at the device list of a user on that homeserver from another (still active) user in the chat, note that it is now empty.
  • Send some messages from the active user after the lookup while the other is offline.
  • Start the other homeserver.

Sidenote:

I've hit another small, odd behavioral issue here: when you restart the stopped homeserver, no new messages come in on the chat until someone else sends a message. Is this expected behavior? Has anyone else noticed it? Where should I mention this?

Solutions:

We probably just want to cache the results of the federation /_matrix/federation/v1/user/keys/query request that is made while serving the client /_matrix/client/r0/keys/query request. We could also consider sending a device update message to a user when they join a room, but this slows joining a room somewhat. (Mostly Erik's weigh-in.) Ideas? I'd like to open a PR which simply caches the results.
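
A rough sketch of the "simply cache the results" idea, as a toy model rather than a patch against Synapse. All names are hypothetical, and cache invalidation is ignored entirely; a real cache would still need to be kept current by device list updates.

```python
class RemoteUnreachable(Exception):
    """Raised when the remote homeserver cannot be contacted."""


class CachingKeysQuery:
    """Toy keys/query handler that remembers successful remote lookups."""

    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote  # callable: user_id -> dict of device keys
        self.device_cache = {}

    def keys_query(self, user_id):
        if user_id in self.device_cache:
            return self.device_cache[user_id]
        try:
            devices = self.fetch_remote(user_id)
        except RemoteUnreachable:
            return {}  # cold miss while the remote is down: nothing to serve
        # The change being proposed: keep the result of the explicit query.
        self.device_cache[user_id] = devices
        return devices


remote_state = {"up": True}


def fetch_remote(user_id):
    """Hypothetical federation call; fails while the remote is marked down."""
    if not remote_state["up"]:
        raise RemoteUnreachable(user_id)
    return {"ALICEDEVICE": "ed25519-key-placeholder"}


alice = "@alice:remote.example"
handler = CachingKeysQuery(fetch_remote)

first = handler.keys_query(alice)   # remote is up: fetched and cached
remote_state["up"] = False
second = handler.keys_query(alice)  # remote is down: served from the cache

print(second == first)  # True
```

With this in place, a lookup made during an outage returns the last known devices instead of an empty list, so the client has no reason to drop them.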

@richvdh
Member

richvdh commented Jul 17, 2019

[edited #5441 (comment) lightly for formatting and typos. Hope I haven't broken it!]

@richvdh
Member

richvdh commented Jul 17, 2019

@JorikSchellekens just wanted to give a big 👍 for tracking that issue down! A couple of quick thoughts:

  • Are you aware of any ways this might happen without Bob opening Alice's profile? I'd be a bit surprised if that is an action that happens often enough to explain all the issues Matthew is reporting.
  • In your sidenote: "no new messages come in on the chat until someone else sends a message": I'm not quite sure what you mean. Do you mean that messages sent while the server was offline do not appear? If so, yes, that's a known issue: Homeservers don't catch up with missed traffic until someone sends another event #2528.

@JorikSchellekens
Contributor

Thanks @richvdh

  • The fundamental problem is that it will happen any time Riot queries /_matrix/client/r0/keys/query while the other homeserver is down and their devices are not in the cache.

    Now that you've mentioned it, I just figured out a more common case: if a user leaves and rejoins a room while some remote server is down, Riot will clear from its local cache any users for whom that was the only shared room, and will fetch their devices again on joining using /_matrix/client/r0/keys/query. Any message sent by this user will be a UISI for the users on the homeserver that is down, whose devices were not in the cache. That would make this common in much larger rooms.

    Signing in on a new device may trigger it as well; I haven't tested that.

    I'm not aware of how many ways this can happen. On a larger homeserver, with more users in more chats than what I've been testing on, there may be more cases where this occurs. I also think it may be more common for people who've just joined Matrix HQ or similar to look at Matthew's profile than most other people's. However, both of those ideas are just speculation.

  • Yes, Homeservers don't catch up with missed traffic until someone sends another event #2528 was exactly the behavior I was referring to.
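
To make the leave-and-rejoin case from the first point concrete, here is a toy model of client-side device tracking. It is not Riot's actual code; the class, its methods, and the room and user IDs are all hypothetical, and `keys_query` stands in for a /_matrix/client/r0/keys/query call that comes back empty while the remote is down.

```python
class ToyDeviceTracker:
    """Toy model: devices are only kept for users we share a room with."""

    def __init__(self, keys_query):
        self.keys_query = keys_query  # callable: user_id -> dict of devices
        self.rooms = {}               # room_id -> set of member user IDs
        self.devices = {}             # user_id -> dict of devices

    def join(self, room_id, members):
        self.rooms[room_id] = set(members)
        for user_id in members:
            if user_id not in self.devices:
                self.devices[user_id] = self.keys_query(user_id)

    def leave(self, room_id):
        members = self.rooms.pop(room_id)
        still_shared = set().union(*self.rooms.values())
        # Forget users for whom this was the only shared room.
        for user_id in members - still_shared:
            self.devices.pop(user_id, None)

    def recipients(self, room_id):
        return {u: self.devices.get(u, {}) for u in self.rooms[room_id]}


remote_up = {"value": True}


def keys_query(user_id):
    # An unreachable remote shows up as an empty device list.
    return {"ALICEDEVICE": "key-placeholder"} if remote_up["value"] else {}


alice = "@alice:remote.example"
tracker = ToyDeviceTracker(keys_query)
tracker.join("!room:example", [alice])

remote_up["value"] = False              # Alice's homeserver goes down
tracker.leave("!room:example")          # Alice's devices are forgotten
tracker.join("!room:example", [alice])  # re-query returns nothing

print(tracker.recipients("!room:example"))  # {'@alice:remote.example': {}}
```

In this model, anything sent after the rejoin is encrypted for none of Alice's devices, which is what would surface as a UISI on her side.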

@richvdh
Member

richvdh commented Aug 15, 2019

are we assuming this is fixed by #5693?

@richvdh
Member

richvdh commented Nov 25, 2019

let's assume it was fixed by #5693.

@richvdh richvdh closed this as completed Nov 25, 2019