Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

One time key counts differ between /sync and /keys/upload #10766

Open
deepbluev7 opened this issue Sep 5, 2021 · 5 comments
Open

One time key counts differ between /sync and /keys/upload #10766

deepbluev7 opened this issue Sep 5, 2021 · 5 comments
Labels
A-E2EE End-to-end encryption for Matrix clients S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. z-cross-signing (Deprecated Label)

Comments

@deepbluev7
Copy link
Contributor

Description

I noticed that after sending an encrypted in a large E2EE room (#e2ee-test:grin.hu) my key count responses got out of sync (might have happened earlier). On sync I got a count of 55, while when doing an empty key upload request returned a count of 33. Does synapse cache the key count for /sync somehow and that got out of sync?

After uploading some keys, the counts synchronized themselves to 50, so I am assuming the ones in /sync were wrong.

Steps to reproduce

  • after sending some e2ee messages, check what an empty key upload response returns vs the count in /sync.

Version information

  • Homeserver: neko.dev

If not matrix.org:

  • Version: 1.42.0rc1

  • Install method: ebuild

  • Platform: Gentoo
@anoadragon453
Copy link
Member

The count of OTKs retrieved by /sync is in fact cached at the storage layer. However, the same exact cache is also used for the /keys/upload/{deviceId} endpoint.

Are you making use of Synapse worker mode? Perhaps the cache is different across multiple workers.

@anoadragon453 anoadragon453 added the X-Needs-Info This issue is blocked awaiting information from the reporter label Sep 9, 2021
@deepbluev7
Copy link
Contributor Author

I do indeed use synapse in worker mode. And sync is forwarded to a different worker than keys/upload, which is intended as such if I understand the workers documentation correctly.

@anoadragon453
Copy link
Member

Indeed, that is a valid configuration. The storage method I'm talking about in particular is count_e2e_one_time_keys, which does look to be correctly invalidated when keys are added/deleted, and those changes should be streamed out to other workers. So perhaps this is a red herring.

Incremental /sync responses themselves are cached if the same since_token is provided by the client. It sounds somewhat farfetched, but it is possible that your client was sending the same since_token for /sync multiple times? Do you have server and/or client logs of the network activity to check?

@deepbluev7
Copy link
Contributor Author

I don't have any logs for that time period anymore, but this happened across multiple restarts in between of which I received messages. /sync returned the same value after calling upload without any actual keys multiple times (to query the key count on startup to workaround older servers not sending a keycount, if it is 0). So keys weren't added between the different sync and upload calls (until I changed something in our code).

One possible cause I could think of, was that I sent a message in a large, encrypted room. I got around 20 key requests at least at the same time and started uploading keys after the first received ones. So there could potentially have been a race there, where the invalidation happened, a new count was calculated, the invalidation was cleared, but in the mean time the count changed again.

@anoadragon453 anoadragon453 removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Sep 10, 2021
@anoadragon453 anoadragon453 changed the title one time key counts can get out of sync One time key counts differ between /sync and /keys/upload Sep 13, 2021
@anoadragon453 anoadragon453 added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Sep 13, 2021
@anoadragon453
Copy link
Member

I haven't heard of many races involving cache invalidation before in Synapse, but it's certainly a possibility.

I think the first step towards solving this would be for someone to replicate the behaviour in either a Synapse unit test or via Sytest or Complement test. Given it's potentially a race (and possibly one between workers), Sytest or Complement would probably be best for having the behaviour emerge. Note that worker support in Complement is still in progress.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-E2EE End-to-end encryption for Matrix clients S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. z-cross-signing (Deprecated Label)
Projects
None yet
Development

No branches or pull requests

3 participants