server-key fetching logic is slow and queue-bound #3825

richvdh · 2018-09-07T14:21:11Z

There are a number of problems with the logic in keyring.py for fetching server keys:

- Each request that needs a key for a given server gets queued up, and it's possible to end up with quite a long queue for a given server. If the lookup is successful, that's ok. However, if it fails (which may take many minutes while we wait for timeouts), then we try again for each request in the queue, so we can rapidly end up getting very badly behind. When we want key X for server Y, if there is already a request in the queue for that key, then we should just use the results from it, even if it fails.
- Relatedly, the queueing logic might never complete. If a given request wants keys from server A and server B, and a lookup is already in progress for A, it waits for that to complete. By that time, another request might be doing a lookup for B, so it waits for that to complete. Then we might be waiting for A again, and so on. We should immediately start lookups for those servers which aren't already in progress, rather than waiting for the complete set.
- See also "store_server_verify_keys shouldn't need to lock the table" (#3819) and "get_keys_from_store should do one big lookup, not hundreds of tiny ones" (#3818).
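The two suggestions above (share the result of an in-flight lookup, failure included, and start lookups for idle servers immediately) could be sketched roughly as follows. This is an illustration only, not the code in keyring.py: it uses `asyncio` rather than whatever Synapse actually uses, and `InFlightKeyFetcher`, `get_keys`, and the `fetch` callback are hypothetical names invented for the example. Cancellation handling is left out.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple

# Hypothetical types for illustration: a key is identified by (server_name, key_id).
KeyRef = Tuple[str, str]
Fetcher = Callable[[str, str], Awaitable[str]]


class InFlightKeyFetcher:
    """Shares one lookup per (server, key_id) among all concurrent callers."""

    def __init__(self, fetch: Fetcher) -> None:
        self._fetch = fetch  # assumed coroutine function doing the real lookup
        self._in_flight: Dict[KeyRef, asyncio.Future] = {}

    def _lookup(self, ref: KeyRef) -> asyncio.Future:
        # Join an existing lookup if there is one, so a failure is reported
        # to every waiter once instead of being retried per queued request.
        existing = self._in_flight.get(ref)
        if existing is not None:
            return existing
        task = asyncio.ensure_future(self._fetch(*ref))
        self._in_flight[ref] = task
        # Forget the lookup once it finishes, whether it succeeded or failed.
        task.add_done_callback(lambda _: self._in_flight.pop(ref, None))
        return task

    async def get_keys(self, refs: Iterable[KeyRef]) -> Dict[KeyRef, object]:
        refs = list(refs)
        # Start or join every lookup up front, then wait for all of them.
        # Nothing waits for server A before starting the lookup for server B.
        futures = [self._lookup(ref) for ref in refs]
        results = await asyncio.gather(*futures, return_exceptions=True)
        # Each value is either the fetched key or the exception that was raised.
        return dict(zip(refs, results))
```

With this shape, two requests that both need a key from a server that is timing out trigger a single fetch, and both see the same exception when it fails.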


richvdh · 2019-05-24T13:54:01Z

> relatedly, the queueing logic might never complete.

This is a huge problem while we are joining a room, and is a huge contributor to #1211. In particular:

1. we do a send_join
2. servers start sending us federation transactions, which means we need to fetch their keys, so we take out key-fetch locks for those servers
3. we try to verify the results of the send join, so have to fetch hundreds of keys. Some of those servers are already locked due to the above, so we wait
4. more transactions arrive from other servers (cf "massive storm of presence EDUs when joining a large room" #3120) so we lock those key lookups.
5. The first key lookups complete; go back to 3.

benbz · 2020-12-07T12:03:51Z

This (really the tight looping of "Waiting for existing lookups" logging in #5435) has come up several times in the past couple of months. When an HS hits this, it is effectively unresponsive until it gets restarted.

This was referenced Sep 7, 2018

matrix.org had the wrong value for a signing key #3807

Closed

Joining rooms over federation can be very slow (SYN-293) #1211

Closed

hawkowl added the federation-meltdown label Sep 14, 2018

neilisfragile added the z-p2 (Deprecated Label) label Oct 5, 2018

richvdh added the A-Performance Performance, both client-facing and admin-facing label Feb 27, 2019

richvdh mentioned this issue Jun 12, 2019

wait_for_previous_lookups loops a lot #5435

Closed

hawkowl added the z-outbound-federation-meltdown (Deprecated Label) Synapse melting down by trying to talk to too many servers label Jul 11, 2019

erikjohnston mentioned this issue May 21, 2021

Rewrite the KeyRing #10035

Merged

erikjohnston closed this as completed in #10035 Jun 2, 2021

DMRobertson added A-Federation and removed z-federation-meltdown z-outbound-federation-meltdown (Deprecated Label) Synapse melting down by trying to talk to too many servers labels Aug 25, 2022
