
Investigate why /messages is slow #13356

Open · Tracked by #15182
MadLittleMods opened this issue Jul 22, 2022 · 3 comments

Labels
A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) · A-Performance Performance, both client-facing and admin-facing · T-Other Questions, user support, anything else.

Comments

MadLittleMods (Contributor) commented Jul 22, 2022

Background

For the Matrix Public Archive, we're getting a lot of 504 timeouts waiting for Matrix API requests to finish. Specifically, /messages seems to be the slow culprit, and I haven't really heard anything about optimizing it the way /join has been. From a clean Synapse cache, /messages often takes 80s - 120s to respond, and many times Synapse itself times out.

Here is the average data for #matrix:matrix.org from the gitter.ems.host homeserver, sampled every 61 minutes to be outside of the state cache expiry, for /_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter={"lazy_load_members":true}

  • ?limit=500: 111.81s (492 events / 129 state events)
  • ?limit=100: 119.504s (92 events / 50 state events)
  • ?limit=20: 112.541s (20 events / 11 state events)

After Synapse has a warm cache (I am guessing), these requests take about 5s - 10s.

What can we do to make the /messages requests faster? Has any thought already been put into this? Among existing issues, I see that we might be backfilling when we shouldn't, which aligns with the request being slow regardless of how big the limit is: #10742

Caching the response on our end in the archive doesn't make the request fast in the first place.

From the matrix.org Grafana, I looked into the RoomMessageListRestServlet timing and it all appears very fast (5ms - 500ms), which is not what I experience. I am guessing it's not good at catching the outliers since the metric seems to just be total time divided by samples. But I would expect a lot more 1 second to 3 second timings even after a warm cache.

-- @MadLittleMods, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$7ctkqN00-4HiYP5TT7bnfROfGXP5Lyt7yyWjrg8bGu4?via=matrix.org&via=element.io&via=beeper.com

Traces for reference

These include a bunch of the /messages instrumentation added:

Why is it slow?

Update: We now have a milestone tracking various known slow pieces to improve: https://github.com/matrix-org/synapse/milestone/11

This part is WIP. Just documenting some of the log diving I've done for /messages being slow. I still want to do this on some more requests and hopefully get access to Jaeger to compare and investigate that way too.

1. Backfill linearizer lock takes forever

  • Room: #matrix:matrix.org
  • API URL: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
  • Request -> response duration: 117.808s
  • Logs in Kibana: GET-6890
    • For reference on adjusting the logging config for gitter.ems.host (set DEBUG), see this comment.

Notes:

  • Why does the linearizer lock take ~45s to acquire? (See the sketch at the end of this section.)
  • Fetching the 4.6k backfill points out of the database doesn't seem to take a lot of time, but we don't need to fetch that many in the first place
  • For me on the outside, the request took 117.808s. Where did the rest of the time go?
    • Delay in connecting to the server?
    • Delay in serializing the response?
    • Need to time the requests more precisely, recording when they started and ended, to compare against the Synapse logs

Timing summary from looking at the logs:

  1. 03:42:43.129 (t+0s ): Received request
  2. 03:42:43.712 (t+0s ) Waiting to acquire linearizer lock 'room_backfill'
  3. 03:43:26.806 (t+43s) Acquired linearizer lock 'room_backfill'
  4. 03:43:26.825 (t+43s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
    • get_oldest_event_ids_with_depth_in_room only took .12s to get the 4.6k events
  5. 03:43:28.716 (t+45s) Asking t2bot.io for backfill
  6. 03:43:28.985 (t+45s) t2bot.io responded with 100 events
  7. 03:43:29.009 (t+46s) Starting to verify content hashes
  8. 03:43:29.105 (t+46s) Done verifying content hashes
  9. 03:43:29.106 (t+46s) Processing the pulled events
  10. 03:43:29.147 (t+46s) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 03:43:29.348 (t+46s) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 03:43:34.026 (t+51s) _get_state_groups_from_groups start
  13. 03:43:38.470 (t+55s) _get_state_groups_from_groups end
  14. 03:43:41.112 (t+58s) Response sent

This isn't a fluke; here is another request whose logs I went through (GET-196). This time the duration matched the Synapse logs pretty well:

  1. 20:38:38.049 (t+0s ) Received request
  2. 20:38:38.062 (t+0s ) Waiting to acquire linearizer lock 'room_backfill'
  3. 20:39:23.622 (t+45s) Acquired linearizer lock 'room_backfill'
  4. 20:39:23.640 (t+45s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
  5. 20:39:25.625 (t+47s) Asking t2bot.io for backfill
  6. 20:39:35.262 (t+57s) t2bot.io responded with 100 events
  7. 20:39:35.283 (t+...) Starting to verify content hashes
  8. 20:39:35.382 (t+...) Done verifying content hashes
  9. 20:39:35.383 (t+...) Processing the pulled events
  10. 20:39:35.424 (t+...) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 20:39:35.651 (t+...) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 20:39:43.668 (t+65s) Response sent
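
To illustrate the lock behaviour above, here is a minimal sketch of a per-key linearizer in JavaScript (matching the timing script further down; this is an illustration of the concept, not Synapse's actual Python Linearizer):

// Each task for a given key is chained behind the previous one, so a slow
// backfill for a room delays every later /messages request for that same room.
class Linearizer {
  constructor() {
    // Maps a key (e.g. a room ID) to the tail of its queue of pending work.
    this.tails = new Map();
  }

  withLock(key, task) {
    const previous = this.tails.get(key) ?? Promise.resolve();
    // Chain this task behind whatever is already queued for this key, swallowing
    // earlier failures so one failed backfill doesn't wedge the queue.
    const current = previous.catch(() => {}).then(() => task());
    this.tails.set(key, current);
    return current;
  }
}

// Hypothetical usage: concurrent /messages requests for the same room all
// serialize here, which is why "Waiting to acquire linearizer lock
// 'room_backfill'" can account for ~45s of a request.
const backfillLinearizer = new Linearizer();
function maybeBackfill(roomId) {
  return backfillLinearizer.withLock(roomId, async () => {
    // ...ask another homeserver for /backfill and process the pulled events...
  });
}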

2. Loading tons of events

  • Room: #matrix:matrix.org
  • API URL: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
  • Request -> response duration: ~60s
  • Logs in Kibana: GET-5541

Notes:

  • Tons of get_aggregation_groups_for_event and get_recent_references_for_event calls (2k). Why?
  • Why did we load 79k events?
    • This lines up with the number of state_events in the room
  • Why did we redact some events? Is this redacting events in our response or in general?
  • Why didn't these things happen in the other calls above?

Timing summary from looking at the logs:

  1. 02:09:51.026 (t+0s ) Received request
  2. 02:09:51.959 (t+1s ) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
  3. 02:09:52.726 (t+2s) synapse.storage.databases.main.events_worker Loading 79277 events (why?)
  4. 02:10:10.462 (t+19s) Also fetching redaction events
  5. 02:10:10.462 (t+19s) Loaded 79277 events
  6. 02:10:23.246 (t+31s) Done redacting 105 events (why?)
  7. 02:10:23.779 (t+31s) Asking t2bot.io for backfill
  8. 02:10:33.290 (t+41s) t2bot.io responded with 100 events
  9. ... (t+...) 2k calls to get_recent_references_for_event and get_aggregation_groups_for_event (why?)
  10. 02:10:54.261 (t+63s) Response sent

3. Slow /state_ids requests but also really slow processing

524 timeout: 123.26s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=ss4cm
524 timeout
2022-07-22 @4-43

123.26s
https://jaeger.proxy.matrix.org/trace/7c4b7fe54ba6f5ab

  • 19.24s /backfill request
  • 53.45s /state_ids

524 timeout: 117.96s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=0p14c
524 timeout
2022-07-22 @4:56

117.96s
https://jaeger.proxy.matrix.org/trace/e67f019385c47fd9

  • 921ms /backfill request
  • 47.24s /state_ids request

524 timeout: 115.64s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=p8c09g
524 timeout
2022-7-22 @5:02:33

{
    "ok": false,
    "timing": 99185.70000004768,
    "startTime": 1658527353492,
    "endTime": 1658527452680,
    "requestId": "p8c09g"
}

115.64s
https://jaeger.proxy.matrix.org/trace/ef47a44ea445b3ae

  • 14.49s /backfill request
  • 54.47s /state_ids request

200 ok: 83.51s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw
200 ok

{
    "ok": true,
    "status": 200,
    "timing": 83771.90000009537,
    "numEventsReturned": 491,
    "numStateEventsReturned": 129,
    "startTime": 1658527887072,
    "endTime": 1658527970847,
    "requestId": "cjfpw",
    "traceId": "ae7c694e57113282"
}

83.51s
https://jaeger.proxy.matrix.org/trace/ae7c694e57113282

  • 526ms /backfill request
  • 42.77s /state_ids request

200 ok: 75.7s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=wg6g8k
2022-07-22 @5-27pm

{
    "ok": true,
    "status": 200,
    "timing": 76131.40000009537,
    "numEventsReturned": 491,
    "numStateEventsReturned": 129,
    "startTime": 1658528812471,
    "endTime": 1658528888604,
    "requestId": "wg6g8k"
    "traceId": "d048d9f59c20555c"
}

75.7s
https://jaeger.proxy.matrix.org/trace/d048d9f59c20555c

  • 549ms /backfill request
  • 32.58s /state_ids request

/messages timing script

Every 61 minutes (to be outside of the state cache expiry), the script calls and times /messages for each room defined. It does this with ?limit=500, 100, and 20 to see whether the number of messages affects the timing.

Run in the browser. Define let MATRIX_TOKEN = 'xxx'; before running the script.

matrix-messages-timing.js
let resultantTimingData = {};
  
async function fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch) {
  // via https://stackoverflow.com/a/8084248/796832
  const requestId = (Math.random() + 1).toString(36).substring(7);

  let res;
  let timeBeforeRes;
  let timeAfterRes;
  let startTime;
  let endTime;
  try {
    startTime = Date.now();
    timeBeforeRes = performance.now();
    res = await fetch(`https://matrix-client.matrix.org/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=${numMessagesToFetch}&filter=%7B%22lazy_load_members%22:true%7D&foo=${requestId}`, {
      headers: {
        'Authorization': `Bearer ${MATRIX_TOKEN}`
      }
    });
  } catch(err) {
    console.error(`Error getting result for ${roomId} at ${numMessagesToFetch} messages`, err);
  } finally {
    endTime = Date.now();
    timeAfterRes = performance.now();
  }
  
  // Synapse returns its trace ID in a response header so we can find this request in Jaeger
  const traceId = res?.headers?.get('synapse-trace-id');
  let data = {};
  if (res?.ok) {
    data = await res.json();
  }

  const timingEntry = {
    ok: res?.ok || false,
    status: res?.status,
    timing: timeAfterRes - timeBeforeRes,
    numEventsReturned: data?.chunk?.length,
    numStateEventsReturned: data?.state?.length,
    startTime,
    endTime,
    requestId,
    traceId
  };

  return timingEntry;
}

(async () => {
  const ROOMS_TO_TEST = {
    'matrix:matrix.org': '!OGEhHVWSdvArJzumhm:matrix.org',
    'openwisp/general': '!RBzfoBeqYcCwLAAenz:gitter.im',
    'ethereum/solidity-dev': '!poXqlbVpQfXKWGseLY:gitter.im',
    'matrix-org/gitter': '!OvgDmYaPOFRRWnIdQe:half-shot.uk',
    'MadLittleMods/new-room1': '!aMzLHLvScQCGKDNqCB:gitter.im'
  };

  const NUM_MESSAGES_TO_FETCH_MATRIX = [
    500,
    100,
    20
  ];

  const ONE_MINUTE_MS = 60 * 1000;
  // The `_state_cache` in Synapse has 60 minute expiry so we chose something just bigger
  const DELAY_BETWEEN_TRIALS = 61 * ONE_MINUTE_MS;

  const REPEAT = 5;

  if (typeof MATRIX_TOKEN === 'undefined' || !MATRIX_TOKEN) {
    console.error('MATRIX_TOKEN is not defined. Please define this before running this script');
    return;
  }

  for (let i = 0; i < REPEAT; i++) {
    for (let numMessagesToFetch of NUM_MESSAGES_TO_FETCH_MATRIX) {
      for (let roomKey of Object.keys(ROOMS_TO_TEST)) {
        const roomId = ROOMS_TO_TEST[roomKey];

        const timingEntry = await fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch);


        // Default the result map so we don't run into null-pointer exception (NPE) issues
        if (!resultantTimingData[roomKey]) {
          resultantTimingData[roomKey] = {};
        }

        if (!resultantTimingData[roomKey][numMessagesToFetch]) {
          resultantTimingData[roomKey][numMessagesToFetch] = [];
        }

        resultantTimingData[roomKey][numMessagesToFetch].push(timingEntry);

        console.log(`New result for ${roomKey} at ${numMessagesToFetch} messages:`, timingEntry);
      }

      // Wait for the caches to clear out before running another trial against the same room
      await new Promise((resolve) => {
        setTimeout(resolve, DELAY_BETWEEN_TRIALS);
      });
    }
  }
})();

// Average the timings of the successful requests for a given room and ?limit value
function getAvg(roomId, numMessages) {
    const okTimingEntries = resultantTimingData[roomId][numMessages]
        .filter((timingEntry) => timingEntry.ok);
    return okTimingEntries
        .reduce((total, timingEntry) => (total + timingEntry.timing), 0) / okTimingEntries.length;
}

getAvg('matrix:matrix.org', "500")
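
For convenience, here is a small extension (not part of the original script) that prints the average for every room and ?limit value collected in resultantTimingData above:

// Print a summary of the average timing for every room / ?limit combination
// recorded by the timing script, using getAvg() defined above.
function printAverages() {
  for (const roomKey of Object.keys(resultantTimingData)) {
    for (const numMessages of Object.keys(resultantTimingData[roomKey])) {
      console.log(`${roomKey} ?limit=${numMessages}: ${Math.round(getAvg(roomKey, numMessages))}ms`);
    }
  }
}

printAverages();
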
MadLittleMods added a commit that referenced this issue Aug 16, 2022
The after timing can be influenced by slow `/state_ids` requests
when processing the pulled events.

Part of #13356
MadLittleMods added a commit that referenced this issue Aug 16, 2022
Instrument the federation/backfill part of `/messages` so it's easier to follow what's going on in Jaeger when viewing a trace.

Split out from #13440

Follow-up from #13368

Part of #13356
MadLittleMods added a commit that referenced this issue Aug 16, 2022
MadLittleMods added a commit that referenced this issue Aug 17, 2022
MadLittleMods added a commit that referenced this issue Aug 18, 2022
Track number of hosts affected by the rate limiter so we can differentiate one really noisy homeserver from a general ratelimit tuning problem across the federation.

Follow-up to #13534

Part of #13356
MadLittleMods added a commit that referenced this issue Aug 24, 2022
…ild concurrent calls (#13588)

Instrument `_check_sigs_and_hash_and_fetch` to trace time spent in child concurrent calls because I've seen `_check_sigs_and_hash_and_fetch` take [10.41s to process 100 events](#13587)

Fix #13587

Part of #13356
MadLittleMods added a commit that referenced this issue Sep 14, 2022
We can follow-up this PR with:

 1. Only try to backfill from an event if we haven't tried recently -> #13622
 2. When we decide to backfill that event again, process it in the background so it doesn't block and make `/messages` slow when we know it will probably fail again -> #13623
 3. Generally track failures everywhere we try and fail to pull an event over federation -> #13700

Fix #13621

Part of #13356

Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.qv7cj51sv9i5)
MadLittleMods added a commit that referenced this issue Sep 15, 2022
MadLittleMods added a commit that referenced this issue Sep 23, 2022
Only try to backfill an event if we haven't tried recently (exponential backoff). No need to keep trying the same backfill point that fails over and over.

Fix #13622
Fix #8451

Follow-up to #13589

Part of #13356
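
A minimal sketch of the backoff idea in JavaScript (for illustration only; the real change lives in Synapse's Python storage layer, and the names and base interval here are made up):

// Skip a backfill attempt for an event if its last failure is more recent than
// the current backoff window, doubling the window per consecutive failure.
const BACKFILL_BACKOFF_BASE_MS = 60 * 1000; // illustrative base interval

const backfillFailureCounts = new Map(); // eventId -> number of consecutive failures
const backfillLastFailedAt = new Map();  // eventId -> timestamp of the last failure

function shouldRetryBackfill(eventId, now = Date.now()) {
  const failures = backfillFailureCounts.get(eventId) ?? 0;
  if (failures === 0) return true;
  // 1 min, 2 min, 4 min, ... between attempts
  const backoffMs = BACKFILL_BACKOFF_BASE_MS * 2 ** (failures - 1);
  return now - (backfillLastFailedAt.get(eventId) ?? 0) >= backoffMs;
}

function recordBackfillFailure(eventId, now = Date.now()) {
  backfillFailureCounts.set(eventId, (backfillFailureCounts.get(eventId) ?? 0) + 1);
  backfillLastFailedAt.set(eventId, now);
}
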
MadLittleMods added a commit that referenced this issue Sep 28, 2022
…se (#13879)

There is no need to grab thousands of backfill points when we only need 5 to make the `/backfill` request with. We need to grab a few extra in case the first few aren't visible in the history.

Previously, we grabbed thousands of backfill points from the database, then sorted and filtered them in the app. Fetching the 4.6k backfill points for `#matrix:matrix.org` from the database takes ~50ms - ~570ms so it's not like this saves a lot of time 🤷. But it might save us more time now that `get_backfill_points_in_room`/`get_insertion_event_backward_extremities_in_room` are more complicated after #13635 

This PR moves the filtering and limiting to the SQL query so we just have less data to work with in the first place.

Part of #13356
MadLittleMods added a commit that referenced this issue Oct 3, 2022
Because we're doing the recording in `_check_sigs_and_hash_for_pulled_events_and_fetch` (previously named `_check_sigs_and_hash_and_fetch`), this means we will track signature failures for `backfill`, `get_room_state`, `get_event_auth`, and `get_missing_events` (all pulled event scenarios). And we also record signature failures from `get_pdu`.

Part of #13700

Part of #13676 and #13356

This PR will be especially important for #13816 so we can avoid the costly `_get_state_ids_after_missing_prev_event` down the line when `/messages` calls backfill.
MadLittleMods (Contributor, Author)

Progress

We've made some good progress on these slow 180s-type requests. You can see the big drop in these long-running requests around Sept. 26 after we shipped #13635 (made possible by all of the PRs before it). This metric can be influenced by other servers being slow to respond to /backfill, /state_ids, /state, and /event over federation, so it's not like we can eliminate all slow requests, but it seems obvious that they happen a lot less now.

https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=1663806404570&to=1665016004570&viewPanel=230
Grafana graph showing: Of requests that take more than 5s, what percentage take 180s or longer

Grafana graph showing: Of requests that take more than 5s, what percentage take 120s or longer

Grafana graph showing: Of requests that take more than 5s, what percentage take 60s or longer

The effects don't really show for RoomMessageListRestServlet overall. Perhaps because most requests have a low ?limit, don't have to backfill, or are paginating forwards instead of backwards, etc. Another influencing factor is that I have only focused on the big requests that took 180s+ before, which now take around 30s with the optimizations, and this isn't within range of the Apdex thresholds (2.5s satisfied, 10s tolerable) to show progress. I tried playing with those thresholds but didn't see much. Too many normal fast requests 🤷

What's left?

/messages is still nowhere close to fast enough: the benchmark request I have been running, which used to take 180s, still takes 30s. A great goal would be to get all of these down to 5s or less.

Here is the latest trace now that we're down in the ~30s range: https://gist.github.com/MadLittleMods/c3ace249ce2aec77cb78d19a4f85a776

~30s /messages trace in Jaeger

As you can see, we need to:

  1. Speed up filter_events_for_client (this takes 7.83s in that request): filter_events_for_client is time-consuming and expensive #14108
  2. Speed up get_bundled_aggregations (this takes 3.84s in that request): get_bundled_aggregations takes a few seconds even when there is no relations #13624
  3. Speed up the maybe_backfill part (this takes 15.29s in that request):
    • Speed up _process_pulled_events in backfill (this takes 5.15s in that request), Process previously failed backfill events in the background #13623
    • 7.24s is from the /backfill outgoing federation request which we have less control over unless we can figure out that we don't actually need to backfill in that situation.
    • That still leaves 2.9s of random stuff which should surely be able to go away 🤞

If we can get rid of the time in the bolded numbers, that saves us ~20s and gets us down to 10s /messages requests. We will have to look at the traces again once we get /messages down to the 10s mark.

MadLittleMods added a commit that referenced this issue Oct 15, 2022
…ure is invalid (#13816)

While #13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place.

Related to
 - #13622
    - #13635
 - #13676

Part of #13356

Follow-up to #13815 which tracks event signature failures.

With this PR, we avoid the call to the costly `_get_state_ids_after_missing_prev_event` because the signature failure will count as an attempt before and we filter events based on the backoff before calling `_get_state_ids_after_missing_prev_event` now.

For example, this will save us 156s out of the 185s total for this `matrix.org` `/messages` request. If you want to see the full Jaeger trace of this, you can drag and drop this `trace.json` into your own Jaeger: https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761

To explain this exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signature for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid.

 - `backfill`
    - `outgoing-federation-request` `/backfill`
    - `_check_sigs_and_hash_and_fetch`
       - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill
          - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
          - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
   - `_process_pulled_events`
      - `_process_pulled_event` for each validated event
         - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it
            - `_get_state_ids_after_missing_prev_event`
               - `outgoing-federation-request` `/state_ids`
               - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again
               - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check
MadLittleMods added a commit that referenced this issue Nov 22, 2022
Seems like it will make it an order of magnitude faster.

Fix #14108

Part of #13356 (comment)
MadLittleMods added a commit that referenced this issue May 12, 2023
Fix #13623

Follow-up to #13621 and #13622

Part of making `/messages` faster: #13356
MadLittleMods added a commit that referenced this issue May 19, 2023
…#15610)

Instrument `state` and `state_group` storage related things (tracing) so it's a little more clear where these database transactions are coming from as there is a lot of wires crossing in these functions.

Part of `/messages` performance investigation: #13356
MadLittleMods added a commit that referenced this issue May 25, 2023
Process previously failed backfill events in the background, since they are bound to fail again and we don't need to waste time holding up the request for them.

Fix #13623

Follow-up to #13621 and #13622

Part of making `/messages` faster: #13356
MadLittleMods (Contributor, Author)

Progress report

The benchmark /messages?dir=b&limit=500 request is down to ~4s now (we originally started at over 185s).

Full Jaeger trace JSON

Benchmark request: https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D

What's left?

The big things:

  1. get_state_for_events/get_state_ids_for_event -> _get_state_for_groups -> _get_state_groups_from_groups took 1.94s in that request
  2. paginate_room_events -> get_events_as_list took 551ms in that request
    • ⭐ Seems like lots of individual lookups that we could simplify

The smaller things:

  • get_bundled_aggregations took 200ms in that request
  • get_state_for_events/get_state_ids_for_event -> get_state_group_for_events took 229ms in that request
  • get_state_for_events/get_state_ids_for_event -> _get_state_for_groups_using_cache took 137ms in that request

But all of that notable stuff only totals 3057ms (1.94s + 551ms + 200ms + 229ms + 137ms), so there is still 373ms of random baggage.


In terms of waterfalls:

  • We could save 490ms by running get_state_ids_for_event in parallel with filter_events_for_client since it only needs state at the first event. And we don't need to worry about whether the first event would have been filtered because the state just builds on top. (See the sketch below.)
    • ⭐ Can we instead just re-use the work of getting all the state out for filter_events_for_client when we try to get_state_ids_for_event at the end to return in the /messages response? That would eliminate the 490ms lookup.
  • ⭐ We could save 200ms by running get_bundled_aggregations in parallel with filter_events_for_client. We might grab a few more aggregations but we can always filter them down.
  • We could save 30ms by removing the ignored_users -> are_users_erased sequential call waterfall from filter_events_for_client
    • This sometimes runs fast and is negligible so I'm not sure the overhead of running in parallel is worth it.
  • ⭐ Instead of waiting for us to process the whole 500 events before moving on to the next step, could we break it up into a pipeline and stream results through as we go (perhaps 100 at a time)?

There is a 160ms mystery gap after encode_json_response
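
As a rough sketch of collapsing that waterfall (JavaScript for illustration; the function names are placeholders for the Synapse internals mentioned above, not a real API):

// Kick off the independent lookups together instead of awaiting them one after
// another: the event filtering, the state-IDs lookup (which only needs state at
// the first event), and the bundled aggregations.
async function buildMessagesResponse(events) {
  const [filteredEvents, stateIds, aggregations] = await Promise.all([
    filterEventsForClient(events),   // placeholder for filter_events_for_client
    getStateIdsForEvent(events[0]),  // placeholder; avoids the ~490ms sequential wait
    getBundledAggregations(events),  // placeholder; avoids the ~200ms sequential wait
  ]);
  // Aggregations fetched for events that were filtered out can be dropped here.
  return { chunk: filteredEvents, state: stateIds, aggregations };
}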

What's achievable?

It feels doable to reduce the response time by another 1s (I've starred some things above). For example, if we get another /messages response that looks like this trace, it could get us down to that target 1.85s mark where we can say "we've improved /messages by 100x".
