Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

[500] Internal Server Error when trying to leave room #15967

Closed
sgofferj opened this issue Jul 20, 2023 · 11 comments
Closed

[500] Internal Server Error when trying to leave room #15967

sgofferj opened this issue Jul 20, 2023 · 11 comments
Labels
A-Leave Leaving a room O-Occasional Affects or can be seen by some users regularly or most users rarely S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@sgofferj
Copy link

Description

I'm trying to leave the matrix HQ room. Element showed me first an error about a missing create event. Now when I try again, I get a [500] Internal Server error.

Steps to reproduce

  • Trying to leave Matrix HQ room

Homeserver

matrix.gofferje.net

Synapse Version

{"server_version":"1.87.0","python_version":"3.11.4"}

Installation Method

Docker (matrixdotorg/synapse)

Database

Postgres, single server, clean setup, no backup

Workers

Single process

Platform

Oracle cloud Ampere instance, Oracle Linux, Docker version 23.0.2, build 569dd73

Configuration

  • No experimental features
  • No presence
  • Message retention:
retention:
  enabled: true

  default_policy:
    min_lifetime: 1d
    max_lifetime: 30d

  allowed_lifetime_min: 1d
  allowed_lifetime_max: 1y

  purge_jobs:
    - longest_max_lifetime: 3d
      interval: 6h
    - shortest_max_lifetime: 3d
      interval: 1d

  media_retention:
    local_media_lifetime: 90d
    remote_media_lifetime: 14d
  • No modules

Relevant log output

2023-07-20 08:11:41,767 - synapse.http.server - 133 - ERROR - POST-441428 - Failed handle request via 'RoomMembershipRestServlet': <XForwardedForRequest at 0xfffd830c8f50 method='POST' uri='/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm%3Amatrix.org/leave' clientproto='HTTP/1.1' site='8008'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/synapse/http/server.py", line 319, in _async_render_wrapper
    callback_return = await self._async_render(request)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/http/server.py", line 529, in _async_render
    callback_return = await raw_callback_return
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/rest/client/room.py", line 1072, in on_POST
    return await self._do(request, requester, room_id, membership_action, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/rest/client/room.py", line 1045, in _do
    await self.room_member_handler.update_membership(
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/room_member.py", line 642, in update_membership
    result = await self.update_membership_locked(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/room_member.py", line 1103, in update_membership_locked
    return await self._local_membership_update(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/room_member.py", line 500, in _local_membership_update
    await self.event_creation_handler.handle_new_client_event(
  File "/usr/local/lib/python3.11/site-packages/synapse/util/metrics.py", line 113, in measured_func
    r = await func(self, *args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/message.py", line 1424, in handle_new_client_event
    result, _ = await make_deferred_yieldable(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/message.py", line 1503, in _persist_events
    event = await self.persist_and_notify_client_events(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/message.py", line 1852, in persist_and_notify_client_events
    ) = await self._storage_controllers.persistence.persist_events(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 413, in persist_events
    ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/util/async_helpers.py", line 304, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 408, in enqueue
    return await self._event_persist_queue.add_to_queue(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 239, in add_to_queue
    res = await make_deferred_yieldable(end_item.deferred.observe())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 282, in handle_queue_loop
    ret = await self._per_item_callback(room_id, item.task)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 354, in _process_event_persist_queue_task
    return await self._persist_event_batch(room_id, task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 672, in _persist_event_batch
    res = await self._get_new_state_after_events(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 874, in _get_new_state_after_events
    old_state_groups = {
                       ^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 875, in <setcomp>
    event_id_to_state_group[evid] for evid in old_latest_event_ids
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
KeyError: '$KJR_JWuneGQDpB2A3LTLT3gGkpVJqmATd8gLlthYOP4'

Anything else that would be useful to know?

No response

@sgofferj sgofferj changed the title Error when trying to leave room [500] Internal Server Error when trying to leave room Jul 20, 2023
@sgofferj
Copy link
Author

Got this now...
Screenshot_20230720_202852

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 285, in on_receive_pdu
    await self._get_missing_events_for_pdu(
  File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 805, in _get_missing_events_for_pdu
    await self._process_pulled_events(origin, missing_events, backfilled=False)
  File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 938, in _process_pulled_events
    await _process_new_pulled_events(fresh_events)
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 895, in _process_new_pulled_events
    await self._process_pulled_event(origin, ev, backfilled=backfilled)
  File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 1001, in _process_pulled_event
    await self._process_received_pdu(
  File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 1442, in _process_received_pdu
    await self._run_push_actions_and_persist_event(event, context, backfilled)
  File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 2217, in _run_push_actions_and_persist_event
    await self.persist_events_and_notify(
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 2278, in persist_events_and_notify
    ) = await self._storage_controllers.persistence.persist_events(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 413, in persist_events
    ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/util/async_helpers.py", line 304, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 408, in enqueue
    return await self._event_persist_queue.add_to_queue(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 239, in add_to_queue
    res = await make_deferred_yieldable(end_item.deferred.observe())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 282, in handle_queue_loop
    ret = await self._per_item_callback(room_id, item.task)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 354, in _process_event_persist_queue_task
    return await self._persist_event_batch(room_id, task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 672, in _persist_event_batch
    res = await self._get_new_state_after_events(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 874, in _get_new_state_after_events
    old_state_groups = {
                       ^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 875, in <setcomp>
    event_id_to_state_group[evid] for evid in old_latest_event_ids
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
KeyError: '$KJR_JWuneGQDpB2A3LTLT3gGkpVJqmATd8gLlthYOP4'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/synapse/federation/federation_server.py", line 1239, in _process_incoming_pdus_in_room_inner
    await self._federation_event_handler.on_receive_pdu(
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 289, in on_receive_pdu
    raise Exception(
Exception: Error fetching missing prev_events for $ZZ3OgEyNO1ZiCVyDiL5syuRBu4ld4RY2LfSIfXuAx60: '$KJR_JWuneGQDpB2A3LTLT3gGkpVJqmATd8gLlthYOP4'

@reivilibre
Copy link
Contributor

Please could you check if the room is partial-stated by running this SQL in your Postgres database and saying whether there's a row there?

SELECT * FROM partial_state_rooms WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';

@sgofferj
Copy link
Author

sgofferj commented Jul 22, 2023

Successfully run. Total query runtime: 160 msec.
0 rows affected.

That db size is also getting out of hand. It's at 50GB now and besides that Matrix HQ room, I have 4 local users (2 bots, 2 humans) and 9 very low volume rooms on the server.

@K0HAX
Copy link

K0HAX commented Jul 25, 2023

I'm having the same issue, with the Synapse Admins room #synapse:matrix.org. I also get an effectively identical error.
Also, I tried joining the room using a different user account on my server, and while the matrix.org server received the join message, my server failed to allow the client to join, and now that user (@michaeltest2023072501:productionservers.net) is stuck in that room.
I also can not send messages to that room from my normal account (@michael:productionservers.net)

@MatMaul MatMaul added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. O-Occasional Affects or can be seen by some users regularly or most users rarely A-Leave Leaving a room labels Jul 27, 2023
@sgofferj
Copy link
Author

sgofferj commented Jul 28, 2023

@reivilibre @MatMaul
My database is now at 85GB from 50GB within that couple of days. I wouldn't consider that minor - I'm fairly sure, that blowing up of the DB comes from that.

It's the state_groups_state table, specifically

@DMRobertson
Copy link
Contributor

Message retention:

Unfortunately message retention is known to have corruption effects, see e.g. #13476 . It's more likely in large rooms like Synapse admins or Matrix HQ.

My advice to you would be:

  1. Turn off message retention to lessen the likelihood of future problems like this.
  2. Try to leave the room using the admin API, see https://matrix-org.github.io/synapse/latest/admin_api/rooms.html#version-2-new-version. In particular I'd suggest using the force_purge option, as it seems that you're unable to send regular leave events.

Can you try that and see it succeeds?

@DMRobertson DMRobertson added the X-Needs-Info This issue is blocked awaiting information from the reporter label Aug 3, 2023
@sgofferj
Copy link
Author

sgofferj commented Aug 4, 2023

@DMRobertson
I already forced the user out of the room and wiped the DB by hand, basically doing a

DELETE from xxx where room_id='!OGEhHVWSdvArJzumhm:matrix.org';

for every table which has a room_id field.

Turning off message retention now.

@DMRobertson
Copy link
Contributor

wiped the DB by hand

You've probably gotten away with it, but: please please avoid manually make edits to the Synapse database. Doing so without knowing what you're doing is highly unsafe and we cannot support anyone who has done so without our guidance.

@sgofferj
Copy link
Author

sgofferj commented Aug 4, 2023

Luckily, the schema is fairly easy to understand. To be honest, if it wouldn't have worked, I would have wiped it completely and set it up new. As I wrote, I have 2 real users and a handful of bots... Wiping it would have been a smaller issue than a DB that's growing more than 1 GB per day.

As for my personal problem, your advice to turn off message retention probably has solved it for the future. There was somebody else further up having the same problem but from my point of view there is no further need for this ticket. I leave the decision what to do with it to you. Thank you for your help!

@DMRobertson
Copy link
Contributor

DMRobertson commented Aug 7, 2023

Luckily, the schema is fairly easy to understand.

This does not make manual DB spelunking a good idea. There are a number of footguns.

  • Synapse's schema is generally unconstrained; in particular it lacks a lot of foreign keys which make deletions riskier.
  • Some tables (e.g. those related to state groups and auth chains) have a particularly confusing structure.
  • Manual DB queries do not show up in Synapse logs. If something goes wrong, this makes it extra hard for us to understand why it went wrong.
  • It is particularly unsafe to make these kinds of edits while the application is running.
    • Synapse has in-memory caches (and external caches in Redis) which will now be out-of-sync with the database. There is no means for the caches to become aware of changes other than a restart.
    • Manual queries can race with queries run by the application. The application can prevent these races by coordinating with other workers, or acquiring locks. It tends not to LOCK TABLE nor use SELECT FOR UPDATE or similar mechanisms to acquire these locks robustly in the DB.
  • Making changes to a room only affects your local view of that room. Other homeservers in the room won't see any changes. For example, if you delete all rows containing a certain room ID without making that its users leave the room, all other homeservers in that room will continue to send your homeserver traffic.

@DMRobertson
Copy link
Contributor

In any case, I'm glad things are working for you. Let us know if you have any trouble with that room again.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Leave Leaving a room O-Occasional Affects or can be seen by some users regularly or most users rarely S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter
Projects
None yet
Development

No branches or pull requests

5 participants