This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

replication timeouts due to message retention purge jobs #16489

Open
verymilan opened this issue Oct 14, 2023 · 4 comments
Labels
  • A-Message-Retention-Policies: Issues related to Synapse's support of MSC1763 (message retention policies)
  • A-Retention: Retention rules to delete messages after a certain amount of time
  • O-Occasional: Affects or can be seen by some users regularly or most users rarely
  • S-Major: Major functionality / product severely impaired, no satisfactory workaround.
  • T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

verymilan commented Oct 14, 2023

Description

Presumably due to #13632, the master process is unable to handle replication requests from workers because of the load from purge jobs. It happily keeps logging updates on the purge job states while clients can no longer connect.

Steps to reproduce

  • enable message retention (a sketch of the kind of configuration involved follows this list)
  • run a reasonably large instance (the problem may only show up at scale)
  • wait for the scheduled purge job to execute
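
For context, a minimal sketch of the kind of homeserver.yaml retention configuration this refers to; the lifetimes and purge-job intervals below are placeholder values, not the reporter's actual settings:

```yaml
retention:
  enabled: true
  default_policy:
    min_lifetime: 1d
    max_lifetime: 1y
  purge_jobs:
    # Each entry is a scheduled job that purges expired events from the rooms
    # it covers; these jobs running on the main process are what this issue
    # is about.
    - shortest_max_lifetime: 1d
      longest_max_lifetime: 3d
      interval: 12h
    - shortest_max_lifetime: 3d
      interval: 1d
```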

Homeserver

tchncs.de

Synapse Version

1.94.0

Installation Method

pip (from PyPI)

Database

PostgreSQL

Workers

Multiple workers

Platform

Debian GNU/Linux 12 (bookworm), dedicated

Configuration

draupnir module, presence, retention

Relevant log output

synapse.replication.tcp.client - 352 - INFO - _process_incoming_pdus_in_room_inner-124023-$fbrT_6mck678v_gNV527V0f5Jp4kvbDiQVSeHOmiN2E - Finished waiting for repl stream 'events' to reach 361593234 (event_persister1)
synapse.http.client - 923 - INFO - PUT-890470 - Received response to POST synapse-replication://master/_synapse/replication/fed_send_edu/m.receipt/IjFSBKBxIa: 200
synapse.replication.tcp.client - 332 - INFO - PUT-890470 - Waiting for repl stream 'caches' to reach 416737455 (master); currently at: 416710210
synapse.replication.tcp.client - 342 - WARNING - PUT-890464 - Timed out waiting for repl stream 'caches' to reach 416737417 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.http._base - 300 - WARNING - GET-2559861 - presence_set_state request timed out; retrying
synapse.replication.http._base - 312 - WARNING - PUT-899550 - fed_send_edu request connection failed; retrying in 1s: ConnectError(<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>)
synapse.http.client - 932 - INFO - PUT-901284 - Error sending request to  POST synapse-replication://master/_synapse/replication/fed_send_edu/m.presence/WCoECfmCdH: ConnectError [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]
[screenshot attachment: SCR-20231014-kdvv]
clokep (Member) commented Oct 16, 2023

I think this issue is mostly that "message retention is expensive and must be done on the main process"?

clokep (Member) commented Oct 16, 2023

It seems like there are a few issues here, though:

  • Message retention purging being expensive.
  • Message retention being handled by the main process.
  • Replication falling behind due to message retention using all CPU.

I'm unsure of the best approach to solving this, or whether all three need to be addressed.

@DMRobertson added the A-Message-Retention-Policies, A-Retention, S-Major, T-Defect and O-Occasional labels on Oct 27, 2023

abeluck commented Nov 16, 2023

We want to give our users a message retention promise, but it's currently not possible on large servers with databases in the hundreds of gigabytes.

Message retention purging being expensive.

I'm not sure this is solvable. Purging inherently involves touching a potentially huge number of rows.

It seems to me there are a couple possibilities:

  1. Leave purge jobs on the primary process, but cap their run time or do some sort of coroutine-style yielding so that replication traffic still gets CPU time (there is prior art in limiting background jobs to a certain duration); see the sketch after this list.
  2. Move the purge job to a worker. I know this was tried before but was switched back to the primary process for what seems like a lack of dev time rather than a technical reason (though obviously some communication happens outside GitHub issues).
    • This would require moving the purge_history function mentioned in the above issue to a place workers can access, thereby moving the purge job off the main process.
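
To illustrate option 1, here is a hypothetical sketch (not Synapse's actual purge code; the storage method name, batch size and signature are invented for the example): delete in bounded batches and yield back to the event loop between batches, so replication requests are not starved while a purge runs.

```python
# Hypothetical sketch of a "yielding" purge loop; `store.delete_oldest_events`
# is a made-up stand-in for whatever storage call would remove a bounded
# number of expired events.
import asyncio

BATCH_SIZE = 500  # hypothetical cap on rows touched per iteration


async def purge_room_history_in_batches(store, room_id: str, purge_up_to_ts: int) -> int:
    """Purge events older than `purge_up_to_ts` in one room, a batch at a time."""
    total_deleted = 0
    while True:
        deleted = await store.delete_oldest_events(
            room_id, before_ts=purge_up_to_ts, limit=BATCH_SIZE
        )
        total_deleted += deleted
        if deleted < BATCH_SIZE:
            # Fewer rows than the cap means there is nothing left to purge.
            return total_deleted
        # Yield control to the event loop so queued replication requests can
        # be served before the next batch of deletions starts.
        await asyncio.sleep(0)
```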

Since I'm not as familiar with Synapse's internals as the devs, I can't say which approach is the most attractive. But this issue is very painful for us, and it's difficult to explain to stakeholders that despite this feature being well documented, it in fact breaks the server and has done so for quite a while.


abeluck commented Nov 28, 2023

This recent commit removing the experimental warning from retention is not a good idea IMO. While it's fabulous that there is no longer a risk of corruption bugs, as this issue establishes there is a serious bug in the retention feature as it currently exists: your server will essentially stop working (if you have a large number of events that need purging).

fb664 Remove warnings from the docs about using message retention.
