SerializationFailure on appservice outgoing transaction (updating `last_txn`) #11620
Comments
Update on this: I have been looking at the code here and I think the scheduler/AS handler is likely to drop events during a restart of Synapse - notably, the events are batched in memory as attributes on the scheduler. This means that if the AS pusher instance is in recoverer mode and is restarted during that time, any events sat in memory will be lost, as they don't exist in the database even though the AS handler has already completed processing them. Note: this appears to be the case when not in recoverer mode as well - if a request to the AS is ongoing, any events currently being held in memory would also be lost upon restart of the instance.
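For illustration, a minimal sketch of that failure mode (class and attribute names here are made up, not the actual scheduler code): events held only as attributes on an in-process object vanish with the process, because nothing is persisted until a transaction is written out.

```python
# Illustrative only - not Synapse's scheduler. Pending events live purely in
# process memory, so a restart while a send (or recovery) is in flight drops
# whatever is queued here.
from collections import defaultdict

class InMemoryQueuer:
    def __init__(self):
        # as_id -> events waiting to be batched into an outgoing transaction
        self.queued_events = defaultdict(list)

    def enqueue(self, as_id: str, event) -> None:
        self.queued_events[as_id].append(event)

    def next_batch(self, as_id: str, limit: int = 100) -> list:
        batch = self.queued_events[as_id][:limit]
        del self.queued_events[as_id][:limit]
        return batch
```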
Dropping events sounds like a different problem - I suggest opening a separate issue for that. In general the AS support code isn't something that changes often, so I for one am not really familiar with how it works (or even how it's meant to work), which means it's hard to answer questions like "could the sending mechanism work differently".
👍 wasn't clear there - just re-checked and the problem we're seeing looks like this: neither the AS nor the appservice pusher worker appears to be under any load, so it should be possible to process transactions much more quickly. The serialization error, although not a direct problem, is causing the backoff, which is a problem. I think this better explains the issue we're seeing!
👍 - we did some poking around at what transactions generally look like and they're nowhere near the 100-event limit, so the issue here is much reduced; will open another issue to track it.
so the end result is slow delivery to the AS?
Yep, and this leads to a backlog of AS transactions to process, and we end up in a situation where the incoming AS txn rate is greater than the rate at which they're processed. I believe part of the issue is that during high AS traffic the last_txn update clashes with the stream position updates more often, repeatedly triggering the backoff.
Looping back on this, I may have some time later this week to work on a fix, so it would be good to agree an approach. I believe the crux of the problem is that the last_txn update and the stream position updates both target the same application_services_state row. Did a quick scan of the issues - perhaps #11567 is the resolution here, allowing parallel updates to the different columns (unsure if that's the case with the changed isolation level)?
Using two concurrent Postgres connections, I compared updating different columns of the same application_services_state row under each isolation level.

With REPEATABLE READ:
connection 1 | connection 2 |
---|---|
BEGIN TRANSACTION; | BEGIN TRANSACTION; |
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; | SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; |
UPDATE application_services_state SET last_txn = 2 WHERE as_id = 'test'; | |
 | UPDATE application_services_state SET read_receipt_stream_id = 20 WHERE as_id = 'test'; |
 | (blocks) |
COMMIT; | ERROR: could not serialize access due to concurrent update |
With READ COMMITTED:

connection 1 | connection 2 |
---|---|
BEGIN TRANSACTION; | BEGIN TRANSACTION; |
SET TRANSACTION ISOLATION LEVEL READ COMMITTED; | SET TRANSACTION ISOLATION LEVEL READ COMMITTED; |
UPDATE application_services_state SET last_txn = 2 WHERE as_id = 'test'; | |
 | UPDATE application_services_state SET read_receipt_stream_id = 20 WHERE as_id = 'test'; |
 | (blocks) |
COMMIT; | (unblocks) |
 | COMMIT; |
The resulting row after the READ COMMITTED run:

```
SELECT * FROM application_services_state WHERE as_id = 'test';
 as_id | state | last_txn | read_receipt_stream_id | presence_stream_id | to_device_stream_id
-------+-------+----------+------------------------+--------------------+---------------------
 test  |       |        2 |                     20 |                    |
(1 row)
```
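The same comparison can be scripted; below is a rough sketch using psycopg2 (the DSN is a placeholder and this is not Synapse code). The second connection takes its snapshot before the first commits, so under REPEATABLE READ its later UPDATE of the same row fails with a serialization error, while under READ COMMITTED it succeeds.

```python
import psycopg2
from psycopg2 import errors
from psycopg2.extensions import (
    ISOLATION_LEVEL_READ_COMMITTED,
    ISOLATION_LEVEL_REPEATABLE_READ,
)

def try_concurrent_updates(isolation_level: int) -> None:
    conn1 = psycopg2.connect("dbname=synapse")  # placeholder DSN
    conn2 = psycopg2.connect("dbname=synapse")
    conn1.set_isolation_level(isolation_level)
    conn2.set_isolation_level(isolation_level)
    try:
        cur1, cur2 = conn1.cursor(), conn2.cursor()
        # Start conn2's transaction now so its snapshot predates conn1's commit.
        cur2.execute("SELECT 1")
        cur1.execute(
            "UPDATE application_services_state SET last_txn = 2 WHERE as_id = 'test'"
        )
        conn1.commit()
        # The row has now been changed by a transaction that committed after
        # conn2's snapshot was taken.
        cur2.execute(
            "UPDATE application_services_state"
            " SET read_receipt_stream_id = 20 WHERE as_id = 'test'"
        )
        conn2.commit()
        print("both updates committed")
    except errors.SerializationFailure as exc:
        conn2.rollback()
        print("serialization failure:", exc)
    finally:
        conn1.close()
        conn2.close()

try_concurrent_updates(ISOLATION_LEVEL_REPEATABLE_READ)  # -> serialization failure
try_concurrent_updates(ISOLATION_LEVEL_READ_COMMITTED)   # -> both updates committed
```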
Description
We are seeing ASes go into recoverer mode after a serialization failure when updating `application_services_state.last_txn` (https://github.com/matrix-org/synapse/blob/v1.48.0/synapse/storage/databases/main/appservice.py#L272). We recently fixed a similar issue in #11195 which targets the same database row, so I suspect the issue now is that this `last_txn` update occasionally clashes with the stream position update we added the linearizer for.

I see that the `last_txn` implementation was added a few years back (0a60bbf) and hasn't changed since - as far as I can tell the only use of this column is to check that the previous value hasn't incremented by more than one. This does still appear to get triggered from time to time - I can see 12 log entries in the last 7 days in our deployment.
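To make that concrete, here is an illustrative sketch of what the `last_txn` bookkeeping amounts to (simplified, not the actual code at the link above): read the previous value, warn if the transaction being completed isn't previous + 1, then store the new value. That final UPDATE is the statement clashing with the stream position updates.

```python
# Illustrative only - simplified stand-in for the linked Synapse code.
def complete_txn_sketch(cur, as_id: str, txn_id: int) -> None:
    cur.execute(
        "SELECT last_txn FROM application_services_state WHERE as_id = %s",
        (as_id,),
    )
    row = cur.fetchone()
    last_txn = row[0] if row and row[0] is not None else 0

    if last_txn + 1 != txn_id:
        # The only use of the column: flag unexpected jumps in the txn ID.
        print(f"completing txn {txn_id} but last_txn was {last_txn}")

    # The write that occasionally fails with a serialization error when it
    # races a stream-position UPDATE on the same row.
    cur.execute(
        "UPDATE application_services_state SET last_txn = %s WHERE as_id = %s",
        (txn_id, as_id),
    )
```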
Possible Solutions

I notice the recoverer just pulls the lowest transaction and sends that - could the entire sending mechanism work just like that instead? Simply have the `_TransactionController` write out transactions to the database in batches and have a process (currently `_Recoverer`) that just pulls from these constantly. This would remove any need to track the last transaction ID for a given AS.
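As a rough illustration of that shape (the names, schema, and the `db`/`as_api` interfaces below are assumptions for the sketch, not Synapse's actual API): the controller only persists transactions, and a single sender loop always delivers the lowest outstanding transaction for the AS - the same thing the recoverer does today - so no `last_txn` bookkeeping is required.

```python
# Sketch only: assumes an async `db` wrapper (execute/fetchone) and an
# `as_api` client with an async send(txn_id, events) -> bool method.
import asyncio

class DbBackedSender:
    def __init__(self, db, as_api, as_id: str):
        self.db = db
        self.as_api = as_api
        self.as_id = as_id

    async def enqueue(self, txn_id: int, events: list) -> None:
        # _TransactionController side: just persist the transaction.
        await self.db.execute(
            "INSERT INTO application_services_txns (as_id, txn_id, event_ids)"
            " VALUES (?, ?, ?)",
            (self.as_id, txn_id, events),
        )

    async def run(self) -> None:
        # Sender side: pull the lowest outstanding txn (as the recoverer does),
        # deliver it, and delete it once the AS acknowledges it.
        while True:
            row = await self.db.fetchone(
                "SELECT txn_id, event_ids FROM application_services_txns"
                " WHERE as_id = ? ORDER BY txn_id ASC LIMIT 1",
                (self.as_id,),
            )
            if row is None:
                await asyncio.sleep(1)  # nothing queued; poll again shortly
                continue
            txn_id, events = row
            if await self.as_api.send(txn_id, events):
                await self.db.execute(
                    "DELETE FROM application_services_txns"
                    " WHERE as_id = ? AND txn_id = ?",
                    (self.as_id, txn_id),
                )
            else:
                await asyncio.sleep(1)  # AS unreachable: back off, retry same txn
```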
Version information