This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

SerializationFailure on appservice outgoing transaction (updating last_txn) #11620

Closed
Fizzadar opened this issue Dec 21, 2021 · 7 comments · Fixed by #12209
Labels
  • A-Application-Service: Related to AS support
  • S-Tolerable: Minor significance, cosmetic issues, low or no impact to users.
  • T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Fizzadar
Contributor

Description

We are seeing ASes go into recoverer mode after a serialization failure when updating application_services_state.last_txn (https://github.com/matrix-org/synapse/blob/v1.48.0/synapse/storage/databases/main/appservice.py#L272). We recently fixed a similar issue in #11195, which targets the same database row, so I suspect the issue now is that this last_txn update occasionally clashes with the stream position update we added the linearizer for.

I see that the last_txn implementation was added a few years back (0a60bbf) and hasn't changed since. As far as I can tell, the only use of this column is to check that the previous value hasn't incremented by more than one. That check does still get triggered from time to time: I can see 12 log entries in the last 7 days in our deployment.
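
To make that concrete, here is a rough sketch of the bookkeeping described above (names and structure are illustrative only, not Synapse's actual code): the stored last_txn is compared against the transaction being completed, a warning is logged if it has jumped by more than one, and the same application_services_state row is then updated.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative sketch only; the real logic lives in
# synapse/storage/databases/main/appservice.py.
def complete_txn_sketch(cur, as_id: str, txn_id: int) -> None:
    cur.execute(
        "SELECT last_txn FROM application_services_state WHERE as_id = %s",
        (as_id,),
    )
    row = cur.fetchone()
    last_txn = row[0] if row and row[0] is not None else 0

    # The only use of the column: log if the stored value is more than one
    # behind the transaction being completed.
    if last_txn + 1 != txn_id:
        logger.warning(
            "appservice %s: completing txn %s but last_txn was %s",
            as_id, txn_id, last_txn,
        )

    # This UPDATE targets the same row that the stream-position updates
    # write to, so the two can conflict under REPEATABLE READ.
    cur.execute(
        "UPDATE application_services_state SET last_txn = %s WHERE as_id = %s",
        (txn_id, as_id),
    )
```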

Possible Solutions

  • move this column to its own table; not great, but it would get around the immediate issue
  • fix the ordering issue (or work around it) and remove the column/check entirely

I notice the recoverer just pulls the lowest transaction and sends that; could the entire sending mechanism work like that instead? Simply have the _TransactionController write out transactions to the database in batches, and have a process (currently _Recoverer) that constantly pulls from these. This would remove any need to track the last transaction ID for a given AS.
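
A minimal sketch of that idea, with made-up names for the storage and HTTP client pieces (this is just an illustration of the proposed flow, not existing Synapse code):

```python
import asyncio

# Hypothetical sketch of the proposed flow: the queuer only ever persists
# transactions, and a single sender loop drains them oldest-first, so no
# last_txn bookkeeping is needed. `store` and `client` stand in for the
# storage layer and the AS HTTP client.

async def enqueue_events(store, service, events) -> None:
    # Persist the batch as a transaction immediately; nothing is held
    # only in memory, so a restart cannot drop events.
    await store.create_appservice_txn(service, events)

async def sender_loop(store, client, service) -> None:
    while True:
        txn = await store.get_oldest_unsent_txn(service)
        if txn is None:
            await asyncio.sleep(1)  # nothing to send yet
            continue
        if await client.push_transaction(service, txn):
            await store.delete_txn(service, txn)
        else:
            await asyncio.sleep(30)  # simple fixed backoff for the sketch
```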

Version information

  • v1.48
  • worker-based deployment with dedicated appservice pusher worker
@Fizzadar Fizzadar changed the title SerializationFailure on app service outgoing transaction SerializationFailure on appservice outgoing transaction (updating last_txn) Dec 21, 2021
@Fizzadar
Contributor Author

Update on this: I have been looking at the code here and I think the scheduler/AS handler is likely to drop events during a restart of Synapse; notably, the events are batched in memory as attributes on the _ServiceQueuer instance.

This means that if the AS pusher instance is in recoverer mode and is restarted during that time, any events sitting in memory will be lost, as they don't exist in the database. The AS handler completes the notify_interested_services background task without any confirmation that the events have been stored as an AS transaction in the database.

Note: this appears to be the case when not in recoverer mode as well; if a request to the AS is ongoing, any events currently held in memory would also be lost upon restart of the instance.
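
Roughly the pattern being described (heavily simplified; the real _ServiceQueuer differs):

```python
# Heavily simplified illustration of the in-memory batching described above;
# not the real _ServiceQueuer implementation.
class ServiceQueuerSketch:
    def __init__(self) -> None:
        # Pending events live only as attributes on this instance, keyed by
        # AS id, so they do not survive a restart of the pusher worker.
        self.queued_events: dict[str, list] = {}

    def enqueue(self, service_id: str, event) -> None:
        self.queued_events.setdefault(service_id, []).append(event)
        # If the process restarts before these are written out as an AS
        # transaction, the batched events are simply gone.
```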

@richvdh richvdh added the A-Application-Service Related to AS support label Dec 22, 2021
@richvdh
Member

richvdh commented Dec 23, 2021

SerializationFailures aren't, in themselves, problems - we will generally retry any transactions that hit a SerializationFailure. In any case, we should frame this issue in terms of the user-visible symptom, rather than the cause. You say the AS goes into recoverer (recovery?) mode, but again, I'm not sure what impact that actually has on the system - please could you clarify.

Dropping events sounds like a different problem - I suggest opening a separate issue for that.

In general the AS support code isn't something that changes often, so I for one am not really familiar with how it works (or even how it's meant to work), which means it's hard to answer questions like "could the sending mechanism work differently".

@richvdh richvdh added the X-Needs-Info This issue is blocked awaiting information from the reporter label Dec 23, 2021
@Fizzadar
Contributor Author

SerializationFailures aren't, in themselves, problems - we will generally retry any transactions that hit a SerializationFailure. In any case, we should frame this issue in terms of the user-visible symptom, rather than the cause. You say the AS goes into recoverer (recovery?) mode, but again, I'm not sure what impact that actually has on the system - please could you clarify.

👍 I wasn't clear there - just re-checked, and the problem we're seeing looks like:

  • Encounter SerializationError in normal transaction handling
  • Enter recovery mode
  • Encounter a second SerializationError whilst in recovery
  • Recoverer then triggers the backoff code, which causes transactions to build up

Neither the AS nor the appservice pusher worker appears to be under any load, so it should be possible to process transactions much more quickly. The serialization error, although not a direct problem itself, is causing the backoff, which is a problem. I think this better explains the issue we're seeing!

Dropping events sounds like a different problem - I suggest opening a separate issue for that.

👍 - we did some poking around at what transactions generally look like and they're nowhere near the 100-event limit, so the impact here is much reduced; I will open another issue to track it.

@richvdh
Member

richvdh commented Dec 23, 2021

Recoverer then triggers the backoff code, which causes transactions to build up

so the end result is slow delivery to the AS?

@Fizzadar
Contributor Author

Recoverer then triggers the backoff code, which causes transactions to build up

so the end result is slow delivery to the AS?

Yep, and this leads to a backlog of AS transactions to process; we end up in a situation where the incoming AS txn rate is greater than the rate at which they are processed.

I believe part of the issue is that during high AS traffic the application_services_state table is written to by both the handler (to update the stream position) and the scheduler (to track the latest txn). Thus, during periods of heavy AS traffic the SerializationError is much more likely to occur.

@reivilibre reivilibre removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Jan 10, 2022
@Fizzadar
Contributor Author

Fizzadar commented Jan 19, 2022

Looping back on this: I may have some time later this week to work on a fix, so it would be good to agree on an approach.

I believe the crux of the problem is that the application_services_state table is used both to track the ephemeral stream ID positions (read_receipt_stream_id & presence_stream_id) and the AS transaction state (state, last_txn). The issue with this approach is that the AS handler and the AS sender work independently of each other, leading to serialization errors on the table.

Did a quick scan of the issues - perhaps #11567 is the resolution here to allow parallel updates to the different columns (unsure if that's the case with the changed isolation level)?

@squahtx
Contributor

squahtx commented Feb 17, 2022

Did a quick scan of the issues - perhaps #11567 is the resolution here to allow parallel updates to the different columns (unsure if that's the case with the changed isolation level)?

Using the READ COMMITTED isolation level seems like a good fix:

With REPEATABLE READ

connection 1: BEGIN TRANSACTION;
connection 2: BEGIN TRANSACTION;
connection 1: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
connection 2: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
connection 1: UPDATE application_services_state SET last_txn = 2 WHERE as_id = 'test';
connection 2: UPDATE application_services_state SET read_receipt_stream_id = 20 WHERE as_id = 'test';
connection 2: (blocks)
connection 1: COMMIT;
connection 2: ERROR: could not serialize access due to concurrent update

With READ COMMITTED

connection 1: BEGIN TRANSACTION;
connection 2: BEGIN TRANSACTION;
connection 1: SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
connection 2: SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
connection 1: UPDATE application_services_state SET last_txn = 2 WHERE as_id = 'test';
connection 2: UPDATE application_services_state SET read_receipt_stream_id = 20 WHERE as_id = 'test';
connection 2: (blocks)
connection 1: COMMIT;
connection 2: (unblocks)
connection 2: COMMIT;

SELECT * FROM application_services_state WHERE as_id = 'test';
 as_id | state | last_txn | read_receipt_stream_id | presence_stream_id | to_device_stream_id
-------+-------+----------+------------------------+--------------------+---------------------
 test  |       |        2 |                     20 |                    |
(1 row)
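
For anyone who wants to reproduce the REPEATABLE READ failure above locally, here's a rough psycopg2 sketch (assumes a local postgres database already containing the application_services_state row used above; the connection string is a placeholder):

```python
import threading
import time

import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

DSN = "dbname=synapse"  # placeholder connection string

def connection2() -> None:
    conn2 = psycopg2.connect(DSN)
    conn2.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
    cur2 = conn2.cursor()
    try:
        # Blocks on connection 1's row lock, then fails once it commits.
        cur2.execute(
            "UPDATE application_services_state "
            "SET read_receipt_stream_id = 20 WHERE as_id = 'test'"
        )
        conn2.commit()
    except errors.SerializationFailure as e:
        print("connection 2:", e)  # could not serialize access ...
        conn2.rollback()
    finally:
        conn2.close()

conn1 = psycopg2.connect(DSN)
conn1.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
cur1 = conn1.cursor()
cur1.execute(
    "UPDATE application_services_state SET last_txn = 2 WHERE as_id = 'test'"
)

t = threading.Thread(target=connection2)
t.start()
time.sleep(1)   # give connection 2 time to block on the row lock
conn1.commit()  # connection 2 now raises the serialization error
t.join()
conn1.close()
```

Switching both set_isolation_level calls to ISOLATION_LEVEL_READ_COMMITTED should instead let connection 2 unblock and commit cleanly, matching the second table above.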

@squahtx squahtx removed their assignment Feb 17, 2022
@erikjohnston erikjohnston added S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Feb 23, 2022