This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

SerializationFailure on appservice outgoing transaction (updating last_txn) #11620

Closed
Fizzadar opened this issue Dec 21, 2021 · 7 comments · Fixed by #12209
Labels
  • A-Application-Service: Related to AS support
  • S-Tolerable: Minor significance, cosmetic issues, low or no impact to users.
  • T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Fizzadar
Contributor

Description

We are seeing ASes go into recoverer mode after a serialization failure when updating application_services_state.last_txn (https://github.com/matrix-org/synapse/blob/v1.48.0/synapse/storage/databases/main/appservice.py#L272). We recently fixed a similar issue in #11195, which targets the same database row, so I suspect the issue now is that this last_txn update occasionally clashes with the stream position update we added the linearizer for.

I see that the last_txn implementation was added a few years back (0a60bbf) and hasn't changed since. As far as I can tell, the only use of this column is to check that the previous value hasn't incremented by more than one. That check does still get triggered from time to time: I can see 12 log entries in the last 7 days in our deployment.
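
To make that concrete, here is a rough sketch of the bookkeeping described above (names and structure are illustrative only, not Synapse's actual code): the stored last_txn is compared against the transaction being completed, a warning is logged if it has jumped by more than one, and the same application_services_state row is then updated.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative sketch only; the real logic lives in
# synapse/storage/databases/main/appservice.py.
def complete_txn_sketch(cur, as_id: str, txn_id: int) -> None:
    cur.execute(
        "SELECT last_txn FROM application_services_state WHERE as_id = %s",
        (as_id,),
    )
    row = cur.fetchone()
    last_txn = row[0] if row and row[0] is not None else 0

    # The only use of the column: log if the stored value is more than one
    # behind the transaction being completed.
    if last_txn + 1 != txn_id:
        logger.warning(
            "appservice %s: completing txn %s but last_txn was %s",
            as_id, txn_id, last_txn,
        )

    # This UPDATE targets the same row that the stream-position updates
    # write to, so the two can conflict under REPEATABLE READ.
    cur.execute(
        "UPDATE application_services_state SET last_txn = %s WHERE as_id = %s",
        (txn_id, as_id),
    )
```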

Possible Solutions

  • move this column to its own table; not great, but it would get around the immediate issue
  • fix the ordering issue (or work around it) and remove the column/check entirely

I notice the recoverer just pulls the lowest transaction and sends that; could the entire sending mechanism work like that instead? Simply have the _TransactionController write out transactions to the database in batches, and have a process (currently _Recoverer) that constantly pulls from these. This would remove any need to track the last transaction ID for a given AS.
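
A minimal sketch of that idea, with made-up names for the storage and HTTP client pieces (this is just an illustration of the proposed flow, not existing Synapse code):

```python
import asyncio

# Hypothetical sketch of the proposed flow: the queuer only ever persists
# transactions, and a single sender loop drains them oldest-first, so no
# last_txn bookkeeping is needed. `store` and `client` stand in for the
# storage layer and the AS HTTP client.

async def enqueue_events(store, service, events) -> None:
    # Persist the batch as a transaction immediately; nothing is held
    # only in memory, so a restart cannot drop events.
    await store.create_appservice_txn(service, events)

async def sender_loop(store, client, service) -> None:
    while True:
        txn = await store.get_oldest_unsent_txn(service)
        if txn is None:
            await asyncio.sleep(1)  # nothing to send yet
            continue
        if await client.push_transaction(service, txn):
            await store.delete_txn(service, txn)
        else:
            await asyncio.sleep(30)  # simple fixed backoff for the sketch
```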

Version information

  • v1.48
  • worker-based deployment with dedicated appservice pusher worker
@Fizzadar Fizzadar changed the title SerializationFailure on app service outgoing transaction SerializationFailure on appservice outgoing transaction (updating last_txn) Dec 21, 2021
@Fizzadar
Contributor Author

Update on this: I have been looking at the code here and I think the scheduler/AS handler is likely to drop events during a restart of Synapse; notably, the events are batched in memory as attributes on the _ServiceQueuer instance.

This means that if the AS pusher instance is in recoverer mode and is restarted during that time, any events sitting in memory will be lost, as they don't exist in the database. The AS handler completes the notify_interested_services background task without any confirmation that the events have been stored as an AS transaction in the database.

Note: this appears to be the case when not in recoverer mode as well; if a request to the AS is ongoing, any events currently held in memory would also be lost upon restart of the instance.
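
Roughly the pattern being described (heavily simplified; the real _ServiceQueuer differs):

```python
# Heavily simplified illustration of the in-memory batching described above;
# not the real _ServiceQueuer implementation.
class ServiceQueuerSketch:
    def __init__(self) -> None:
        # Pending events live only as attributes on this instance, keyed by
        # AS id, so they do not survive a restart of the pusher worker.
        self.queued_events: dict[str, list] = {}

    def enqueue(self, service_id: str, event) -> None:
        self.queued_events.setdefault(service_id, []).append(event)
        # If the process restarts before these are written out as an AS
        # transaction, the batched events are simply gone.
```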

@richvdh richvdh added the A-Application-Service Related to AS support label Dec 22, 2021
@richvdh
Member

richvdh commented Dec 23, 2021

SerializationFailures aren't, in themselves, problems - we will generally retry any transactions that hit a SerializationFailure. In any case, we should frame this issue in terms of the user-visible symptom, rather than the cause. You say the AS goes into recoverer (recovery?) mode, but again, I'm not sure what impact that actually has on the system - please could you clarify.

Dropping events sounds like a different problem - I suggest opening a separate issue for that.

In general the AS support code isn't something that changes often, so I for one am not really familiar with how it works (or even how it's meant to work), which means it's hard to answer questions like "could the sending mechanism work differently".

@richvdh richvdh added the X-Needs-Info This issue is blocked awaiting information from the reporter label Dec 23, 2021
@Fizzadar
Contributor Author

SerializationFailures aren't, in themselves, problems - we will generally retry any transactions that hit a SerializationFailure. In any case, we should frame this issue in terms of the user-visible symptom, rather than the cause. You say the AS goes into recoverer (recovery?) mode, but again, I'm not sure what impact that actually has on the system - please could you clarify.

👍 I wasn't clear there - just re-checked, and the problem we're seeing looks like:

  • Encounter SerializationError in normal transaction handling
  • Enter recovery mode
  • Encounter a second SerializationError whilst in recovery
  • Recoverer then triggers the backoff code, which causes transactions to build up

Neither the AS nor the appservice pusher worker appears to be under any load, so it should be possible to process transactions much more quickly. The serialization error, although not a direct problem itself, is causing the backoff, which is a problem. I think this better explains the issue we're seeing!

Dropping events sounds like a different problem - I suggest opening a separate issue for that.

👍 - we did some poking around at what transactions generally look like and they're nowhere near the 100-event limit, so the impact here is much reduced; I will open another issue to track it.

@richvdh
Member

richvdh commented Dec 23, 2021

Recoverer then triggers the backoff code, which causes transactions to build up

so the end result is slow delivery to the AS?

@Fizzadar
Contributor Author

Recoverer then triggers the backoff code, which causes transactions to build up

so the end result is slow delivery to the AS?

Yep, and this leads to a backlog of AS transactions to process; we end up in a situation where the incoming AS txn rate is greater than the rate at which they are processed.

I believe part of the issue is that during high AS traffic the application_services_state table is written to by both the handler (to update the stream position) and the scheduler (to track the latest txn). Thus, during periods of heavy AS traffic the SerializationError is much more likely to occur.

@reivilibre reivilibre removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Jan 10, 2022
@Fizzadar
Contributor Author

Fizzadar commented Jan 19, 2022

Looping back on this: I may have some time later this week to work on a fix, so it would be good to agree on an approach.

I believe the crux of the problem is that the application_services_state table is used both to track the ephemeral stream ID positions (read_receipt_stream_id & presence_stream_id) and the AS transaction state (state, last_txn). The issue with this approach is that the AS handler and the AS sender work independently of each other, leading to serialization errors on the table.

Did a quick scan of the issues - perhaps #11567 is the resolution here to allow parallel updates to the different columns (unsure if that's the case with the changed isolation level)?

@squahtx
Contributor

squahtx commented Feb 17, 2022

Did a quick scan of the issues - perhaps #11567 is the resolution here to allow parallel updates to the different columns (unsure if that's the case with the changed isolation level)?

Using the READ COMMITTED isolation level seems like a good fix:

With REPEATABLE READ

connection 1: BEGIN TRANSACTION;
connection 2: BEGIN TRANSACTION;
connection 1: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
connection 2: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
connection 1: UPDATE application_services_state SET last_txn = 2 WHERE as_id = 'test';
connection 2: UPDATE application_services_state SET read_receipt_stream_id = 20 WHERE as_id = 'test';
connection 2: (blocks)
connection 1: COMMIT;
connection 2: ERROR: could not serialize access due to concurrent update

With READ COMMITTED

connection 1: BEGIN TRANSACTION;
connection 2: BEGIN TRANSACTION;
connection 1: SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
connection 2: SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
connection 1: UPDATE application_services_state SET last_txn = 2 WHERE as_id = 'test';
connection 2: UPDATE application_services_state SET read_receipt_stream_id = 20 WHERE as_id = 'test';
connection 2: (blocks)
connection 1: COMMIT;
connection 2: (unblocks)
connection 2: COMMIT;

SELECT * FROM application_services_state WHERE as_id = 'test';
 as_id | state | last_txn | read_receipt_stream_id | presence_stream_id | to_device_stream_id
-------+-------+----------+------------------------+--------------------+---------------------
 test  |       |        2 |                     20 |                    |
(1 row)
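
For anyone who wants to reproduce the REPEATABLE READ failure above locally, here's a rough psycopg2 sketch (assumes a local postgres database already containing the application_services_state row used above; the connection string is a placeholder):

```python
import threading
import time

import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

DSN = "dbname=synapse"  # placeholder connection string

def connection2() -> None:
    conn2 = psycopg2.connect(DSN)
    conn2.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
    cur2 = conn2.cursor()
    try:
        # Blocks on connection 1's row lock, then fails once it commits.
        cur2.execute(
            "UPDATE application_services_state "
            "SET read_receipt_stream_id = 20 WHERE as_id = 'test'"
        )
        conn2.commit()
    except errors.SerializationFailure as e:
        print("connection 2:", e)  # could not serialize access ...
        conn2.rollback()
    finally:
        conn2.close()

conn1 = psycopg2.connect(DSN)
conn1.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
cur1 = conn1.cursor()
cur1.execute(
    "UPDATE application_services_state SET last_txn = 2 WHERE as_id = 'test'"
)

t = threading.Thread(target=connection2)
t.start()
time.sleep(1)   # give connection 2 time to block on the row lock
conn1.commit()  # connection 2 now raises the serialization error
t.join()
conn1.close()
```

Switching both set_isolation_level calls to ISOLATION_LEVEL_READ_COMMITTED should instead let connection 2 unblock and commit cleanly, matching the second table above.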

@squahtx squahtx removed their assignment Feb 17, 2022
@erikjohnston erikjohnston added S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Feb 23, 2022