Room retention *increases* size of state tables. #9406

Open

matrixbot opened this issue Dec 18, 2023 · 4 comments


matrixbot commented Dec 18, 2023

This issue has been migrated from #9406.


We have reports that enabling room retention in a very large room (with frequent joins and leaves) actually causes the size of the state tables to increase, which is surprising.

Note that it appears that postgres is vacuuming the tables.

Also note that this is an experimental feature.

matrixbot changed the title from "Dummy issue" to "Room retention *increases* size of state tables." on Dec 21, 2023
matrixbot reopened this on Dec 21, 2023

kaiyou commented Jan 15, 2024

Hello, the report very possibly came from a discussion I initiated in some Matrix admin rooms, since we were facing this exact issue and trying to get to the bottom of it before opening a thread on GitHub. We are, however, unable to pinpoint the exact source of the defect, so here are as many details as I can collect, in the hope of getting your help.

We are running a dedicated HS for bridges. Most rooms are therefore initiated by appservices, have many (1-20k) local users from appservices, and a handful of remote users from multiple consuming HSes (1-20).

We recently decided to enable a retention policy on the server, deleting history, including local history, older than 6 months. Our rationale was guided by our ToS, which clearly states we are not entitled to store bridged or cached data for more than 6 months, plus a need to purge old events and associated media, which had stacked up terabytes of data from bridging high-activity Telegram rooms, for instance.

We are aware this might interfere with history purges on other homeservers: the seemingly non-destructive operation of purging remote history elsewhere becomes destructive because, when paginating, the remote HS will get empty history from us. We are okay with this from a user experience perspective.

The failure

We woke up to a full drive, which had filled overnight, so we were unable to respond in time; postgres was now read-only, with no room left for vacuuming.

Our bridges database had grown to over 400 GB. The culprit was clearly state management, with state_groups_state growing 10x in rows and 4x in size (it had not been vacuumed for a while). No state autocompressor was running at the time, since we usually wait a couple of days or weeks after big configuration changes before we switch synatainer back on.

The facts

After failing to identify anything obvious in the logs or data, and not finding any open issue on the matter, we pulled the database onto a high-performance machine for analytics and gathered metrics about rooms and room state in the hope of uncovering a pattern.

Room count was stable.

bridges=# select count(*) from rooms;
 count 
-------
  8734

State group entries were indeed involved; here is the total row count. We have no precise metric for the previous value, but we believe it was around 50M, so this is a 10x increase overnight.

bridges=# select count(*) from state_groups_state ;
   count   
-----------
 591294362
(1 row)

After creating a couple of views to help the investigation, we checked for the frequent issue of orphaned state groups:

bridges=# select cast(sum(case when r.room_id is null then 1 else 0 end) as float) / sum(1) from investigation_rooms_by_state_groups i left join rooms r on i.room_id=r.room_id;
      ?column?      
--------------------
 0.4645638879948503
(1 row)


bridges=# select cast(sum(case when r.room_id is null then cnt else 0 end) as float) / sum(cnt) from investigation_rooms_by_state_groups i left join rooms r on i.room_id=r.room_id;
       ?column?       
----------------------
 0.044112372240072195
(1 row)

They do indeed make up half the rooms represented in state groups, yet only about 5% of the rows, so we discarded them as a culprit and decided to come back and clean this up later.

We sorted rooms by number of state group entries (not state groups). We frequently monitor these when trying to slim down our database, so we were able to identify that the top 10 rooms from the previous week were now barely making the top 50. The top 3 rooms showed 50M, 50M and 40M state group entries respectively, so we decided to focus on the top one and see whether any pattern applied to the others.

This was no new room, and the latest event was already a week old:

bridges=# select to_timestamp(min(received_ts)/1000), to_timestamp(max(received_ts)/1000) from events where room_id='!snip';
      to_timestamp      |      to_timestamp      
------------------------+------------------------
 2022-05-22 01:46:22+00 | 2023-12-26 10:16:56+00
(1 row)

Most events were from initially joining the room; after that, not much happened in terms of state, and presumably any messages that were sent predate the retention period and are now deleted. None of the actual users in the room were from our own HS, so we were unable to properly confirm that last hypothesis.

bridges=# select to_timestamp(received_ts/1000)::date dat, count(*) cnt from events where room_id='!snip' group by dat;
    dat     |  cnt  
------------+-------
 2022-09-26 |     1
 2022-12-07 |     1
 2023-09-23 |     1
 2023-12-24 |     2
 2023-12-26 |     2
 2023-08-03 |     2
 2022-05-22 |  5104
 2022-05-23 | 14914
 2023-06-09 |     1
 2023-02-16 |     1
 2023-10-05 |     1
 2023-07-03 |     1
 2023-09-07 |     1
(13 rows)

Also, close to every state event is an invite+join, which backs our understanding:

bridges=# select type, count(*) cnt from events where room_id='!snip' group by type;
           type            |  cnt  
---------------------------+-------
 m.room.create             |     1
 uk.half-shot.bridge       |     2
 m.room.join_rules         |     1
 m.bridge                  |     2
 m.room.guest_access       |     1
 m.room.name               |     1
 m.room.power_levels       |     1
 m.room.member             | 20021
 m.room.history_visibility |     1
 m.room.avatar             |     1
(10 rows)

The aha moment was realizing that 20k invites+joins means 10k joins, thus 10k members if no one leaves, and that the sum of the first 10k natural integers is very close to 50M (approximating n(n+1)/2 by n²/2 is reasonable at this scale). Supposedly there was now one state group per join event (confirmed), and each state group had one entry per member already joined, amounting to a terrible 50M rows.
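
To make that arithmetic concrete, here is a tiny Python sketch of the counting model (my own toy model, not synapse code): one fully de-delta-ed state group per join, each holding one entry per member joined so far, plus the events²/8 heuristic used in the query further below.

```python
def dedeltaed_rows(members: int) -> int:
    """Rows in state_groups_state if each of the n join events produces a
    fully expanded state group holding one entry per member joined so far."""
    return sum(range(1, members + 1))  # n * (n + 1) / 2

def estimate_from_events(events: int) -> float:
    """Heuristic from the query further below: a room made almost only of
    invite+join pairs has ~events/2 members, hence ~events**2 / 8 rows."""
    members = events / 2
    return members * members / 2

print(dedeltaed_rows(10_000))        # 50_005_000 -> the ~50M seen in the top room
print(estimate_from_events(20_021))  # ~50.1M, hence ratios 8*cnt/evts**2 close to 1
```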

On a bridge server, many rooms show this pattern of state events mostly comprised of invites+joins, happening either massively when initially bridging or gradually over time, with very few people leaving. We tried to spot the pattern in other rooms.

Here we naively count events and state group entries. If a room is mostly comprised of join and invite events, with little or no other history, and our hypothesis holds, then it will have on the order of n²/2 state group entries, with n half the number of events (half, since each member is one invite plus one join), i.e. count(events)²/8. This is a rough approximation, merely designed to check whether the pattern held at all in the top rooms:

bridges=# select cnt, evts, 8*cast(cnt as float)/evts/evts from investigation_rooms_by_state_groups s left join investigation_rooms_by_events e using (room_id) order by s.cnt desc limit 20;
   cnt    | evts  |      ?column?      
----------+-------+--------------------
 50155758 | 20032 | 0.9999128790101971
 50033393 | 20214 | 0.9795924284931093
 39997112 |  6479 |  7.622591763037676
 37804143 | 18137 | 0.9193872533552278
 30092162 | 19000 | 0.6668623157894736
 28028503 |  7494 | 3.9926616856157287
 22343796 | 15006 | 0.7938109042668419
 18289653 | 12247 | 0.9755196395757239
 16237140 | 11414 | 0.9970657217354616       
 11381242 | 12712 | 0.5634454503725067
 11218030 |  4734 |  4.004516402748751
  9617828 |  9132 | 0.9226461789121406
  9561422 |  9314 | 0.8817389104738831
  9500159 |  8226 |  1.123164713072573
  8018862 |  6182 |  1.678590361744646
  7773328 |  7849 |  1.009411671935376
  7466459 |  7779 | 0.9870908556416975
  7167351 |  9294 | 0.6638092868456197
(20 rows)

We were not expecting every room to exhibit the pattern (a ratio close to 1), but we were not expecting so many to either. This was our smoking gun: somehow, somewhere, something was causing state groups to expand to their full, decompressed form.

Our interpretation

Down the rabbit hole we went. Synapse code very rarely creates state group entries. In fact it does so in only two places: when a state event is stored, de-duplicating reasonably well by creating edges between state groups (the very structure that state compressors try to optimize further), and right here, when purging unreferenced state groups:

# Now we turn the state groups that reference to-be-deleted state

And purging unreferenced state groups is exactly what purging events does after computing a list of said groups:

await self.stores.state.purge_unreferenced_state_groups(

This is in fact the only place where state groups are deleted and recreated, apart from normal event processing. And it is the very routine called by retention jobs.

We think that, in some rooms at least, depending on factors we were not able to pinpoint, the purge routine determines after purging history that some state groups are unreferenced and need purging, the very groups that hold the state group hierarchy together, leading to the entire hierarchy being de-delta-ed and enormous numbers of state group entries being created.
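
To illustrate the mechanism we suspect, here is a toy Python model (my own simplification, not synapse's actual schema or API) of delta-stored state groups and of what has to happen when a group that others delta against is deleted: each surviving dependant is rewritten in full, one row per state entry.

```python
from typing import Dict, Optional, Tuple

# Toy model of delta-stored state groups; not synapse's tables or API.
StateKey = Tuple[str, str]                    # (event type, state_key)
groups: Dict[int, Dict[StateKey, str]] = {}   # group id -> the delta rows it stores
parents: Dict[int, Optional[int]] = {}        # group id -> group it deltas against

def resolve(group: int) -> Dict[StateKey, str]:
    """Materialise the full state of a group by walking its delta chain."""
    chain, g = [], group
    while g is not None:
        chain.append(g)
        g = parents[g]
    state: Dict[StateKey, str] = {}
    for g in reversed(chain):                 # apply deltas root first
        state.update(groups[g])
    return state

def purge_group(group: int) -> None:
    """Delete one group; every group that deltas against it must first be
    rewritten in full. This is the 'de-delta-ing' that multiplies rows."""
    for child, parent in parents.items():
        if parent == group:
            groups[child] = resolve(child)    # expand while the parent still exists
            parents[child] = None
    del groups[group]
    del parents[group]

# Example: group 1 is the room creation, groups 2..5 each add one member.
groups[1], parents[1] = {("m.room.create", ""): "$create"}, None
for k in range(2, 6):
    groups[k] = {("m.room.member", f"@user{k}:example.org"): f"$join{k}"}
    parents[k] = k - 1

purge_group(1)           # purging the root forces its child to be stored in full
print(len(groups[2]))    # 2 rows now stored for group 2 instead of 1 delta row
```

If a purge expands a whole hierarchy of roughly 10k such groups this way, each group ends up storing its full member list, which is exactly the n²/2 rows computed above.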

Our current status

This is as far as we can get with our understanding of synapse internals.

We have not yet had enough time to try to reproduce this on a small HS with small rooms.

We gambled on the hypothesis that this would only happen when the retention job first cleans a room's history, and that we could therefore safely compress the state table and be on our way.

It did compress very well, down to about 5% of its initial size, by manually running the state compressor with default settings on every room. If you remember the orphaned state groups from the initial findings, they amounted to pretty much everything that was left, and our state_groups_state table shrank 100x. Big win. We even posited this led to better compressed output than running the compressor against the original state groups (quite unsure about this one, though). One VACUUM FULL later, we were up and running.

It did fine for a couple of days, until the retention jobs kicked in; everything is now de-deduplicated again.

We are thinking about running that database on a separate host until we find a more stable fix, and we are not trying to compress that table anymore.

We do need help to go further, even just pointers to places in the codebase we might have missed, or ideas for reproducing this in a test case.


kaiyou commented Jan 15, 2024

Oh, the original report is pretty old already; no idea why I did not find it in the first place, but I was surely not the original reporter, since we experienced this only weeks ago.


kaiyou commented Jan 15, 2024

Sorry about spamming this issue. After a first pessimistic report, it appears that the rooms that were "de-deduplicated" this time are different from those affected the first time.

My take on this is that the retention job did not finish last time, since the database crashed before it could. We were lucky that the remaining rooms did not nuke our server once more. The symptoms are similar and the same pattern emerges, this time with even more entries: one room shows 180M rows in state_groups_state :)

I believe we are now out of most of the trouble, but I will keep a copy of the database in case it can help pinpoint the root cause.


kaiyou commented May 30, 2024

A couple of months later, this happened again on our HS. We were lucky to avoid major downtime, so I went back down the rabbit hole of the history purging code in synapse. Here is a summary of what I found, and how I suggest we fix some of it.

The process of locating state groups to be deleted is complex and looks flawed

First, history purging happens at:

This is supposed to purge old events and their relationships, then return every state group to be considered as an initial candidate for deletion. It does this by returning every state group related to any event, including state events, that matches the purge criteria, except for outliers. It ends by removing the reference to state groups from state events matching the criteria and turning those events into outliers. This means those same state events will never be considered in later purges (which explicitly exclude outliers).

I am unsure whether this is an actual issue, since the associated state groups should be deleted and we might not care about these events later, but I would feel more confident including more state groups as candidates rather than leaving some behind potentially forever. Maybe this is also related to the accumulation of unreferenced state groups.
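
Paraphrasing that behaviour as pseudo-code (my reading of it, with hypothetical names and attributes, not the actual synapse function):

```python
def purge_history_as_described(events, purge_ts):
    """Collect candidate state groups from all matching non-outlier events,
    then turn matching state events into outliers so later purges skip them.
    (The actual deletion of events and their relations is omitted here.)"""
    candidates = set()
    for ev in events:
        if ev.received_ts < purge_ts and not ev.is_outlier:
            if ev.state_group is not None:
                candidates.add(ev.state_group)   # initial deletion candidates
    for ev in events:
        if ev.received_ts < purge_ts and ev.is_state:
            ev.state_group = None                # drop the reference to its group...
            ev.is_outlier = True                 # ...and exclude it from later purges
    return candidates
```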

Later on, this initial list is expanded and filtered here:

async def _find_unreferenced_groups(self, state_groups: Set[int]) -> Set[int]:

This basically climbs up the state group tree and stops when it finds state groups that are still referenced and thus should not be deleted. This is fine, except that it climbs up the tree using get_previous_state_groups:

async def get_previous_state_groups(

From a quick overview, this does not return previous state groups but instead filters the given state group list, keeping only those that have children. This in turn breaks the code in _find_unreferenced_groups in a way that simply never climbs up the tree, leaving additional unreferenced groups behind, possibly forever.
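
As a toy illustration of the difference (my own sketch against a hypothetical child-to-parent edge mapping, not the real storage API), here is the upward walk that _find_unreferenced_groups seems to need versus a filter that never leaves the initial set:

```python
from typing import Dict, Set

# Hypothetical state_group_edges, as a child -> parent (previous group) mapping.
edges: Dict[int, int] = {2: 1, 3: 2, 4: 3}

def climb_to_parents(groups: Set[int]) -> Set[int]:
    """What the tree walk needs: the previous (parent) groups of the given ones."""
    return {edges[g] for g in groups if g in edges}

def keep_groups_with_children(groups: Set[int]) -> Set[int]:
    """What the reported behaviour amounts to: keep only input groups that some
    other group deltas against; this never moves upward in the tree."""
    referenced = set(edges.values())
    return {g for g in groups if g in referenced}

print(climb_to_parents({4}))            # {3}: the walk can continue up the chain
print(keep_groups_with_children({4}))   # set(): the walk stops immediately
```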

Here is a reference to the separate issue about unreferenced state groups if my analysis can help there: matrix-org/synapse#12821

I suggest:

  • fixing the bug in get_previous_state_groups,
  • at the very least, adding some documentation to the purging code explaining why we dereference state groups, set state events as outliers, etc.

State groups are stored indefinitely for rejected client events

When creating client events, synapse stores them very early in an UnpersistedEventContext. This is done most notably by calculate_context_info, which stores references to the event's state, thus creating state groups, before the event is checked for authorization and long before it is persisted.

For instance, a client event for joining an invite-only room will generate a state group, even if it is rejected. This leads to unreferenced entries in the state group tree, which also reference unpersisted events.

I am unsure how to proceed on this. On paper, state groups are merely a structured cache, so this should do no harm beyond taking up space that is never reclaimed, while changing the behavior sounds like a colossal task.
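
A self-contained toy sketch of that ordering (all names below are hypothetical stand-ins; the real path goes through calculate_context_info and UnpersistedEventContext):

```python
class ToyStore:
    """Hypothetical stand-in for the storage layer."""
    def __init__(self) -> None:
        self.state_groups: list = []   # grows even for rejected events
        self.events: list = []

    def create_state_group_for(self, event: str) -> None:
        self.state_groups.append(event)

def handle_client_event(event: str, allowed: bool, store: ToyStore) -> None:
    # 1. The event's state is referenced and a state group is created *before*
    #    the event is authorised (the ordering described above).
    store.create_state_group_for(event)

    # 2. Only then is authorisation checked; a join to an invite-only room
    #    without an invite is rejected here.
    if not allowed:
        return   # 3. The event is never persisted, but the state group remains.

    store.events.append(event)

store = ToyStore()
handle_client_event("denied join", allowed=False, store=store)
print(len(store.events), len(store.state_groups))   # 0 events, 1 orphaned state group
```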

Application services weaponize this

At least, Mautrix does.

https://github.com/mautrix/go/blob/57f6cd89e3bf148e7ab497ef076e054fd506e2ab/appservice/intent.go#L91

For every bridged user, instead of checking the room permissions, Mautrix simply tries to join, then upon failure invites the user and joins again. This works fine, except that synapse stores an unreferenced state group in the tree for every bridged user, due to the denied initial join.

A bridged room's state group tree is therefore pretty straightforward, except that every user appears three times: one invite, one join, and one unreferenced branch. Being unreferenced, that branch is not considered for deletion when purging the room. However, when the main branch is removed, all those state groups are de-delta-ed:

# Now we turn the state groups that reference to-be-deleted state

So after a history purge, a bridged room has one unreferenced, de-delta-ed state group for every room member. This grows quadratically as shown above, and because of it a large room may create tens or hundreds of millions of state_groups_state rows within a couple of minutes of purging history, effectively taking the database down.
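
To put numbers on that, a rough estimate assuming each of those unreferenced groups ends up holding roughly one row per member already present:

```python
def purge_blowup_rows(members: int) -> int:
    """One unreferenced group per bridged member, each de-delta-ed to roughly
    one row per member already joined: on the order of m^2 / 2 rows."""
    return members * members // 2

print(f"{purge_blowup_rows(10_000):,}")  # 50,000,000 rows for a 10k-member room
print(f"{purge_blowup_rows(19_000):,}")  # ~180M, the order of magnitude reported earlier
```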

I suggest:

  • for now, stopping the cleanup of state groups when purging history; this leaves unreferenced entries for cleaning scripts to scrub, but I would rather have rotting data than exploding data,
  • reviewing the root causes above, as this is hardly a Mautrix-specific issue and could be exploited in many other ways, including intentionally.
