Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Implement MSC3440 relation filtering #11086

Closed
4 tasks
Tracked by #3
clokep opened this issue Oct 14, 2021 · 5 comments · Fixed by #11236
Closed
4 tasks
Tracked by #3

Implement MSC3440 relation filtering #11086

clokep opened this issue Oct 14, 2021 · 5 comments · Fixed by #11236
Assignees
Labels
A-Threads Threaded messages T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@clokep
Copy link
Member

clokep commented Oct 14, 2021

MSC3440 suggests implementing filtering on relations, specifically filtering for parent events which have:

  • relation_types: Another event referring to them of particular relation type(s).
  • relation_senders: Another event referring to them from particular sender(s).

In order to support this we'll need to:

  • Update the filtering code to be async.
  • Update the filtering code to store/process the new filter fields.
  • Update the filtering code to fetch the event relations and filter on them.
  • Update the message pagination code to filter based on the above information.

These will have an unstable values of io.element.relation_types and io.element.relation_senders as filter keys.

@clokep clokep added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Oct 14, 2021
@clokep clokep self-assigned this Oct 14, 2021
@clokep clokep added the A-Threads Threaded messages label Oct 15, 2021
@clokep
Copy link
Member Author

clokep commented Oct 18, 2021

Not identical, but #6301 does some similar things to what we would need.

A major difference for this, compared to the other ways a message is currently filtered, is that this depends on other events, not the current event. This would require additional data to be pulled from the database (I think), which is rather unfortunate and has annoying refactoring effects (mainly that a variety of methods need to be made async to deal with this).

@clokep
Copy link
Member Author

clokep commented Oct 19, 2021

The problem I'm running into right now is that this filtering is not done on the event itself, but events that reference this event. This is something new and it isn't clear how to tackle it. We also do filtering in two places:

  1. synapse.storage.databases.main.stream.StreamWorkerStore._paginate_room_events_txn where we filter in the database.
  2. The various callers of synapse.api.filtering.Filter.filter (mostly the synapse.api.filtering.FilterCollection.filter_* methods), where filtering happens against EventBase (and other) objects.

I think I see two approaches of how we can handle this:

Approach 1

  1. Update the methods of Filter to be async (and any callers -- they all seem to end up encapsulated by async methods eventually).
  2. If necessary during filtering, fetch the relations from the database to filter against for each event.

This is somewhat undesirable as it turns Filter into something which needs to know about the database instead of a relatively simple object. It also causes needless complexity for other type of obejcts which use Filter (e.g. account data uses the same object eventually, we could probably handle that OK or just accept a tiny bit of overhead for sometimes using Awaitable instances when not needed).

Approach 2

  1. Fetch the relations from the database when fetching events (similar to what we do for redactions).
  2. Pipe this data through to the EventBase instance (probably via the internal metadata?)
  3. Update Filter to use the already-pulled relations.

This is somewhat undesirable since we'll sometimes pull data that is unused, it could also be a little finicky with caches (depending on details). It keeps Filter simple, however. This approach could also potentially allow serialize_event to not have to pull additional data from the database, but also might break down in situations where there's too many relations for a particular event.


Would appreciate if I've missed anything with the above or if there are other ideas that people have!

@babolivier
Copy link
Contributor

Disclaimer: I'm not super knowledgeable on the filtering code (I touched it a bit at some point but that was probably over a year ago), so I wouldn't be the best person to know of alternative approaches or gotchas.

I would say approach 1 sounds better to me because it minimises the amount of data we need to pull from the db, so we should in theory be saving on db time and cache sizes. Though it's a bit of a pain if we need to do it event by event rather than being able to do it in bulk - but it sounds like it could still be manageable?

On the other hand I'm wondering if, since _paginate_room_events_txn does filtering in the database (and presumably we'll need to support relation filtering there as well?) we are not creating extra work if we're also filtering in the Filter class?

Maybe there's an in-between solution, something like fetching a limited amount of data about relations (e.g. just the event IDs) from the database when we're retrieving events, and then, outside of Filter.check, see if we need to fetch relations, and if so fetch the full events?

@erikjohnston
Copy link
Member

FTR the way filtering works is that Filter.check needs to correctly filter all events, as the filtering done in _paginate_room_events_txn is just best effort and may well return more events than necessary.


I prefer approach 1, I think if we're adding filter clauses that reference things external to the event it makes sense that the Filter object then needs to become aware of the wider state.

One other thing we could potentially do (riffing off Patrick's and Brendan's ideas) is to make _paginate_room_events_txn pull out the necessary info when it sees the new clauses, and then have the Filter use those if they're present and otherwise hitting the DB if they're not. (What I really don't want to do is make it so that Filter assumes you fetched the events from _paginate_room_events_txn)

@clokep
Copy link
Member Author

clokep commented Oct 21, 2021

Thanks both! I think your idea make sense! I'm unsure if we'll be able to pull out additional data in _paginate_room_events_txn, but I think it might be doable?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Threads Threaded messages T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants