This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Collect outgoing read-receipts into buckets #4777

Closed
richvdh wants to merge 5 commits into develop from rav/bucket_outgoing_edus

Conversation

richvdh
Member

@richvdh richvdh commented Mar 1, 2019

In order to reduce the number of outgoing federation transactions, we want
to aggregate read-receipts, since there is little harm if outgoing RRs are
delayed by a few seconds, and we can dramatically reduce the number of
transactions by doing so.

This change introduces the concept of EDU 'buckets'; currently we have the
'Instant' bucket and the 'Delayed' bucket.

Builds on #4770, #4771

Fixes #4730, #3951.

In worker mode, on the federation sender, when we receive an edu for sending
over the replication socket, it is parsed into an Edu object. There is no point
extracting the contents of it so that we can then immediately build another Edu.
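For illustration, a minimal sketch of the bucketing idea described above. The class and constant names here are assumptions for the example, not the PR's actual code:

```python
# Minimal sketch of the bucket concept; names are illustrative only.
INSTANT_EDU_BUCKET_ID = 0   # flush straight away (e.g. device-list updates)
DELAYED_EDU_BUCKET_ID = 1   # may be held briefly (e.g. read receipts)
DELAYED_FLUSH_MS = 5000     # the delay discussed below


class PerDestinationBuckets(object):
    """Pending EDUs for one destination, grouped by bucket."""

    def __init__(self, destination):
        self.destination = destination
        self.buckets = {INSTANT_EDU_BUCKET_ID: [], DELAYED_EDU_BUCKET_ID: []}

    def queue_edu(self, edu, bucket_id):
        """Queue an EDU and return how long (in ms) it may wait before sending."""
        self.buckets[bucket_id].append(edu)
        # Instant EDUs trigger a transaction immediately; delayed EDUs are held
        # so that several read receipts can share a single federation transaction.
        return 0 if bucket_id == INSTANT_EDU_BUCKET_ID else DELAYED_FLUSH_MS
```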
@codecov

codecov bot commented Mar 1, 2019

Codecov Report

Merging #4777 into develop will decrease coverage by 0.02%.
The diff coverage is 70.94%.

@@             Coverage Diff             @@
##           develop    #4777      +/-   ##
===========================================
- Coverage    75.09%   75.06%   -0.03%     
===========================================
  Files          340      340              
  Lines        34923    35147     +224     
  Branches      5723     5792      +69     
===========================================
+ Hits         26225    26383     +158     
- Misses        7088     7132      +44     
- Partials      1610     1632      +22

@richvdh richvdh requested a review from a team March 1, 2019 17:38
@ara4n
Member

ara4n commented Mar 1, 2019

i'm a bit worried that 5s is quite a long time to delay outbound RRs for: if i'm talking to someone over federation, i'd be a bit worried if it seemed there was a 5s delay on them reading my messages every time I said something - and i'd be surprised if they started sending responses to my messages before I'd received their RR. Can we make the 5s configurable? Or only kick in if we're overloaded?

@ara4n
Member

ara4n commented Mar 3, 2019

actually, given RRs cosmetically update whenever you receive a msg from someone, perhaps it’s not that bad. would still be nice if it only kicked in during crises tho?

@turt2live
Member

It's really disorienting as-is having them be a couple seconds delayed. I would highly suggest this only kicks in when there's significant load.

@richvdh
Member Author

richvdh commented Mar 4, 2019

hrm. Reliably determining that there is significant load is tricky.

@erikjohnston
Member

I wonder if we should only batch up read receipts which are a fair bit older than the event? If the read receipt is sent within a few hundred ms of the message being sent, it's quite bad to then delay it by several seconds; but if the read receipt is sent several minutes after, it's not the end of the world to delay it being sent by a few seconds.

The only annoyance I can think of there is if a remote user sees a typing notification before a read receipt from the user, but skimming the code it appears that a typing notification will (probably) cause the read receipt to also be sent?
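A rough sketch of the heuristic being floated here, assuming receipt and event timestamps are available in milliseconds; the threshold value is a guess, not something from the PR:

```python
INSTANT_EDU_BUCKET_ID = 0
DELAYED_EDU_BUCKET_ID = 1
RECEIPT_AGE_THRESHOLD_MS = 60 * 1000  # "a fair bit older than the event"; value is a guess


def choose_receipt_bucket(receipt_ts_ms, event_ts_ms):
    """Pick a bucket for an outgoing read receipt based on its age.

    Receipts sent within a few hundred ms of the event feel "live" to the
    other side, so they go out immediately; stale receipts can tolerate a
    few seconds of batching.
    """
    if receipt_ts_ms - event_ts_ms < RECEIPT_AGE_THRESHOLD_MS:
        return INSTANT_EDU_BUCKET_ID
    return DELAYED_EDU_BUCKET_ID
```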

@richvdh
Member Author

richvdh commented Mar 4, 2019

> I wonder if we should only batch up read receipts which are a fair bit older than the event? If the read receipt is sent within a few hundred ms of the message being sent, it's quite bad to then delay it by several seconds; but if the read receipt is sent several minutes after, it's not the end of the world to delay it being sent by a few seconds.

I think this will do little to deal with the main problem, which is that, during a period of busy traffic in a large room, we send a new federation transaction to every server in the room every time someone on matrix.org reads each message.

I'll try and get some actual data on this.

> skimming the code it appears that a typing notification will (probably) cause the read receipt to also be sent

That's the idea, yes.

@erikjohnston
Member

> I wonder if we should only batch up read receipts which are a fair bit older than the event? If the read receipt is sent within a few hundred ms of the message being sent, it's quite bad to then delay it by several seconds; but if the read receipt is sent several minutes after, it's not the end of the world to delay it being sent by a few seconds.

> I think this will do little to deal with the main problem, which is that, during a period of busy traffic in a large room, we send a new federation transaction to every server in the room every time someone on matrix.org reads each message.

TBH, my hunch is that it will help if we set it at 30s or a minute, but getting some metrics on it would be grand

Member

@erikjohnston erikjohnston left a comment

This looks broadly good, I've mainly struggled to follow along with PerDestinationQueue and EduTransmissionBucket. I worry a bit that being generic over the buckets has added unnecessary complexity to the thing. I've added some thoughts/suggestions/comments around parts I find a bit confusing, though they don't necessarily point to an obvious solution.

I wonder if we should change PerDestinationQueue to:

  1. Have an add_edu(edu, bucket) API, rather than returning a bucket to avoid races
  2. Have either a) two hardcoded buckets for "instant" vs "delayed" EDUs or b) a PriorityQueue which maintains the correct order. This saves from having to figure out when to create/destroy buckets.
  3. When adding an edu call attempt_new_transaction immediately if edu has instant bucket
  4. Move the transmission clock into the PerDestinationQueue, since there should only be one per host, rather than one per bucket.
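
A rough sketch of what that suggested shape could look like (hypothetical names; this is not the code under review):

```python
import attr


@attr.s(slots=True)
class PerDestinationQueue(object):
    destination = attr.ib()              # (1)/(4): the queue knows its host
    attempt_new_transaction = attr.ib()  # callback that kicks off a send for this host

    # (2a): two hardcoded buckets instead of a generic dict of buckets
    instant_edus = attr.ib(factory=list)
    delayed_edus = attr.ib(factory=list)

    # (4): a single transmission timer per host, not one per bucket
    flush_call = attr.ib(default=None)

    def add_edu(self, edu, instant):
        """(1): callers pass the EDU and a bucket choice; no bucket objects escape."""
        if instant:
            self.instant_edus.append(edu)
            # (3): an instant EDU triggers a transaction straight away
            self.attempt_new_transaction(self.destination)
        else:
            self.delayed_edus.append(edu)
            # a real implementation would arm self.flush_call here if not already armed
```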

"""Construct an Edu object, and queue it for sending

Args:
destination (str): name of server to send to
edu_type (str): type of EDU to send
content (dict): content of EDU
key (Any|None): clobbering key for this edu
bucket_id (int|None): fixme
Member

fixme

finally:
# if the bucket is now empty, we need to destroy it.
# (we do this in a finally block so that if the caller pops exactly the
# right number of EDUs, it is still called.
Member

I don't think the finally block gets called in that case either?

attempt_transaction_cb = attr.ib() # type: callable

# edu buckets for this destination, keyed by edu bucket ID
edu_buckets = attr.ib(factory=dict) # type: dict[int, EduTransmissionBucket]
Member

Given there are only two buckets, I'd be sorely tempted to just have them as two attributes. It also saves from having to figure out when to create/delete a bucket or not.

@@ -714,3 +747,183 @@ def json_data_cb():
success = False

defer.returnValue(success)


@attr.s
Member

I'd be tempted to set slots=True
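For context, a tiny illustration of what slots=True buys (names here are illustrative): attribute typos fail loudly instead of silently creating new attributes, and instances skip the per-instance __dict__.

```python
import attr


@attr.s(slots=True)
class Example(object):
    transmission_task = attr.ib(default=None)


e = Example()
try:
    e.transmision_task = object()   # note the typo: slots turns this into an AttributeError
except AttributeError:
    pass                            # without slots it would silently add a new attribute
```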

task.cancel()


@attr.s
Member

I'd be tempted to set slots=True

queue = self.pending_edus_by_dest.get(dest)
if not queue:
def tx_cb():
self._attempt_new_transaction(dest)
Member

I'd be tempted to give PerDestinationQueue the destination so that we don't have to keep creating new callbacks

if bucket_id == INSTANT_EDU_BUCKET_ID:
delay = 0
elif bucket_id == DELAYED_EDU_BUCKET_ID:
delay = 5000
Member

Can we pull this constant up to the top too? It feels like something we may want to fiddle with

Member

Also, do we want this to be smeared out a bit to further reduce any stampeding?
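A sketch of the smearing idea, assuming a fixed base delay plus a random offset; the values are guesses, not from the PR:

```python
import random

DELAYED_FLUSH_MS = 5000   # base delay for the "Delayed" bucket
SMEAR_MS = 2000           # extra random spread; value is a guess


def delayed_flush_delay_ms():
    """Base delay plus a random smear, so per-destination timers don't all fire at once."""
    return DELAYED_FLUSH_MS + random.randint(0, SMEAR_MS)
```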

self.edu_buckets[bucket_id] = bucket = EduTransmissionBucket(
bucket_transmission_time=transmission_time,
transmission_task=transmit_task,
)
Member

Given all the management of the task happens in PerDestinationQueue it feels odd for it to be added to the EduTransmissionBucket class.

return bucket

if bucket_id == INSTANT_EDU_BUCKET_ID:
delay = 0
Member

I'm a bit wary of bouncing things via the reactor given this is a hot path tbh. It's also a bit of a footgun if we accidentally yield between calling get_or_create_edu_bucket and inserting things into the bucket.

@richvdh
Member Author

richvdh commented Mar 12, 2019

as discussed in #4730, we're going to do this differently.

@richvdh richvdh closed this Mar 12, 2019
@richvdh richvdh deleted the rav/bucket_outgoing_edus branch December 1, 2020 12:40