-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Collect outgoing read-receipts into buckets #4777
Conversation
In worker mode, on the federation sender, when we receive an edu for sending over the replication socket, it is parsed into an Edu object. There is no point extracting the contents of it so that we can then immediately build another Edu.
In order to reduce the number of outgoing federation transactions, we want to aggregate read-receipts, since there is little harm if outgoing RRs are delayed by a few seconds, and we can dramatically reduce the number of transactions by doing so. This change introduces the concept of EDU 'buckets'; currently we have the 'Instant' bucket and the 'Delayed' bucket. Fixes #4730, #3951.
Codecov Report
@@ Coverage Diff @@
## develop #4777 +/- ##
===========================================
- Coverage 75.09% 75.06% -0.03%
===========================================
Files 340 340
Lines 34923 35147 +224
Branches 5723 5792 +69
===========================================
+ Hits 26225 26383 +158
- Misses 7088 7132 +44
- Partials 1610 1632 +22 |
i'm a bit worried that 5s is quite a long time to delay outbound RRs for: if i'm talking to someone over federation, i'd be a bit worried if it seemed there was a 5s delay on them reading my messages every time I said something - and i'd be surprised if they started sending responses to my messages before I'd received their RR. Can we make the 5s configurable? Or only kick in if we're overloaded? |
actually, given RRs cosmetically update whenever you receive a msg from someone, perhaps it’s not that bad. would still be nice if it only kicked in during crises tho? |
It's really disorienting as-is having them be a couple seconds delayed. I would highly suggest this only kicks in when there's significant load. |
hrm. Reliably determining that there is significant load is tricky. |
I wonder if we should only batch up read receipts which are a fair bit older than the event? As if the read receipt is sent withing a few 100ms of the message being sent its quite bad to then delay it by several seconds, but if the read receipt is sent several minutes after then its not the end of the world to delay it being sent by a few seconds. The only annoyance I can think of there is if a remote user sees a typing notification before a read receipt from the user, but skimming the code it appears that a typing notification will (probably) cause the read receipt to also be sent? |
I think this will do little to deal with the main problem, which is that, during a period of busy traffic in a large room, we send a new federation transaction to every server in the room every time someone on matrix.org reads each message. I'll try and get some actual data on this.
That's the idea, yes. |
TBH, my hunch is that it will help if we set it at 30s or a minute, but getting some metrics on it would be grand |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks broadly good, I've mainly struggled to follow along with PerDestinationQueue
and EduTransmissionBucket
. I worry a bit that being generic over the buckets has added unnecessary complexity to the thing. I've added some thoughts/suggestions/comments around parts I find a bit confusing, though they don't necessarily point to an obvious solution.
I wonder if we should change PerDestinationQueue
to:
- Have an
add_edu(edu, bucket)
API, rather than returning a bucket to avoid races - Have either a) two hardcoded buckets for "instant" vs "delayed" EDUs or b) a PriorityQueue which maintains the correct order. This saves from having to figure out when to create/destroy buckets.
- When adding an edu call
attempt_new_transaction
immediately if edu has instant bucket - Move the transmission clock in the
PerDestinationQueue
, since there should only be one per host, rather than one per bucket.
"""Construct an Edu object, and queue it for sending | ||
|
||
Args: | ||
destination (str): name of server to send to | ||
edu_type (str): type of EDU to send | ||
content (dict): content of EDU | ||
key (Any|None): clobbering key for this edu | ||
bucket_id (int|None): fixme |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixme
finally: | ||
# if the bucket is now empty, we need to destroy it. | ||
# (we do this in a finally block so that if the caller pops exactly the | ||
# right number of EDUs, it is still called. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the finally block gets called in that case either?
attempt_transaction_cb = attr.ib() # type: callable | ||
|
||
# edu buckets for this destination, keyed by edu bucket ID | ||
edu_buckets = attr.ib(factory=dict) # type: dict[int, EduTransmissionBucket] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given there are only two buckets, I'd be sorely tempted to just have them as two attributes. It also saves from having to figure out when to create/delete a bucket or not.
@@ -714,3 +747,183 @@ def json_data_cb(): | |||
success = False | |||
|
|||
defer.returnValue(success) | |||
|
|||
|
|||
@attr.s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be tempted to set slots=True
task.cancel() | ||
|
||
|
||
@attr.s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be tempted to set slots=True
queue = self.pending_edus_by_dest.get(dest) | ||
if not queue: | ||
def tx_cb(): | ||
self._attempt_new_transaction(dest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be tempted to give PerDestinationQueue
the destination so that we don't have to keep creating new callbacks
if bucket_id == INSTANT_EDU_BUCKET_ID: | ||
delay = 0 | ||
elif bucket_id == DELAYED_EDU_BUCKET_ID: | ||
delay = 5000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pull this constant up to the top too? It feels like something we may want to fiddle with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we want this to be smeared out a bit to further reduce any stampeding?
self.edu_buckets[bucket_id] = bucket = EduTransmissionBucket( | ||
bucket_transmission_time=transmission_time, | ||
transmission_task=transmit_task, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given all the management of the task happens in PerDestinationQueue
it feels odd for it to be added to the EduTransmissionBucket
class.
return bucket | ||
|
||
if bucket_id == INSTANT_EDU_BUCKET_ID: | ||
delay = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit wary of bouncing things via the reactor given this is a hot path tbh. It's also a bit of a footgun if we accidentally yield between calling get_or_create_edu_bucket
and inserting things into the bucket.
as discussed in #4730, we're going to do this differently. |
In order to reduce the number of outgoing federation transactions, we want
to aggregate read-receipts, since there is little harm if outgoing RRs are
delayed by a few seconds, and we can dramatically reduce the number of
transactions by doing so.
This change introduces the concept of EDU 'buckets'; currently we have the
'Instant' bucket and the 'Delayed' bucket.
Builds on #4770, #4771
Fixes #4730, #3951.