-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Fix a bug where servers could be marked as up when they were failing #16506
Conversation
@@ -0,0 +1 @@ | |||
Fix a bug introduced in Synapse 1.59.0 where servers would be incorrectly marked as available when a request resulted in an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there's a more visible symptom? Perhaps it would cause things to be retried too often?
synapse/util/retryutils.py
Outdated
# the notifier. | ||
self.replication_client.send_remote_server_up(self.destination) | ||
# If the server was previously failing, but is no longer. | ||
if previously_failing: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erikjohnston this might need some thoughts from you as the original author of #12500 -- was this done on purpose and I'm missing some understanding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think this is not quite right still, it will end up calling this code if we were previously failing & still failing. I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think this is not quite right still, it will end up calling this code if we were previously failing & still failing. I think?
It should be OK now. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic could be simplified to only check not currently_failing
, which depends on the earlier return to not kick off the background process. But I find this a bit clearer to include previously_failing and not currently_failing
. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems legit to me!
As described in #15226, the
RetryDestinationLimiter
seems to mark servers as up when the connection fails. It seems to be a regression from #12500, see #12500 (comment).I think this will cause the notifier to be woken up and additional replication traffic. I think it will then cause federation traffic via a new transaction being sent to the unreachable server?