Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change exponential backoff algorithm for federation send #4597

Merged
merged 6 commits into from
Apr 9, 2024

Conversation

Nutomic
Copy link
Member

@Nutomic Nutomic commented Apr 5, 2024

Lemmy.world had a problem with incoming federation where /inbox requests were failing, so FederationQueueState::fail_count kept increasing. On lemmy.ml the value reached 16, which results in a sleep of 18 hours before the next activity send would be attempted. This is way too long so that it was necessary to reset the count manually.

This shouldnt be necessary, Lemmy shouldnt wait such a long time so Ive added a limit of one hour. Alternatively it may be possible to calculate the sleep interval differently instead of 2^n but I didnt find any resources for exponential backoff which match our use case.

@phiresky I noticed another problem with the outgoing federation workers. It looks like the list of dead and alive instances is never reloaded after starting. So if an instance is marked dead when Lemmy starts, it will never attempt to send any activities until Lemmy is restarted. Similarly, if an instance is alive at the start and marked as dead later, Lemmy will still attempt to send activities forever, according to the sleep interval. Am I missing anything, or do you have an idea how to fix this?

@@ -17,7 +17,7 @@ export RUST_BACKTRACE=1

if [ -n "$PACKAGE" ];
then
cargo test -p $PACKAGE --all-features --no-fail-fast
cargo test -p $PACKAGE --all-features --no-fail-fast $TEST
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can run a single test with this without manually editing the script each time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

@phiresky
Copy link
Collaborator

phiresky commented Apr 5, 2024

calculate the sleep interval differently instead

Right now, the base of the exponent is 2. This means that for any downtime of X hours, the maximum time until the federation resumes is another X hours - basically at maximum doubling the downtime. The expected / average value should be half the downtime I think (due to discretization).

Maybe just changing the constant to e.g. 1.5 or 1.25 would be good? That would mean at maximum the time to resume would be 1/2 or 1/4 of the downtime.

Here's a simulation of downtimes with different exponential delays:

2**retries (current in lemmy)

Instance DowntimeFederation Resumed At (Including Downtime)Total Retry Attempts
1m 0s1m 3s6
10m 0s17m 3s10
30m 0s34m 7s11
1h 0m1h 8m12
2h 0m2h 16m13
3h 0m4h 33m14
6h 0m9h 6m15
12h 0m18h 12m16
24h 0m36h 24m17
48h 0m72h 49m18

1.5**retries

Instance DowntimeFederation Resumed At (Including Downtime)Total Retry Attempts
1m 0s1m 15s9
10m 0s14m 34s15
30m 0s32m 49s17
1h 0m1h 13m19
2h 0m2h 46m21
3h 0m4h 9m22
6h 0m6h 14m23
12h 0m14h 1m25
24h 0m31h 33m27
48h 0m71h 1m29

1.25**retries

Instance DowntimeFederation Resumed At (Including Downtime)Total Retry Attempts
1m 0s1m 9s13
10m 0s11m 14s23
30m 0s34m 24s28
1h 0m1h 7m31
2h 0m2h 11m34
3h 0m3h 25m36
6h 0m6h 41m39
12h 0m13h 3m42
24h 0m25h 30m45
48h 0m49h 49m48

code to generate these tables

Also I was actually thinking that it might be smart to go the opposite direction than this PR and remove the "dead" flag altogether and instead just keep exponentially increasing the retry delays for all instances - maybe with a limit of a week or so.

Capping at an hour I guess kinda works, then we bascially have an exponential delay between 0s and 1h, then fixed at every 1h for a day, and after that fixed at 24h (the dead check). But reducing the base of the exponentiation seems cleaner.

@phiresky
Copy link
Collaborator

phiresky commented Apr 5, 2024

I noticed another problem with the outgoing federation workers. It looks like the list of dead and alive instances is never reloaded after starting.

I don't think that's true, the list of instances is fully refreshed in the loop in start_stop_federation_workers, which happens every INSTANCES_RECHECK_DELAY (once per minute)

@Nutomic
Copy link
Member Author

Nutomic commented Apr 5, 2024

Also I was actually thinking that it might be smart to go the opposite direction than this PR and remove the "dead" flag altogether and instead just keep exponentially increasing the retry delays for all instances - maybe with a limit of a week or so.

That seems like a good idea, it means there is one less value to check and less complexity. However I dont like the values you are showing. If an instance is down for 3 hours it will be defederated for 6.5 hours before federation resumes. That doesnt make sense because its extremely cheap to send an inbox request, so we can easily do it once per hour during the entire first day. For the limit one day should be fine, its what we are using now as well.

I don't think that's true, the list of instances is fully refreshed in the loop in start_stop_federation_workers, which happens every INSTANCES_RECHECK_DELAY (once per minute)

True, I missed the loop there.

@phiresky
Copy link
Collaborator

phiresky commented Apr 5, 2024

If an instance is down for 3 hours it will be defederated for 6.5 hours

I think you might be misreading my tables (they are a bit confusing), the second column is the total duration including the downtime, so for the 3 h downtime the additional delay until federation resumes would be:

backoff additional downtime
2^N 1h 33 min
1.5^N 1h 09 min
1.25^N 0h 25 min

With 1.25^N the delay until refederation is around or less than your fixed 1h limit all the way up to an instance downtime of > 24h

@Nutomic
Copy link
Member Author

Nutomic commented Apr 8, 2024

Okay in that case it sounds good, Ive changed it to 1.25^n. And I also changed it to ignore the first error, so it doesnt sleep over a second with only a single failure. Instead there need to be at least two failures before it starts sleeping.

I also thought about getting not marking instances as dead anymore. But then we would have to keep thousands of federation workers around for dead instances which do nothing but sleep for a very long time. We could instead mark instances as dead based on last_successful_published_time, but that isnt available in start_stop_federation_workers() so it would be unnecessarily complicated.

@Nutomic Nutomic changed the title Limit federation send retry interval to one hour Change exponential backoff algorithm for federation send Apr 8, 2024
@@ -17,7 +17,7 @@ export RUST_BACKTRACE=1

if [ -n "$PACKAGE" ];
then
cargo test -p $PACKAGE --all-features --no-fail-fast
cargo test -p $PACKAGE --all-features --no-fail-fast $TEST
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

@dessalines dessalines merged commit b467098 into main Apr 9, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants