Change exponential backoff algorithm for federation send #4597

Nutomic · 2024-04-05T09:35:09Z

Lemmy.world had a problem with incoming federation where /inbox requests were failing, so `FederationQueueState::fail_count` kept increasing. On lemmy.ml the value reached 16, which results in a sleep of about 18 hours (2^16 seconds) before the next activity send is attempted. This is far too long, so the count had to be reset manually.

This shouldn't be necessary: Lemmy shouldn't wait such a long time, so I've added a limit of one hour. Alternatively, it may be possible to calculate the sleep interval differently instead of 2^n, but I didn't find any resources on exponential backoff that match our use case.
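To make the numbers concrete, here is a minimal Python sketch of the 2^n schedule with and without a one-hour cap. The function name `capped_backoff_seconds` and the `max_sleep` parameter are made up for illustration; this is not Lemmy's actual (Rust) code.

```python
from typing import Optional


def capped_backoff_seconds(fail_count: int, max_sleep: Optional[int] = None) -> int:
    """Sleep time in seconds after `fail_count` consecutive failures (2^n backoff)."""
    delay = 2 ** fail_count
    if max_sleep is not None:
        # Hypothetical cap, e.g. 3600 for the one-hour limit described above.
        delay = min(delay, max_sleep)
    return delay


uncapped = capped_backoff_seconds(16)
capped = capped_backoff_seconds(16, max_sleep=3600)
# prints "uncapped: 65536 s (~18.2 h), capped: 3600 s"
print(f"uncapped: {uncapped} s (~{uncapped / 3600:.1f} h), capped: {capped} s")
```

With a fail count of 16 the uncapped sleep matches the roughly 18 hours seen on lemmy.ml, while small fail counts are unaffected by the cap.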

@phiresky I noticed another problem with the outgoing federation workers. It looks like the list of dead and alive instances is never reloaded after starting. So if an instance is marked dead when Lemmy starts, it will never attempt to send any activities until Lemmy is restarted. Similarly, if an instance is alive at the start and marked as dead later, Lemmy will still attempt to send activities forever, according to the sleep interval. Am I missing anything, or do you have an idea how to fix this?

Nutomic · 2024-04-05T10:10:56Z

scripts/test.sh

@@ -17,7 +17,7 @@ export RUST_BACKTRACE=1

 if [ -n "$PACKAGE" ];
 then
-  cargo test -p $PACKAGE --all-features --no-fail-fast
+  cargo test -p $PACKAGE --all-features --no-fail-fast $TEST


With this, a single test can be run by setting `TEST`, without manually editing the script each time.

phiresky · 2024-04-05T13:42:53Z

> calculate the sleep interval differently instead

Right now, the base of the exponent is 2. This means that for any downtime of X hours, the maximum time until the federation resumes is another X hours - basically at maximum doubling the downtime. The expected / average value should be half the downtime I think (due to discretization).

Maybe just changing the constant to e.g. 1.5 or 1.25 would be good? That would mean at maximum the time to resume would be 1/2 or 1/4 of the downtime.
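A quick way to check these fractions, assuming the sleep before retry n+1 is base^n seconds, so that retry times are partial sums of a geometric series: if the instance comes back just after a failed retry, the extra wait is the whole next sleep interval, and its ratio to the downtime tends to base - 1. The helper names below are hypothetical.

```python
def retry_time(base: float, n: int) -> float:
    """Time in seconds of the n-th retry: base^0 + base^1 + ... + base^(n-1)."""
    return (base ** n - 1) / (base - 1)


def worst_case_extra_ratio(base: float, n: int) -> float:
    """Extra wait relative to the downtime if the instance recovers just after retry n."""
    downtime = retry_time(base, n)
    resumed = retry_time(base, n + 1)
    return (resumed - downtime) / downtime


for base in (2, 1.5, 1.25):
    # The ratio approaches base - 1 as n grows: about 1, 1/2 and 1/4.
    print(base, round(worst_case_extra_ratio(base, 60), 3))
```

This is only the worst case; on average the instance recovers somewhere in the middle of a sleep interval, so the typical extra wait is smaller.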

Here's a simulation of downtimes with different exponential delays:

`2**retries` (current in lemmy)

| Instance Downtime | Federation Resumed At (Including Downtime) | Total Retry Attempts |
| --- | --- | --- |
| 1m 0s | 1m 3s | 6 |
| 10m 0s | 17m 3s | 10 |
| 30m 0s | 34m 7s | 11 |
| 1h 0m | 1h 8m | 12 |
| 2h 0m | 2h 16m | 13 |
| 3h 0m | 4h 33m | 14 |
| 6h 0m | 9h 6m | 15 |
| 12h 0m | 18h 12m | 16 |
| 24h 0m | 36h 24m | 17 |
| 48h 0m | 72h 49m | 18 |

`1.5**retries`

| Instance Downtime | Federation Resumed At (Including Downtime) | Total Retry Attempts |
| --- | --- | --- |
| 1m 0s | 1m 15s | 9 |
| 10m 0s | 14m 34s | 15 |
| 30m 0s | 32m 49s | 17 |
| 1h 0m | 1h 13m | 19 |
| 2h 0m | 2h 46m | 21 |
| 3h 0m | 4h 9m | 22 |
| 6h 0m | 6h 14m | 23 |
| 12h 0m | 14h 1m | 25 |
| 24h 0m | 31h 33m | 27 |
| 48h 0m | 71h 1m | 29 |

`1.25**retries`

| Instance Downtime | Federation Resumed At (Including Downtime) | Total Retry Attempts |
| --- | --- | --- |
| 1m 0s | 1m 9s | 13 |
| 10m 0s | 11m 14s | 23 |
| 30m 0s | 34m 24s | 28 |
| 1h 0m | 1h 7m | 31 |
| 2h 0m | 2h 11m | 34 |
| 3h 0m | 3h 25m | 36 |
| 6h 0m | 6h 41m | 39 |
| 12h 0m | 13h 3m | 42 |
| 24h 0m | 25h 30m | 45 |
| 48h 0m | 49h 49m | 48 |

code to generate these tables
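The linked code is not reproduced in this thread. As a rough stand-in, the following Python sketch produces rows of the same shape, assuming the model the tables appear to follow: the sender sleeps `base**attempts` seconds before each attempt, and the first attempt made at or after the end of the downtime succeeds. The names `simulate` and `fmt` are hypothetical.

```python
from typing import Tuple


def simulate(downtime_s: float, base: float) -> Tuple[float, int]:
    """Return (seconds until federation resumes, number of attempts made)."""
    elapsed = 0.0
    attempts = 0
    while True:
        elapsed += base ** attempts  # sleep before the next attempt
        attempts += 1
        if elapsed >= downtime_s:  # instance is reachable again, attempt succeeds
            return elapsed, attempts


def fmt(seconds: float) -> str:
    """Format like the tables: 'Xm Ys' below one hour, 'Xh Ym' from one hour on."""
    total = round(seconds)
    if total < 3600:
        return f"{total // 60}m {total % 60}s"
    return f"{total // 3600}h {total % 3600 // 60}m"


downtimes = [60, 600, 1800, 3600, 7200, 10800, 21600, 43200, 86400, 172800]
for base in (2, 1.5, 1.25):
    print(f"{base}**retries")
    for downtime in downtimes:
        resumed, attempts = simulate(downtime, base)
        print(f"{fmt(downtime)}\t{fmt(resumed)}\t{attempts}")
```

The extra delay for a given downtime is the resumed time minus the downtime itself.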

Also I was actually thinking that it might be smart to go the opposite direction than this PR and remove the "dead" flag altogether and instead just keep exponentially increasing the retry delays for all instances - maybe with a limit of a week or so.

Capping at an hour kind of works, I guess: then we basically have an exponential delay between 0s and 1h, then a fixed retry every 1h for a day, and after that a fixed retry every 24h (the dead check). But reducing the base of the exponentiation seems cleaner.

phiresky · 2024-04-05T13:46:46Z

> I noticed another problem with the outgoing federation workers. It looks like the list of dead and alive instances is never reloaded after starting.

I don't think that's true, the list of instances is fully refreshed in the loop in start_stop_federation_workers, which happens every INSTANCES_RECHECK_DELAY (once per minute)

Nutomic · 2024-04-05T14:47:08Z

> Also I was actually thinking that it might be smart to go the opposite direction than this PR and remove the "dead" flag altogether and instead just keep exponentially increasing the retry delays for all instances - maybe with a limit of a week or so.

That seems like a good idea; it means there is one less value to check and less complexity. However, I don't like the values you are showing. If an instance is down for 3 hours it will be defederated for 6.5 hours before federation resumes. That doesn't make sense, because it's extremely cheap to send an inbox request, so we can easily do it once per hour during the entire first day. For the limit, one day should be fine; it's what we are using now as well.

> I don't think that's true, the list of instances is fully refreshed in the loop in start_stop_federation_workers, which happens every INSTANCES_RECHECK_DELAY (once per minute)

True, I missed the loop there.

phiresky · 2024-04-05T15:04:35Z

> If an instance is down for 3 hours it will be defederated for 6.5 hours

I think you might be misreading my tables (they are a bit confusing), the second column is the total duration including the downtime, so for the 3 h downtime the additional delay until federation resumes would be:

| backoff | additional downtime |
| --- | --- |
| 2^N | 1h 33 min |
| 1.5^N | 1h 09 min |
| 1.25^N | 0h 25 min |

With 1.25^N the delay until refederation is around or less than your fixed 1h limit all the way up to an instance downtime of > 24h

Nutomic · 2024-04-08T09:52:28Z

Okay, in that case it sounds good; I've changed it to 1.25^n. I also changed it to ignore the first error, so it doesn't sleep for over a second after only a single failure. Instead, there need to be at least two failures before it starts sleeping.

I also thought about no longer marking instances as dead. But then we would have to keep thousands of federation workers around for dead instances, which do nothing but sleep for a very long time. We could instead mark instances as dead based on `last_successful_published_time`, but that isn't available in `start_stop_federation_workers()`, so it would be unnecessarily complicated.
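As an illustration of the behaviour described here, a minimal Python sketch follows. It is not the Rust code from this PR. The function name `retry_sleep_seconds` is hypothetical, and the one-day upper bound is an assumption taken from the earlier remark in this thread that a limit of one day should be fine.

```python
MAX_SLEEP_S = 24 * 60 * 60  # assumed upper bound of one day, in seconds


def retry_sleep_seconds(fail_count: int, base: float = 1.25) -> float:
    """Sleep time before the next send attempt after `fail_count` failures."""
    if fail_count <= 1:
        # The first failure is ignored: retry immediately.
        return 0.0
    try:
        delay = base ** fail_count
    except OverflowError:
        # Very large fail counts overflow a float, so fall back to the cap.
        return float(MAX_SLEEP_S)
    return min(delay, float(MAX_SLEEP_S))


for n in (1, 2, 10, 30, 60):
    print(n, round(retry_sleep_seconds(n), 1))
```

Handling the overflow explicitly keeps the function total for any fail count, instead of relying on callers to bound the counter.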

Limit federation send retry interval to one hour

3f9e182

Nutomic requested review from dessalines, phiresky and SleeplessOne1917 as code owners April 5, 2024 09:35

Nutomic added 2 commits April 5, 2024 12:11

clippy

b9f22ab

avoid overflow

cd00be9

Nutomic added 2 commits April 8, 2024 11:25

change base for exp backoff

e4abdef

ignore first error

2c56bf8

Nutomic requested a review from dullbananas as a code owner April 8, 2024 09:45

fix day duration

494b275

phiresky approved these changes Apr 8, 2024

Nutomic changed the title ~~Limit federation send retry interval to one hour~~ Change exponential backoff algorithm for federation send Apr 8, 2024

dessalines approved these changes Apr 9, 2024

dessalines merged commit b467098 into main Apr 9, 2024
2 checks passed

phiresky mentioned this pull request Oct 30, 2024

[Bug]: Table locks / slow queries on 0.19.6 betas #4983

Closed
