Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recovery terms in per-queue files instead of DETS #7726

Closed
wants to merge 1 commit into from

Conversation

mkuratczyk
Copy link
Contributor

Note: this PR maintains backwards compatibility; a separate PR will remove the old bits

Per-vhost DETS file with recovery terms for all queues is a bottleneck when stopping RabbitMQ - all queues try save their state, leading to a very long file server mailbox and very unpredictable time to stop RabbitMQ (on my machine it can vary from 20 seconds to 5 minutes with 100k classic queues).

In this PR we can still read the recovery terms from DETS but we only save them in per-queue files. This way each queue can quickly store its state. Under the same condition, my machine can consistently stop RabbitMQ in 15 seconds or so.

The tradeoff is a slower startup time: on my machine, it goes up from 29 seconds to 38 seconds, but that's still better than what we had until #7676 was merged a few days ago. More importantly, the total of stop+start is lower and more predictable.

This PR also improves shutdown with many classic queues v1. Startup time with 100k CQv1s is so long and unpredictable that it's hard to even tell if this PR affects it (it varies from 4 to 8 minutes for me).

Unfortunately this PR makes startup on MacOS slower (~55s instead of 30s for me), but we don't have to optimise for that. In most cases (with much fewer queues), it won't be noticeable anyway.

Per-vhost DETS file with recovery terms for all queues is a bottleneck
when stopping RabbitMQ - all queues try save their state, leading
to a very long file server mailbox and very unpredictable time
to stop RabbitMQ (on my machine it can vary from 20 seconds to 5 minutes
with 100k classic queues).

In this PR we can still read the recovery terms from DETS but we only
save them in per-queue files. This way each queue can quickly store its
state. Under the same condition, my machine can consistently stop
RabbitMQ in 15 seconds or so.

The tradeoff is a slower startup time: on my machine, it goes up from
29 seconds to 38 seconds, but that's still better than what we had until
#7676 was merged a few
days ago. More importantly, the total of stop+start is lower and more
predictable.

This PR also improves shutdown with many classic queues v1.
Startup time with 100k CQv1s is so long and unpredictable that it's hard
to even tell if this PR affects it (it varies from 4 to 8 minutes for me).

Unfortunately this PR makes startup on MacOS slower (~55s instead of 30s
for me), but we don't have to optimise for that. In most cases (with
much fewer queues), it won't be noticable anyway.
@mkuratczyk
Copy link
Contributor Author

Unfortunately the startup time trade-off of this approach is very high in my GKE tests: 52s for v3.12.x vs 136s with this branch. Shutdown time is better by 20s but that's not enough to offset the startup. Back to the drawing board.

@mkuratczyk
Copy link
Contributor Author

A completely different approach is in the works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant