Recovery terms in per-queue files instead of DETS #7726

mkuratczyk · 2023-03-23T16:20:49Z

Note: this PR maintains backwards compatibility; a separate PR will remove the old bits

Per-vhost DETS file with recovery terms for all queues is a bottleneck when stopping RabbitMQ - all queues try save their state, leading to a very long file server mailbox and very unpredictable time to stop RabbitMQ (on my machine it can vary from 20 seconds to 5 minutes with 100k classic queues).

In this PR we can still read the recovery terms from DETS but we only save them in per-queue files. This way each queue can quickly store its state. Under the same condition, my machine can consistently stop RabbitMQ in 15 seconds or so.

The tradeoff is a slower startup time: on my machine, it goes up from 29 seconds to 38 seconds, but that's still better than what we had until #7676 was merged a few days ago. More importantly, the total of stop+start is lower and more predictable.

This PR also improves shutdown with many classic queues v1. Startup time with 100k CQv1s is so long and unpredictable that it's hard to even tell if this PR affects it (it varies from 4 to 8 minutes for me).

Unfortunately this PR makes startup on MacOS slower (~55s instead of 30s for me), but we don't have to optimise for that. In most cases (with much fewer queues), it won't be noticeable anyway.

Per-vhost DETS file with recovery terms for all queues is a bottleneck when stopping RabbitMQ - all queues try save their state, leading to a very long file server mailbox and very unpredictable time to stop RabbitMQ (on my machine it can vary from 20 seconds to 5 minutes with 100k classic queues). In this PR we can still read the recovery terms from DETS but we only save them in per-queue files. This way each queue can quickly store its state. Under the same condition, my machine can consistently stop RabbitMQ in 15 seconds or so. The tradeoff is a slower startup time: on my machine, it goes up from 29 seconds to 38 seconds, but that's still better than what we had until #7676 was merged a few days ago. More importantly, the total of stop+start is lower and more predictable. This PR also improves shutdown with many classic queues v1. Startup time with 100k CQv1s is so long and unpredictable that it's hard to even tell if this PR affects it (it varies from 4 to 8 minutes for me). Unfortunately this PR makes startup on MacOS slower (~55s instead of 30s for me), but we don't have to optimise for that. In most cases (with much fewer queues), it won't be noticable anyway.

mkuratczyk · 2023-03-24T07:53:23Z

Unfortunately the startup time trade-off of this approach is very high in my GKE tests: 52s for v3.12.x vs 136s with this branch. Shutdown time is better by 20s but that's not enough to offset the startup. Back to the drawing board.

mkuratczyk · 2023-03-24T10:31:03Z

A completely different approach is in the works.

mkuratczyk closed this Mar 24, 2023

This was referenced Mar 24, 2023

Do not use DETS ram file for recovery terms #7736

Closed

Recovery terms: use ram_file on start, but not on shutdown #7745

Merged

michaelklishin deleted the per-queue-recovery-terms branch October 30, 2023 03:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recovery terms in per-queue files instead of DETS #7726

Recovery terms in per-queue files instead of DETS #7726

mkuratczyk commented Mar 23, 2023

mkuratczyk commented Mar 24, 2023

mkuratczyk commented Mar 24, 2023

Recovery terms in per-queue files instead of DETS #7726

Recovery terms in per-queue files instead of DETS #7726

Conversation

mkuratczyk commented Mar 23, 2023

mkuratczyk commented Mar 24, 2023

mkuratczyk commented Mar 24, 2023