Fix CMQ crash when master goes down, queue length at x-max-length limit with consumers connected #7579
Merged
Conversation
I'm afraid I can still reproduce the issues on the PR branch:
essen force-pushed the lh-fix-v2-mirrored-queues-again branch 3 times, most recently from 5461c2f to 0b0bc0b on March 21, 2023 at 09:04
essen force-pushed the lh-fix-v2-mirrored-queues-again branch from 0b0bc0b to 8da5acf on March 23, 2023 at 13:48
lhoguin changed the title from "Fix ack crashes when using CMQs with v2" to "Fix CMQ crash when master goes down, queue length at x-max-length limit with consumers connected" on Mar 23, 2023
When a node goes down, a slave gets promoted to master. When this happens, the new master requeues all messages pending acks. If x-max-length is defined and the queue length after the requeue goes over the limit, the new master will start dropping messages immediately. This causes issues for the other slaves, because they do not requeue their messages automatically; instead they wait for the new master to tell them what to do. This eventually triggers an assert, because the queue lengths are unexpectedly out of sync when the first drop message is propagated to the cluster. This issue must have been present for a very long time, probably since e352608.

The fix is to make the new master propagate the requeues when it gets promoted.

To reproduce: start a cluster, set ha-mode: all via policies, and start perf-test with the following arguments:

    perf-test -x 1 -y 1 -r 10000 -R 50 -c 500 -s 1000 -u v2 \
      -qa x-queue-version=2,x-max-length=10000 -ad false -f persistent

Wait a little bit for the queue to have 10000+ ready messages (not total; the total will be more) and then kill the master node (usually the first pid that 'ps -aux | grep beam' gives you). The crashes will be logged on the slave node that was not promoted (node 2 in my case).
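The divergence described above can be sketched with a toy simulation. This is not RabbitMQ code: the names, the list-based queue, and the drop-from-head behaviour are simplifying assumptions chosen only to illustrate why the mirror's length falls out of sync with the new master's, and why propagating the requeue first keeps them aligned.

```python
MAX_LEN = 5  # stand-in for x-max-length

# Queue already at the limit; head of the queue is index 0.
master_ready = [f"m{i}" for i in range(MAX_LEN)]
pending_acks = ["p0", "p1"]            # delivered to consumers, not yet acked
mirror_ready = list(master_ready)      # the mirror sees the same ready messages

# Promotion: the new master requeues its pending acks at the head...
master_ready = pending_acks + master_ready

# ...which pushes it over x-max-length, so it drops from the head and
# tells the mirrors to apply the same drops.
drops = []
while len(master_ready) > MAX_LEN:
    drops.append(master_ready.pop(0))

# Buggy behaviour: the mirror never requeued, so applying the drops
# desynchronises the queue lengths (the condition behind the assert).
buggy_mirror = list(mirror_ready)
for _ in drops:
    buggy_mirror.pop(0)
assert len(buggy_mirror) != len(master_ready)

# With the fix, the new master propagates the requeue first, so the
# mirror applies the same requeue before the drops and stays in sync.
fixed_mirror = pending_acks + list(mirror_ready)
for _ in drops:
    fixed_mirror.pop(0)
assert fixed_mirror == master_ready
```

In this toy version the buggy mirror ends up two messages shorter than the master, while the fixed mirror ends up identical to it; the real fix operates on the mirror protocol rather than on Python lists, but the ordering argument is the same.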
mkuratczyk force-pushed the lh-fix-v2-mirrored-queues-again branch from 8da5acf to f22029d on March 24, 2023 at 09:30
I've been chaos-testing this fix for 24 hours on two environments. That's much longer than the time-to-crash observed before. Seems like that's it. Thanks!
mergify bot pushed a commit that referenced this pull request on Mar 24, 2023, with the same commit message as the description above (cherry picked from commit db1e420)
mkuratczyk pushed a commit that referenced this pull request on Mar 24, 2023, with the same commit message (cherry picked from commit db1e420; Co-authored-by: Loïc Hoguin <[email protected]>)
This probably affects all released versions for the past 12 years. I have only verified that the issue exists in the v3.9.x, v3.10.x, v3.11.x, v3.12.x, and main branches. This should be backported at minimum to v3.12.x so we can do further chaos testing. Whether it should be backported to v3.11.x or earlier is an open question.