
Fix CMQ crash when master goes down, queue length at x-max-length limit with consumers connected #7579

Merged 1 commit into main from lh-fix-v2-mirrored-queues-again on Mar 24, 2023

Conversation

@lhoguin (Contributor) commented Mar 10, 2023

Please check the commit message for details.

This probably affects all released versions from the past 12 years. I have only confirmed that the issue exists in the v3.9.x, v3.10.x, v3.11.x, v3.12.x and main branches. This should be backported at least to v3.12.x so we can do further chaos testing. Whether it should also be backported to v3.11.x or earlier is an open question.

@mkuratczyk (Contributor)
I'm afraid I can still reproduce the issues on the PR branch:

errorContext: child_terminated
reason: {{badmatch,-1},
         [{rabbit_mirror_queue_slave,update_delta,2,
              [{file,"rabbit_mirror_queue_slave.erl"},{line,1064}]},
          {rabbit_mirror_queue_slave,process_instruction,2,
              [{file,"rabbit_mirror_queue_slave.erl"},{line,983}]},
          {rabbit_mirror_queue_slave,handle_cast,2,
              [{file,"rabbit_mirror_queue_slave.erl"},{line,312}]},
          {gen_server2,handle_msg,2,
              [{file,"gen_server2.erl"},{line,1067}]},
          {proc_lib,init_p_do_apply,3,
              [{file,"proc_lib.erl"},{line,240}]}]}
offender: [{pid,<0.570.0>},
           {id,rabbit_amqqueue},
           {mfargs,
               {rabbit_prequeue,start_link,
                   [{amqqueue,
                        {resource,<<"/">>,queue,<<"fivers-2">>},
                        true,false,none,
                        [{<<"x-queue-version">>,long,2},
                         {<<"x-max-length">>,long,1000000}],
                        <20639.1325.0>,
                        [<20638.572.0>],
                        [<20638.572.0>],
                        ['rabbit@pr7579-cluster-cqv2-mirror-s100-server-1.pr7579-cluster-cqv2-mirror-s100-nodes.chaos-tests',
                         'rabbit@pr7579-cluster-cqv2-mirror-s100-server-2.pr7579-cluster-cqv2-mirror-s100-nodes.chaos-tests'],
                        [{vhost,<<"/">>},
                         {name,<<"mirroring-policy">>},
                         {pattern,<<".*">>},
                         {'apply-to',<<"queues">>},
                         {definition,
                             [{<<"ha-mode">>,<<"exactly">>},
                              {<<"ha-params">>,3},
                              {<<"ha-sync-mode">>,<<"automatic">>}]},
                         {priority,0}],
                        undefined,
                        [{<20638.573.0>,<20638.572.0>},
                         {<20639.1328.0>,<20639.1325.0>}],
                        [],live,0,[],<<"/">>,
                        #{user => <<"default_user__f64vdYKZPR6xyvONuG">>},
                        rabbit_classic_queue,#{}},
                    slave,<0.568.0>]}},
           {restart_type,transient},
           {significant,true},
           {shutdown,600000},
           {child_type,worker}]
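
For context on the crash report above: {badmatch,-1} means that a pattern on the left-hand side of a match in rabbit_mirror_queue_slave:update_delta/2 did not match the value -1, i.e. a quantity that is expected to stay non-negative became -1. A hypothetical illustration of an assertion failing this way (not the actual rabbit_mirror_queue_slave source):

    %% Hypothetical example only. For Delta >= 0, min(0, Delta) is 0 and the
    %% match succeeds; for Delta = -1 it evaluates to -1, which does not match
    %% the pattern 0, so the process crashes with {badmatch,-1}.
    assert_non_negative(Delta) ->
        0 = erlang:min(0, Delta),
        Delta.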

@essen force-pushed the lh-fix-v2-mirrored-queues-again branch 3 times, most recently from 5461c2f to 0b0bc0b on March 21, 2023 09:04
@essen force-pushed the lh-fix-v2-mirrored-queues-again branch from 0b0bc0b to 8da5acf on March 23, 2023 13:48
@lhoguin changed the title from "Fix ack crashes when using CMQs with v2" to "Fix CMQ crash when master goes down, queue length at x-max-length limit with consumers connected" on Mar 23, 2023
@lhoguin marked this pull request as ready for review March 24, 2023 09:06
When a node goes down, a slave gets promoted to master. When this
happens, the new master requeues all messages pending acks. If
x-max-length is defined and the queue length after the requeue goes
over the limit, the new master starts dropping messages
immediately.

This causes issues for the other slaves because they do not requeue
their messages automatically; instead they wait for the new
master to tell them what to do. This eventually triggers an
assert because the queue lengths are unexpectedly out of sync
when the first drop message is propagated to the cluster.

This issue must have been present for a very long time,
probably since e352608.

The fix is to make the new master propagate the requeues when
it gets promoted.

To reproduce, a cluster must be started, ha-mode: all set via
policies, and perf-test started with the following arguments:

perf-test -x 1 -y 1 -r 10000 -R 50 -c 500 -s 1000 -u v2 \
    -qa x-queue-version=2,x-max-length=10000 -ad false -f persistent

Wait a little bit for the queue to have 10000+ ready messages
(ready, not total; the total will be higher), then kill the
master node (usually the first pid that 'ps -aux | grep beam'
gives you). The crashes will be logged on the slave node that
was not promoted (node 2 in my case).
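
As an illustration of the accounting race described above, here is a minimal sketch in Erlang. The module and function names are invented for this example; it is not the rabbit_mirror_queue_* code.

    %% Illustrative only: master and mirror each track a ready-queue length.
    %% On promotion the master requeues PendingAcks messages, then drops back
    %% down to the x-max-length limit and propagates each drop. If the
    %% requeues are not propagated as well (the bug), the mirror's length is
    %% out of sync with the master's when the first drop arrives.
    -module(cmq_promotion_sketch).
    -export([promote/3]).

    promote(Ready, PendingAcks, MaxLen) ->
        %% New master: requeue pending acks, then enforce the limit.
        MasterLen0 = Ready + PendingAcks,
        Drops      = max(0, MasterLen0 - MaxLen),
        MasterLen  = MasterLen0 - Drops,

        %% Mirror without the fix: it only sees the drops.
        BuggyMirror = Ready - Drops,
        %% Mirror with the fix: the requeues are propagated before the drops.
        FixedMirror = (Ready + PendingAcks) - Drops,

        #{master => MasterLen,
          mirror_without_fix => BuggyMirror,
          mirror_with_fix    => FixedMirror}.

For example, promote(10000, 500, 10000) returns mirror_without_fix => 9500 against master => 10000, the kind of length mismatch the mirror's assert guards against, while mirror_with_fix matches the master.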
@mkuratczyk force-pushed the lh-fix-v2-mirrored-queues-again branch from 8da5acf to f22029d on March 24, 2023 09:30
@michaelklishin added this to the 3.12.0 milestone Mar 24, 2023
@mkuratczyk (Contributor)
I've been chaos-testing this fix for 24 hours in two environments, which is much longer than the time-to-crash observed before. Looks like that's it. Thanks!

@mkuratczyk merged commit db1e420 into main Mar 24, 2023
@mkuratczyk deleted the lh-fix-v2-mirrored-queues-again branch March 24, 2023 10:30
mergify bot pushed a commit that referenced this pull request Mar 24, 2023
(cherry picked from commit db1e420)
mkuratczyk pushed a commit that referenced this pull request Mar 24, 2023
(cherry picked from commit db1e420)

Co-authored-by: Loïc Hoguin <[email protected]>