
Fix CMQ crash when master goes down, queue length at x-max-length limit with consumers connected #7579

Merged 1 commit into main from lh-fix-v2-mirrored-queues-again on Mar 24, 2023

Conversation

@lhoguin (Contributor) commented Mar 10, 2023

Please check the commit message for details.

This probably affects all released versions from the past 12 years. I have only confirmed that the issue exists in the v3.9.x, v3.10.x, v3.11.x, v3.12.x and main branches. This should be backported at least to v3.12.x so we can do further chaos testing. Whether it should also be backported to v3.11.x or earlier is an open question.

@mkuratczyk (Contributor)
I'm afraid I can still reproduce the issues on the PR branch:

errorContext: child_terminated
reason: {{badmatch,-1},
         [{rabbit_mirror_queue_slave,update_delta,2,
              [{file,"rabbit_mirror_queue_slave.erl"},{line,1064}]},
          {rabbit_mirror_queue_slave,process_instruction,2,
              [{file,"rabbit_mirror_queue_slave.erl"},{line,983}]},
          {rabbit_mirror_queue_slave,handle_cast,2,
              [{file,"rabbit_mirror_queue_slave.erl"},{line,312}]},
          {gen_server2,handle_msg,2,
              [{file,"gen_server2.erl"},{line,1067}]},
          {proc_lib,init_p_do_apply,3,
              [{file,"proc_lib.erl"},{line,240}]}]}
offender: [{pid,<0.570.0>},
           {id,rabbit_amqqueue},
           {mfargs,
               {rabbit_prequeue,start_link,
                   [{amqqueue,
                        {resource,<<"/">>,queue,<<"fivers-2">>},
                        true,false,none,
                        [{<<"x-queue-version">>,long,2},
                         {<<"x-max-length">>,long,1000000}],
                        <20639.1325.0>,
                        [<20638.572.0>],
                        [<20638.572.0>],
                        ['rabbit@pr7579-cluster-cqv2-mirror-s100-server-1.pr7579-cluster-cqv2-mirror-s100-nodes.chaos-tests',
                         'rabbit@pr7579-cluster-cqv2-mirror-s100-server-2.pr7579-cluster-cqv2-mirror-s100-nodes.chaos-tests'],
                        [{vhost,<<"/">>},
                         {name,<<"mirroring-policy">>},
                         {pattern,<<".*">>},
                         {'apply-to',<<"queues">>},
                         {definition,
                             [{<<"ha-mode">>,<<"exactly">>},
                              {<<"ha-params">>,3},
                              {<<"ha-sync-mode">>,<<"automatic">>}]},
                         {priority,0}],
                        undefined,
                        [{<20638.573.0>,<20638.572.0>},
                         {<20639.1328.0>,<20639.1325.0>}],
                        [],live,0,[],<<"/">>,
                        #{user => <<"default_user__f64vdYKZPR6xyvONuG">>},
                        rabbit_classic_queue,#{}},
                    slave,<0.568.0>]}},
           {restart_type,transient},
           {significant,true},
           {shutdown,600000},
           {child_type,worker}]
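
For context on the crash report above: {badmatch,-1} means that a pattern on the left-hand side of a match in rabbit_mirror_queue_slave:update_delta/2 did not match the value -1, i.e. a quantity that is expected to stay non-negative became -1. A hypothetical illustration of an assertion failing this way (not the actual rabbit_mirror_queue_slave source):

    %% Hypothetical example only. For Delta >= 0, min(0, Delta) is 0 and the
    %% match succeeds; for Delta = -1 it evaluates to -1, which does not match
    %% the pattern 0, so the process crashes with {badmatch,-1}.
    assert_non_negative(Delta) ->
        0 = erlang:min(0, Delta),
        Delta.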

@essen force-pushed the lh-fix-v2-mirrored-queues-again branch 3 times, most recently from 5461c2f to 0b0bc0b on March 21, 2023 09:04
@essen force-pushed the lh-fix-v2-mirrored-queues-again branch from 0b0bc0b to 8da5acf on March 23, 2023 13:48
@lhoguin changed the title from "Fix ack crashes when using CMQs with v2" to "Fix CMQ crash when master goes down, queue length at x-max-length limit with consumers connected" on Mar 23, 2023
@lhoguin marked this pull request as ready for review March 24, 2023 09:06
When a node goes down, a slave gets promoted to master. When this
happens, the new master requeues all messages pending acks. If
x-max-length is defined and the queue length after the requeue goes
over the limit, the new master starts dropping messages
immediately.

This causes issues for the other slaves because they do not requeue
their messages automatically; instead they wait for the new
master to tell them what to do. This eventually triggers an
assert because the queue lengths are unexpectedly out of sync
when the first drop message is propagated to the cluster.

This issue must have been present for a very long time,
probably since e352608.

The fix is to make the new master propagate the requeues when
it gets promoted.

To reproduce, a cluster must be started, ha-mode: all set via
policies, and perf-test started with the following arguments:

perf-test -x 1 -y 1 -r 10000 -R 50 -c 500 -s 1000 -u v2 \
    -qa x-queue-version=2,x-max-length=10000 -ad false -f persistent

Wait a little bit for the queue to have 10000+ ready messages
(ready, not total; the total will be higher), then kill the
master node (usually the first pid that 'ps -aux | grep beam'
gives you). The crashes will be logged on the slave node that
was not promoted (node 2 in my case).
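
As an illustration of the accounting race described above, here is a minimal sketch in Erlang. The module and function names are invented for this example; it is not the rabbit_mirror_queue_* code.

    %% Illustrative only: master and mirror each track a ready-queue length.
    %% On promotion the master requeues PendingAcks messages, then drops back
    %% down to the x-max-length limit and propagates each drop. If the
    %% requeues are not propagated as well (the bug), the mirror's length is
    %% out of sync with the master's when the first drop arrives.
    -module(cmq_promotion_sketch).
    -export([promote/3]).

    promote(Ready, PendingAcks, MaxLen) ->
        %% New master: requeue pending acks, then enforce the limit.
        MasterLen0 = Ready + PendingAcks,
        Drops      = max(0, MasterLen0 - MaxLen),
        MasterLen  = MasterLen0 - Drops,

        %% Mirror without the fix: it only sees the drops.
        BuggyMirror = Ready - Drops,
        %% Mirror with the fix: the requeues are propagated before the drops.
        FixedMirror = (Ready + PendingAcks) - Drops,

        #{master => MasterLen,
          mirror_without_fix => BuggyMirror,
          mirror_with_fix    => FixedMirror}.

For example, promote(10000, 500, 10000) returns mirror_without_fix => 9500 against master => 10000, the kind of length mismatch the mirror's assert guards against, while mirror_with_fix matches the master.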
@mkuratczyk force-pushed the lh-fix-v2-mirrored-queues-again branch from 8da5acf to f22029d on March 24, 2023 09:30
@michaelklishin added this to the 3.12.0 milestone Mar 24, 2023
@mkuratczyk (Contributor)
I've been chaos-testing this fix for 24 hours in two environments, which is much longer than the time-to-crash observed before. Looks like that's it. Thanks!

@mkuratczyk merged commit db1e420 into main Mar 24, 2023
@mkuratczyk deleted the lh-fix-v2-mirrored-queues-again branch March 24, 2023 10:30
mergify bot pushed a commit that referenced this pull request Mar 24, 2023
(cherry picked from commit db1e420)
mkuratczyk pushed a commit that referenced this pull request Mar 24, 2023
(cherry picked from commit db1e420)

Co-authored-by: Loïc Hoguin <[email protected]>