Periodically check for unapplied policies on QQs #12412

LoisSotoLopez · 2024-10-01T08:46:33Z

Proposed Changes

As documented in #7863:

If a quorum queue is unavailable when a policy is changed it may never apply the resulting configuration command and thus be out of sync with the matching policy.

This PR provides a function in rabbit_quorum_queue.erl that checks whether the current Ra machine configuration for a queue matches the configuration expected from the defined policies. Each queue process calls that function on every tick (`handle_tick`).
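A minimal sketch of the tick-time check, assuming the machine configuration can be treated as a map and that only a fixed list of policy-derived keys needs comparing. The module and function names are hypothetical and this is not the code in rabbit_quorum_queue.erl; the expected configuration is passed in rather than regenerated from policies.

```erlang
-module(qq_policy_check_sketch).
-export([needs_reapply/3, on_tick/4]).

%% Hypothetical helper: compare only the keys derived from policies; any other
%% entries in the configuration maps are ignored.
-spec needs_reapply(map(), map(), [atom()]) -> boolean().
needs_reapply(CurrentConfig, ExpectedConfig, PolicyKeys) ->
    maps:with(PolicyKeys, CurrentConfig) =/= maps:with(PolicyKeys, ExpectedConfig).

%% Hypothetical tick hook: if the running configuration has drifted from the
%% expected one, call ApplyFun with the expected configuration and report
%% 'reapplied'; otherwise do nothing and return ok.
-spec on_tick(map(), map(), [atom()], fun((map()) -> term())) -> ok | reapplied.
on_tick(CurrentConfig, ExpectedConfig, PolicyKeys, ApplyFun) ->
    case needs_reapply(CurrentConfig, ExpectedConfig, PolicyKeys) of
        true ->
            _ = ApplyFun(ExpectedConfig),
            reapplied;
        false ->
            ok
    end.
```

Comparing a filtered subset keeps unrelated differences from triggering a re-apply on every tick; which keys belong in that subset is an assumption here.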

Types of Changes

Bug fix (non-breaking change which fixes issue #NNNN)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause an observable behavior change in existing systems)
Documentation improvements (corrections, new content, etc)
Cosmetic change (whitespace, formatting, etc)
Build system and/or CI

Checklist

I have read the CONTRIBUTING.md document
I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
I have added tests that prove my fix is effective or that my feature works
All tests pass locally with my changes
If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments


Instead of checking the values of the current configuration (represented in `rabbit_quorum_queue:handle_tick` by the `Overview` variable) against the effective policy, just regenerate the configuration and compare it with the current configuration.

(some of this is just reverting to the original format to reduce the diff against main)

michaelklishin · 2024-10-01T20:01:55Z

deps/rabbit/src/rabbit_quorum_queue.erl

+    ShouldUpdate = NewPolicyConfig =/= CurrentPolicyConfig,
+    case ShouldUpdate of
+        true ->
+            rabbit_log:debug("Re-applying policies to ~p", [amqqueue:get_name(Q)]),


Log messages should use rabbit_misc:rs/1.

michaelklishin

The "maybe log" changes the internal API in ways that are difficult to justify.

michaelklishin · 2024-10-01T20:04:24Z

deps/rabbit/src/rabbit_quorum_queue.erl

@@ -1528,35 +1555,35 @@ reclaim_memory(Vhost, QueueName) ->
    ra_log_wal:force_roll_over({?RA_WAL_NAME, Node}).

 %%----------------------------------------------------------------------------
-dead_letter_handler(Q, Overflow) ->
+dead_letter_handler(Q, Overflow, ShouldLog) ->


I strongly dislike this extra argument and how it changes existing functions. If an invalid overflow strategy is used, I don't see a problem with logging that periodically.

Yes, I agree. I'm not sure what value they add to this PR.

michaelklishin · 2024-10-01T20:07:00Z

deps/rabbit/test/quorum_queue_SUITE.erl

+        Servers),
+
+    % Wait for the queue to be available again. 
+    lists:foreach(fun(Srv) ->


This is reinventing rabbit_ct_helpers:await_condition/2.

michaelklishin · 2024-10-01T20:08:31Z

deps/rabbit/test/quorum_queue_SUITE.erl

+    end,
+    Consume([]).
+
+ensure_qq_proc_dead(Config, Server, RaName) ->


If the target process recovers in less than 500 ms, this function will loop forever.

The ra supervisor has a maximum restart intensity of 2 restarts per 5 seconds (https://github.com/rabbitmq/ra/blob/main/src/ra_server_sup.erl#L36-L37), so the supervisor will eventually give up.
On the other hand, if the process restart takes more than 500 ms, this loop would stop before the process is permanently dead. But I think this is highly unlikely for a test queue.
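Both failure modes go away if the wait is bounded by an attempt count rather than by a single timing assumption. The sketch below is a self-contained, hypothetical helper (`bounded_wait_sketch:wait_until/3`) that only illustrates the bounded-polling pattern; in the suite itself, `rabbit_ct_helpers:await_condition/2` serves this purpose.

```erlang
-module(bounded_wait_sketch).
-export([wait_until/3]).

%% Hypothetical helper: polls Condition every IntervalMs milliseconds, at most
%% Attempts times. Returns ok as soon as Condition() returns true, or
%% {error, timeout} once the attempts are used up, so it cannot spin forever.
-spec wait_until(fun(() -> boolean()), non_neg_integer(), non_neg_integer()) ->
          ok | {error, timeout}.
wait_until(_Condition, 0, _IntervalMs) ->
    {error, timeout};
wait_until(Condition, Attempts, IntervalMs) when Attempts > 0 ->
    case Condition() of
        true ->
            ok;
        false ->
            timer:sleep(IntervalMs),
            wait_until(Condition, Attempts - 1, IntervalMs)
    end.
```

Assuming `Server` is the node name, a condition such as `fun() -> rpc:call(Server, erlang, whereis, [RaName]) =:= undefined end` would check that the Ra server is no longer registered on that node.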

gomoripeti · 2024-10-03T14:54:10Z

deps/rabbit/src/rabbit_quorum_queue.erl

-                            rabbit_log:info("~ts: delivery_limit not set, defaulting to ~b",
-                                             [rabbit_misc:rs(QName), ?DEFAULT_DELIVERY_LIMIT]),
+                            maybe_log(ShouldLog, info,
+                                      "~ts: delivery_limit not set, defaulting to ~b",


Dear Core Team, should this message be logged unconditionally as well? This is not a misconfiguration: if a user is happy with the default value and does not set an explicit delivery-limit, this will be logged all the time for all quorum queues.

It's there to make it clear that a default is set. We can perhaps lower it to debug in 4.1 or remove it completely, but I think making users aware of this potentially breaking change doesn't hurt.

We can move this to a function that is not called periodically and remove it from this function.

In fact, we don't necessarily have to do it in rabbit_quorum_queue if there's a more suitable alternative where the delivery limit is known.

Removes the ShouldLog parameter from several functions and limits logging of the warning about delivery_limit not being set to queue declaration time.
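One way to separate the two paths is to keep the resolution of the limit pure and log only where the queue is declared. This is a hypothetical sketch: the module and function names are invented, the default is passed as an argument instead of `?DEFAULT_DELIVERY_LIMIT`, and OTP's `logger:info/2` stands in for `rabbit_log:info` to keep it self-contained.

```erlang
-module(delivery_limit_sketch).
-export([resolve/2, resolve_on_declare/3]).

%% Pure resolution, safe to call on the periodic (tick) path: it never logs,
%% so the default-value notice is not repeated for every queue on every tick.
-spec resolve(undefined | integer(), integer()) -> integer().
resolve(undefined, Default) -> Default;
resolve(Limit, _Default) -> Limit.

%% Declaration path: same resolution, plus a one-off notice when the
%% default is used.
-spec resolve_on_declare(string() | binary(), undefined | integer(), integer()) -> integer().
resolve_on_declare(QueueName, Configured, Default) ->
    Limit = resolve(Configured, Default),
    case Configured of
        undefined ->
            logger:info("~ts: delivery_limit not set, defaulting to ~b",
                        [QueueName, Default]);
        _ ->
            ok
    end,
    Limit.
```

Because the tick path only calls `resolve/2`, periodic re-checks produce no log output, while declaring a queue without an explicit limit still surfaces the default once.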

LoisSotoLopez and others added 10 commits October 1, 2024 10:34

Add QQ periodic policy repair

eaa1891

Add test for QQ policy repair feature

3ea5d65

Use ra_machine_config but limit keys to check

512c2ac

Refactoring suggestion

47d0494


Move tests to main qq SUITE & refactor a bit

239ccca

Consider QQs may let pass 1st overflowing msg

0e4fea7

Use local function for ensuring qq proc dead

e061f54

Use wait_for_messages_ready

a6d59ba

Simplify publish_confirm_many

3bb47e2

michaelklishin reviewed Oct 1, 2024

michaelklishin requested changes Oct 1, 2024

michaelklishin reviewed Oct 1, 2024

gomoripeti reviewed Oct 3, 2024

Remove ShouldLog & limit deliv. limit not set logg

0ccce0b



LoisSotoLopez commented Oct 1, 2024

michaelklishin Oct 1, 2024

michaelklishin left a comment

michaelklishin Oct 1, 2024

kjnilsson Oct 3, 2024

michaelklishin Oct 1, 2024

michaelklishin Oct 1, 2024

gomoripeti Oct 3, 2024

gomoripeti Oct 3, 2024

kjnilsson Oct 3, 2024

michaelklishin Oct 3, 2024

michaelklishin Oct 3, 2024
