Periodically check for unapplied policies on QQs #12412

Open
wants to merge 11 commits into
base: main
LoisSotoLopez
Contributor

Proposed Changes

As documented in #7863:

If a quorum queue is unavailable when a policy is changed it may never apply the resulting configuration command and thus be out of sync with the matching policy.

This PR adds a function to rabbit_quorum_queue.erl that checks whether a queue's current Ra machine configuration matches the configuration expected from the defined policies. Each queue process calls that function on tick (handle_tick).
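
The check described above amounts to regenerating the expected configuration from the effective policy and comparing it with what the queue is currently running. A minimal sketch of that idea using plain maps follows. The module name policy_check_sketch, the function names, the configuration keys, and the default value 20 are all hypothetical and are not the PR's actual code.

```erlang
-module(policy_check_sketch).
-export([expected_config/1, needs_update/2, on_tick/2]).

%% Derive the configuration a queue should be running from its
%% effective policy. Keys and defaults are illustrative assumptions.
expected_config(Policy) when is_map(Policy) ->
    #{max_length     => maps:get(<<"max-length">>, Policy, undefined),
      delivery_limit => maps:get(<<"delivery-limit">>, Policy, 20)}.

%% True when the running configuration differs from what the policy implies.
needs_update(Policy, CurrentConfig) ->
    expected_config(Policy) =/= CurrentConfig.

%% Called periodically: return the config to re-apply, or ok if in sync.
on_tick(Policy, CurrentConfig) ->
    case needs_update(Policy, CurrentConfig) of
        true  -> {update, expected_config(Policy)};
        false -> ok
    end.
```

Comparing whole regenerated configurations, rather than individual fields, means a newly added policy key only has to be handled in one place.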

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

LoisSotoLopez and others added 10 commits October 1, 2024 10:34
Instead of checking the values for current configuration, represented in
`rabbit_quorum_queue:handle_tick` by the `Overview` variable, against
the effective policy, just regenerate the configuration and compare with
the current configuration.

(some of this is just reverting to the original format to reduce the
diff against main)

ShouldUpdate = NewPolicyConfig =/= CurrentPolicyConfig,
case ShouldUpdate of
    true ->
        rabbit_log:debug("Re-applying policies to ~p", [amqqueue:get_name(Q)]),
Member

Log messages should use rabbit_misc:rs/1.

Member

@michaelklishin left a comment

The "maybe log" changes the internal API in ways that are difficult to justify.

@@ -1528,35 +1555,35 @@ reclaim_memory(Vhost, QueueName) ->
     ra_log_wal:force_roll_over({?RA_WAL_NAME, Node}).
 
 %%----------------------------------------------------------------------------
-dead_letter_handler(Q, Overflow) ->
+dead_letter_handler(Q, Overflow, ShouldLog) ->
Member

I strongly dislike this extra argument and how it changes existing functions. If an invalid overflow strategy is used, I don't see a problem with logging that periodically.

Contributor

Yes, I agree. I'm not sure what value they add to this PR.

Servers),

% Wait for the queue to be available again.
lists:foreach(fun(Srv) ->
Member

This is reinventing rabbit_ct_helpers:await_condition/2.
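
For readers unfamiliar with the helper mentioned here, the general pattern is to poll a boolean fun until it returns true or a deadline passes. The following is a standalone sketch of that pattern, not the implementation of rabbit_ct_helpers:await_condition/2. The module name await_sketch, the return values, and the 50 ms poll interval are assumptions made for illustration.

```erlang
-module(await_sketch).
-export([await/2]).

%% Poll CondFun until it returns true or TimeoutMs elapses.
%% Returns ok on success, {error, timeout} otherwise.
await(CondFun, TimeoutMs) when is_function(CondFun, 0), TimeoutMs >= 0 ->
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    loop(CondFun, Deadline).

loop(CondFun, Deadline) ->
    case CondFun() of
        true ->
            ok;
        _ ->
            case erlang:monotonic_time(millisecond) >= Deadline of
                true ->
                    {error, timeout};
                false ->
                    %% Back off briefly before checking again.
                    timer:sleep(50),
                    loop(CondFun, Deadline)
            end
    end.
```

Using a shared helper of this shape keeps the timeout handling in one place instead of repeating a hand-written wait loop in each test.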

end,
Consume([]).

ensure_qq_proc_dead(Config, Server, RaName) ->
Member

If the target process recovers in fewer than 500ms, this function will loop forever.

Contributor

The Ra supervisor has a max restart intensity of 2 restarts per 5 seconds (https://github.com/rabbitmq/ra/blob/main/src/ra_server_sup.erl#L36-L37), so the supervisor will give up eventually.
On the other hand, if the process restart takes more than 500 ms, this loop would stop before the process is completely dead. But I think this is highly unlikely for a test queue.
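
One way to sidestep both concerns raised in this thread is to cap the number of kill attempts, so the loop terminates regardless of how quickly the process restarts. The sketch below is an assumption about how such a helper could look, not the test code in this PR. The module name ensure_dead_sketch is hypothetical, and it only handles a locally registered process, whereas the real test presumably targets a remote node.

```erlang
-module(ensure_dead_sketch).
-export([ensure_dead/3]).

%% Kill the process registered as Name until it stays down, giving up
%% after MaxAttempts kills so a fast-restarting process cannot make
%% this loop run forever.
ensure_dead(Name, WaitMs, AttemptsLeft) when AttemptsLeft >= 0 ->
    case whereis(Name) of
        undefined ->
            ok;
        _Pid when AttemptsLeft =:= 0 ->
            {error, still_alive};
        Pid ->
            exit(Pid, kill),
            %% Give a supervisor time to restart the process, if it will.
            timer:sleep(WaitMs),
            ensure_dead(Name, WaitMs, AttemptsLeft - 1)
    end.
```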

-    rabbit_log:info("~ts: delivery_limit not set, defaulting to ~b",
-                    [rabbit_misc:rs(QName), ?DEFAULT_DELIVERY_LIMIT]),
+    maybe_log(ShouldLog, info,
+              "~ts: delivery_limit not set, defaulting to ~b",
Contributor

Dear Core Team, should this message be logged unconditionally as well? This is not a misconfiguration: if a user is happy with the default value and does not set an explicit delivery-limit, this will be logged all the time for all quorum queues.

Contributor

It's there to make clear that a default is set. We can perhaps lower it to debug in 4.1 or remove it completely, but I think making users aware of this potentially breaking change does no harm.

Member

We can move this to a function that is not called periodically and remove it from this function.

Member

In fact, we don't necessarily have to do it in rabbit_quorum_queue if there's a more suitable alternative where the delivery limit is known.

Removes the ShouldLog parameter from several functions
and limits logging of the warning about delivery_limit
not being set to queue declaration time
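
The approach in this last commit can be pictured as separating a pure lookup, which is safe to call on every tick, from the single declaration-time call site that logs. The sketch below illustrates that split under stated assumptions. The module name delivery_limit_sketch, the function names, the use of OTP's logger instead of rabbit_log, and the default value 20 are all hypothetical and not taken from the PR.

```erlang
-module(delivery_limit_sketch).
-export([delivery_limit/1, on_declare/2, on_tick/1]).

%% Illustrative value only; the real default lives in RabbitMQ.
-define(DEFAULT_DELIVERY_LIMIT, 20).

%% Pure lookup: returns {Limit, Source} and never logs, so it is safe
%% to call from a periodic tick.
delivery_limit(Policy) ->
    case maps:find(<<"delivery-limit">>, Policy) of
        {ok, Limit} -> {Limit, configured};
        error       -> {?DEFAULT_DELIVERY_LIMIT, default}
    end.

%% Declaration path: the only place that reports the defaulting.
on_declare(QName, Policy) ->
    case delivery_limit(Policy) of
        {Limit, default} ->
            logger:info("~ts: delivery_limit not set, defaulting to ~b",
                        [QName, Limit]),
            Limit;
        {Limit, configured} ->
            Limit
    end.

%% Tick path: same value, no log output.
on_tick(Policy) ->
    {Limit, _Source} = delivery_limit(Policy),
    Limit.
```

With this split, no extra logging flag has to be threaded through existing function signatures, which was the objection raised in review.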