Relax feature flag compat check during join cluster #9729

Merged
merged 3 commits into from
Oct 1, 2024


dumbbell
Member

Why

When a node joins a cluster, we check its compatibility with the cluster, reset the node, copy the feature flags states from the remote cluster and add that node to the cluster.

However, the compatibility check is performed with the current feature flags states, even though they are about to be reset. Therefore, a node with an enabled feature flag that is unsupported by the cluster will refuse to join. This is incorrect because after the reset and the states copy, it could have joined the cluster just fine.

How

We introduce a new variant of check_node_compatibility/2 that takes an argument to indicate if the local node should be considered as a virgin node (i.e. like after a reset).

This way, the joining node will always be able to join, regardless of its initial feature flags states, as long as it doesn't require a feature flag that is unsupported by the cluster.

This also removes the need to use the $RABBITMQ_FEATURE_FLAGS environment variable to force a new node to leave stable feature flags disabled so that it can join a cluster running an older version.
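The effect of treating the local node as a virgin node can be illustrated with a small model. This is a Python sketch, not RabbitMQ's Erlang code: the function `is_compatible`, its parameters and the flag names `ff_a`, `ff_b`, `ff_c` are all hypothetical. It also assumes a simplified rule, namely that two sides are compatible when every flag enabled on one side is supported by the other, and that a reset node keeps only its required flags enabled.

```python
def is_compatible(local_supported, local_enabled, local_required,
                  remote_supported, remote_enabled, local_as_virgin=False):
    """Simplified model of a feature flag compatibility check (assumption)."""
    # A virgin (freshly reset) node is modelled as having only the flags it
    # cannot run without enabled; otherwise use its current enabled flags.
    effective_local = set(local_required) if local_as_virgin else set(local_enabled)

    # Every flag enabled locally must be supported by the cluster.
    local_ok = effective_local <= set(remote_supported)
    # Every flag enabled in the cluster must be supported locally.
    remote_ok = set(remote_enabled) <= set(local_supported)
    return local_ok and remote_ok


# Hypothetical newer node: supports and has enabled one flag the cluster lacks.
new_node = dict(
    local_supported={"ff_a", "ff_b", "ff_c"},
    local_enabled={"ff_a", "ff_b", "ff_c"},
    local_required={"ff_a"},
)
# Hypothetical older cluster that does not know about ff_c.
old_cluster = dict(
    remote_supported={"ff_a", "ff_b"},
    remote_enabled={"ff_a", "ff_b"},
)

strict = is_compatible(**new_node, **old_cluster)
relaxed = is_compatible(**new_node, **old_cluster, local_as_virgin=True)
```

In this model `strict` is `False` because `ff_c` is enabled locally but unsupported by the cluster, while `relaxed` is `True` because the reset would leave only `ff_a` enabled. If `ff_c` were in `local_required`, both calls would return `False`, which matches the remaining restriction described above.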

References #9677.

@dumbbell dumbbell added this to the 3.13.0 milestone Oct 19, 2023
@dumbbell dumbbell self-assigned this Oct 19, 2023
@dumbbell dumbbell force-pushed the relax-feature-flag-compat-check-during-join_cluster branch from c0be3ee to 783ddce Compare October 19, 2023 10:57
@dumbbell dumbbell removed this from the 3.13.0 milestone Oct 20, 2023
@dumbbell dumbbell force-pushed the relax-feature-flag-compat-check-during-join_cluster branch from 783ddce to 5870c34 Compare October 24, 2023 13:24
@dumbbell dumbbell force-pushed the relax-feature-flag-compat-check-during-join_cluster branch 2 times, most recently from 194795b to ead4a05 Compare September 24, 2024 15:50
@mergify mergify bot added the bazel label Sep 24, 2024
@dumbbell dumbbell force-pushed the relax-feature-flag-compat-check-during-join_cluster branch 3 times, most recently from c14d327 to dba4b6e Compare September 25, 2024 14:55
@dumbbell dumbbell marked this pull request as ready for review September 25, 2024 16:10
... with older RabbitMQ versions which don't know about Khepri.

[Why]
When an older node wants to join a cluster, it calls `node_info/0` and
`cluster_status_from_mnesia/0` directly using RPC calls. If it does that
against a node already using Khepri, it will get an error telling it that
Mnesia is not running. The error is reported to the end user, making it
difficult to understand the problem: both nodes are simply incompatible.

It's better to leave the final decision to the Feature flags subsystem,
but for that, `rabbit_mnesia` on the newer Khepri-based node still needs
to return something the older version can accept.

[How]
`cluster_status_from_mnesia/0` and `node_info/0` are modified to verify
if Khepri is enabled and if it is, return a value based on Khepri's
status as if it was from Mnesia.

This will let the remote older node continue all its checks and
eventually refuse to join because the Feature flags subsystem will
indicate they are incompatible.
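
The idea of answering a legacy caller with a Mnesia-shaped value can be modelled as follows. This is a Python sketch under stated assumptions, not the actual `rabbit_mnesia` code: the function name, its parameters, and the `(all_nodes, disc_nodes, running_nodes)` result shape are hypothetical, as is the choice to report every Khepri member as a disc node.

```python
def cluster_status_for_legacy_caller(khepri_enabled, khepri_members,
                                     khepri_running, read_mnesia_status):
    """Return a cluster status in the shape an older caller expects (assumed)."""
    if khepri_enabled:
        members = sorted(khepri_members)
        # Only members can be reported as running.
        running = sorted(set(khepri_running) & set(khepri_members))
        # Assumption: all Khepri members are reported as disc nodes so the
        # older node sees a well-formed answer instead of an error.
        return ("ok", (members, members, running))
    # Mnesia is still in use: defer to the regular code path.
    return read_mnesia_status()


def mnesia_not_running():
    # Stand-in for the error an older node would otherwise receive.
    return ("error", "mnesia_not_running")


status = cluster_status_for_legacy_caller(
    khepri_enabled=True,
    khepri_members={"rabbit@b", "rabbit@a"},
    khepri_running={"rabbit@a"},
    read_mnesia_status=mnesia_not_running,
)
```

With Khepri enabled, the caller receives an `ok` result it can parse, so its remaining checks run and the final decision is left to the feature flag compatibility check rather than to an opaque Mnesia error.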
…stency` is false

[Why]
`CheckNodesConsistency` is set to false when
`check_cluster_consistency()` is called as part of a node joining a
cluster. In that case, the generic compatibility check was already
executed by `rabbit_db_cluster`.

There is no need to run it again. This is even counter-productive with
the improvement to `rabbit_feature_flags:check_node_compatibility/2`
that follows.
... that considers the local node as if it was reset.

[Why]
When a node joins a cluster, we check its compatibility with the
cluster, reset the node, copy the feature flags states from the remote
cluster and add that node to the cluster.

However, the compatibility check is performed with the current feature
flags states, even though they are about to be reset. Therefore, a node
with an enabled feature flag that is unsupported by the cluster will
refuse to join. This is incorrect because after the reset and the states
copy, it could have joined the cluster just fine.

[How]
We introduce a new variant of `check_node_compatibility/2` that takes an
argument to indicate if the local node should be considered as a virgin
node (i.e. like after a reset).

This way, the joining node will always be able to join, regardless of
its initial feature flags states, as long as it doesn't require a
feature flag that is unsupported by the cluster.

This also removes the need to use the `$RABBITMQ_FEATURE_FLAGS`
environment variable to force a new node to leave stable feature flags
disabled so that it can join a cluster running an older version.

References #9677.
@dumbbell dumbbell force-pushed the relax-feature-flag-compat-check-during-join_cluster branch from dba4b6e to f69c082 Compare October 1, 2024 08:52
@dumbbell dumbbell merged commit 6855ebc into main Oct 1, 2024
440 checks passed
@dumbbell dumbbell deleted the relax-feature-flag-compat-check-during-join_cluster branch October 1, 2024 09:52
mkuratczyk added a commit to rabbitmq/rabbitmq-website that referenced this pull request Oct 1, 2024
rabbitmq/rabbitmq-server#9729
has been merged. Starting with 4.1, there's no need to
disable the new FFs when starting a new node.
@dumbbell dumbbell added this to the 4.1.0 milestone Oct 3, 2024

2 participants