-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Relax feature flag compat check during join cluster #9729
Merged
dumbbell
merged 3 commits into
main
from
relax-feature-flag-compat-check-during-join_cluster
Oct 1, 2024
Merged
Relax feature flag compat check during join cluster #9729
dumbbell
merged 3 commits into
main
from
relax-feature-flag-compat-check-during-join_cluster
Oct 1, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
dumbbell
force-pushed
the
relax-feature-flag-compat-check-during-join_cluster
branch
from
October 19, 2023 10:57
c0be3ee
to
783ddce
Compare
9 tasks
dumbbell
force-pushed
the
relax-feature-flag-compat-check-during-join_cluster
branch
from
October 24, 2023 13:24
783ddce
to
5870c34
Compare
dumbbell
force-pushed
the
relax-feature-flag-compat-check-during-join_cluster
branch
2 times, most recently
from
September 24, 2024 15:50
194795b
to
ead4a05
Compare
dumbbell
force-pushed
the
relax-feature-flag-compat-check-during-join_cluster
branch
3 times, most recently
from
September 25, 2024 14:55
c14d327
to
dba4b6e
Compare
... with older RabbitMQ versions which don't know about Khepri. [Why] When an older node wants to join a cluster, it calls `node_info/0` and `cluster_status_from_mnesia/0` directly using RPC calls. If it does that against a node already using Khepri, t will get an error telling it that Mnesia is not running. The error is reported to the end user, making it difficult to understand the problem: both nodes are simply incompatible. It's better to leave the final decision to the Feature flags subsystem, but for that, `rabbit_mnesia` on the newer Khepri-based node still needs to return something the older version can accept. [How] `cluster_status_from_mnesia/0` and `node_info/0` are modified to verify if Khepri is enabled and if it is, return a value based on Khepri's status as if it was from Mnesia. This will let the remote older node to continue all its checks and eventually refuse to join because the Feature flags subsystem will indicate they are incompatible.
…stency` is false [Why] `CheckNodesConsistency` is set to false when the `check_cluster_consistency()` is called as part of a node joining a cluster. And the generic compatibility check was already executed by `rabbit_db_cluster`. There is no need to run it again. This is even counter-productive with the improvement to `rabbit_feature_flags:check_node_compatibility/2` that follows.
... that considers the local node as if it was reset. [Why] When a node joins a cluster, we check its compatibility with the cluster, reset the node, copy the feature flags states from the remote cluster and add that node to the cluster. However, the compatibility check is performed with the current feature flags states, even though they are about to be reset. Therefore, a node with an enabled feature flag that is unsupported by the cluster will refuse to join. It's incorrect because after the reset and the states copy, it could have join the cluster just fine. [How] We introduce a new variant of `check_node_compatibility/2` that takes an argument to indicate if the local node should be considered as a virgin node (i.e. like after a reset). This way, the joining node will always be able to join, regardless of its initial feature flags states, as long as it doesn't require a feature flag that is unsupported by the cluster. This also removes the need to use `$RABBITMQ_FEATURE_FLAGS` environment variable to force a new node to leave stable feature flags disabled to allow it to join a cluster running an older version. References #9677.
dumbbell
force-pushed
the
relax-feature-flag-compat-check-during-join_cluster
branch
from
October 1, 2024 08:52
dba4b6e
to
f69c082
Compare
mkuratczyk
approved these changes
Oct 1, 2024
dumbbell
deleted the
relax-feature-flag-compat-check-during-join_cluster
branch
October 1, 2024 09:52
mkuratczyk
added a commit
to rabbitmq/rabbitmq-website
that referenced
this pull request
Oct 1, 2024
rabbitmq/rabbitmq-server#9729 has been merged. Starting with 4.1, there's no need to disable the new FFs when starting a new node.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Why
When a node joins a cluster, we check its compatibility with the cluster, reset the node, copy the feature flags states from the remote cluster and add that node to the cluster.
However, the compatibility check is performed with the current feature flags states, even though they are about to be reset. Therefore, a node with an enabled feature flag that is unsupported by the cluster will refuse to join. It's incorrect because after the reset and the states copy, it could have join the cluster just fine.
How
We introduce a new variant of
check_node_compatibility/2
that takes an argument to indicate if the local node should be considered as a virgin node (i.e. like after a reset).This way, the joining node will always be able to join, regardless of its initial feature flags states, as long as it doesn't require a feature flag that is unsupported by the cluster.
This also removes the need to use
$RABBITMQ_FEATURE_FLAGS
environment variable to force a new node to leave stable feature flags disabled to allow it to join a cluster running an older version.References #9677.