Remove-brokers rebalancing seems to get stuck by race condition #10631
Comments
Triaged on 03.10.2024: it needs to be investigated and fixed.
For more information, the logs linked here contain a failure on the STs related to this issue: https://dev.azure.com/cncf/strimzi/_build/results?buildId=180961&view=artifacts&pathAsName=false&type=publishedArtifacts
I investigated this issue, also in relation to the auto-rebalancing logic (where it is mostly failing in the STs above). I think the only way is to handle the error cases one by one: re-issuing the rebalance proposal request could be fine the first time, but if it returns a clearer error about what is going wrong (i.e. a broker which does not exist), the […]. What I am not sure about right now is why it is not already updating the […]. For this specific example, the […]. So I think ending in […]. @ShubhamRwt I hope the above makes sense and could be helpful to your resolution.
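To make the "handle errors case by case" idea a bit more concrete, here is a minimal Java sketch. None of these types or methods are Strimzi's real API; they are hypothetical names used only to illustrate distinguishing a transient failure (where re-issuing the proposal request once is reasonable) from a definitive error such as a broker that no longer exists (where retrying forever cannot help):

```java
// Hypothetical sketch only; these types do not exist in Strimzi.
enum RebalanceOutcome { RETRY_PROPOSAL, MARK_NOT_READY, KEEP_POLLING }

final class RebalanceErrorClassifier {

    /**
     * Decide what to do when the periodic progress check against Cruise Control fails.
     *
     * @param errorMessage  error text returned by the Cruise Control REST API
     * @param firstAttempt  whether this is the first failure seen for the current task
     */
    RebalanceOutcome classify(String errorMessage, boolean firstAttempt) {
        // Definitive error: the broker to remove does not exist any more
        // (e.g. it was already scaled down and Cruise Control was rolled).
        // Re-issuing the proposal would keep failing, so surface the error instead.
        if (errorMessage != null && errorMessage.contains("does not exist")) {
            return RebalanceOutcome.MARK_NOT_READY;
        }
        // On the first unclassified failure, re-issuing the rebalance proposal
        // request is acceptable (Cruise Control may simply have restarted).
        if (firstAttempt) {
            return RebalanceOutcome.RETRY_PROPOSAL;
        }
        // Otherwise keep the current state and check again on the next periodic poll.
        return RebalanceOutcome.KEEP_POLLING;
    }
}
```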
Thanks Paolo, yes, that makes things clearer.
When non-empty nodes are scaled down, the scale-down is blocked, as the nodes first need to be cleaned up, for example using the remove-brokers feature in Cruise Control. Once the scaled-down nodes are empty, the CO will execute the scale-down and delete them. But it seems that there is room for a race condition between the `KafkaAssemblyOperator` and the `KafkaRebalanceAssemblyOperator` (a rough sketch of this interaction follows the list):

1. The `KafkaRebalanceAssemblyOperator` marks the `KafkaRebalance` resource as `Rebalancing` and periodically (every 2 minutes) checks the progress.
2. The `KafkaAssemblyOperator` sees that the nodes are already empty and proceeds to scale down the brokers and roll Cruise Control with the new cluster configuration.
3. The `KafkaRebalanceAssemblyOperator` starts another reconciliation round, but it seems that […]
4. The `KafkaRebalanceAssemblyOperator` tries to recreate it and seems to get stuck: […]