Remove-brokers rebalancing seems to get stuck by race condition #10631

Closed
scholzj opened this issue Sep 23, 2024 · 4 comments · Fixed by #10717

@scholzj
Member

scholzj commented Sep 23, 2024

When non-empty nodes are scaled down, the scale-down is blocked because the nodes first need to be cleaned up, for example using the remove-brokers feature in Cruise Control. Once the scaled-down nodes are empty, the CO executes the scale-down and deletes them. But there seems to be room for a race condition between the KafkaAssemblyOperator and the KafkaRebalanceAssemblyOperator:

  • The remove-brokers rebalance is ongoing; the KafkaRebalanceAssemblyOperator marks the KafkaRebalance resource as Rebalancing and periodically (every 2 minutes) checks the progress
  • The KafkaAssemblyOperator sees that the nodes are already empty and proceeds to scale down the brokers and roll Cruise Control with the new cluster configuration
  • Later (after CC is rolled), the KafkaRebalanceAssemblyOperator starts another reconciliation round. But it seems that:
    • Cruise Control no longer accepts the request and throws an exception:
      com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: java.lang.IllegalArgumentException: Broker 14 does not exist.
      
    • The KafkaRebalanceAssemblyOperator tries to generate a new proposal and seems to get stuck:
      colog | grep "#313(timer)"
      2024-09-23 21:13:37 INFO  AbstractOperator:266 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): KafkaRebalance my-cluster-auto-rebalancing-remove-brokers will be checked for creation or modification
      2024-09-23 21:13:37 INFO  KafkaRebalanceAssemblyOperator:317 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Rebalance action is performed and KafkaRebalance resource is currently in [Rebalancing] state
      2024-09-23 21:13:37 INFO  KafkaRebalanceAssemblyOperator:854 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Getting Cruise Control rebalance user task status
      2024-09-23 21:13:37 WARN  KafkaRebalanceAssemblyOperator:863 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): User task 670c1383-aa04-4979-8cc6-41fe9f69efce not found, going to generate a new proposal
      2024-09-23 21:13:37 INFO  KafkaRebalanceAssemblyOperator:1113 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Requesting Cruise Control rebalance [dryrun=true]
      2024-09-23 21:14:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
      2024-09-23 21:15:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
      2024-09-23 21:16:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
      2024-09-23 21:17:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
      2024-09-23 21:18:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
      2024-09-23 21:19:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
      2024-09-23 21:20:37 INFO  AbstractOperator:401 - Reconciliation #313(timer) KafkaRebalance(myproject/my-cluster-auto-rebalancing-remove-brokers): Reconciliation is in progress
      
@ppatierno
Member

Triaged on 03.10.2024: it needs to be investigated and fixed.

@ppatierno
Member

For more information, the logs here contain a failure in the STs related to this issue: https://dev.azure.com/cncf/strimzi/_build/results?buildId=180961&view=artifacts&pathAsName=false&type=publishedArtifacts

@ppatierno
Member

I investigated this issue, including how it relates to the auto-rebalancing logic (which is where it mostly fails in the STs above).
After CC is rolled because the brokers are finally scaled down (the rebalancing was done), the KafkaRebalanceAssemblyOperator asks for the task status (in order to update the KafkaRebalance resource to Ready, since the rebalancing is done), but the Cruise Control JSON response is empty (CC was restarted and has no memory of the previously running tasks).
By default, we then re-issue a new rebalance proposal request, which doesn’t work in all scenarios (e.g. remove_brokers, when the brokers to remove no longer exist).

https://github.com/strimzi/strimzi-kafka-operator/blob/main/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java#L766

I think the only way is to handle errors case by case: re-issuing the rebalance proposal request could be OK the first time, but if it returns a clear error about what is going wrong (e.g. a broker that does not exist), the KafkaRebalance should be moved to the NotReady state with the error message. We already handle these specific errors in the Cruise Control API implementation class, but we use that handling only in tests, not in a real use case like this.

https://github.com/strimzi/strimzi-kafka-operator/blob/main/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/CruiseControlApiImpl.java#L224
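
As a rough illustration of the case-by-case handling described above, the error returned by the re-issued proposal request could be checked along these lines. This is a minimal sketch, not the operator's actual code: the class, enum, and method names are hypothetical, and only the error text is taken from the exception quoted in this issue.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RebalanceErrorTriage {

    /** Hypothetical outcome of a failed proposal request; not a Strimzi type. */
    enum Action { RETRY_LATER, MARK_NOT_READY }

    // Matches the message seen in the Cruise Control exception quoted in this issue.
    private static final Pattern MISSING_BROKER =
            Pattern.compile("Broker (\\d+) does not exist");

    /** Returns the broker ID named in a "does not exist" error, if there is one. */
    static Optional<Integer> missingBroker(String errorMessage) {
        if (errorMessage == null) {
            return Optional.empty();
        }
        Matcher m = MISSING_BROKER.matcher(errorMessage);
        return m.find() ? Optional.of(Integer.parseInt(m.group(1))) : Optional.empty();
    }

    /**
     * A missing broker cannot be fixed by retrying, so the resource would go to
     * NotReady with the error message; any other error is left to be retried.
     */
    static Action triage(String errorMessage) {
        return missingBroker(errorMessage).isPresent()
                ? Action.MARK_NOT_READY
                : Action.RETRY_LATER;
    }

    public static void main(String[] args) {
        String error = "com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: "
                + "java.lang.IllegalArgumentException: Broker 14 does not exist.";
        System.out.println(triage(error) + " broker=" + missingBroker(error).orElse(-1));
        System.out.println(triage("Connection refused"));
    }
}
```

Matching on the message text is fragile and is used here only to keep the sketch self-contained; a real fix would presumably rely on whatever structured error the Cruise Control client already exposes.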

What I am not sure about right now is why it is not already updating the KafkaRebalance with the error from the newly issued rebalance request.

For this specific example, the NotReady state could look wrong because in the end the rebalancing did happen; CC just restarted and lost its memory of it. But it seems the only way to go. Also, for auto-rebalancing, if the KafkaRebalance ends up in the NotReady state, it is automatically deleted by the reconciler, which is what we want.

So I think that by ending in the NotReady state, even during a manual rebalancing, the user can figure out that the brokers were scaled down: the error in the KafkaRebalance reports that the brokers don’t exist anymore, so they can understand that the rebalancing was done anyway and delete the resource.

@ShubhamRwt I hope the above makes sense and could be helpful to your resolution.

@ShubhamRwt
Contributor

Thanks Paolo, yes it makes things more clear.
