[Bug]: Error getting status of rebalance task via /user_tasks endpoint results in "NotReady" state #10704

Open
tinaselenge opened this issue Oct 11, 2024 · 3 comments · May be fixed by #10768

@tinaselenge
Contributor

tinaselenge commented Oct 11, 2024

Bug Description

When getting the status of a rebalance task via /user_tasks, Cruise Control can return a 500 with an error such as:

2024-10-09 15:44:34 ERROR KafkaCruiseControlRequestHandler:88 - Error processing GET request '/user_tasks' due to: 'There are already 5 active user tasks, which has reached the servlet capacity.'.
java.lang.RuntimeException: There are already 5 active user tasks, which has reached the servlet capacity.

This has nothing to do with the actual rebalance task itself, which may still be in progress. It seems to be a failure to create the new user task that is needed to get the status. When one of the existing user tasks completes, it is removed from the active user task list, e.g.:

2024-10-09 15:44:36 INFO  UserTaskManager:349 - UserTask 7e280130-47d2-4940-99da-f57f117c3f26 is completed and removed from active tasks list
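
To make the mechanism described above concrete, here is a toy model of a capacity-limited list of active user tasks. It is only a sketch and not Cruise Control's actual UserTaskManager: the class ToyUserTaskRegistry and its methods are hypothetical, and it assumes (as this issue suggests) that a status query also needs a free slot.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.UUID;

// Toy model only: this is NOT Cruise Control's UserTaskManager.
// All names here are hypothetical and exist purely for illustration.
final class ToyUserTaskRegistry {
    // Plays the role of the max.active.user.tasks limit.
    private final int maxActiveTasks;
    private final Set<UUID> activeTasks = new LinkedHashSet<>();

    ToyUserTaskRegistry(int maxActiveTasks) {
        this.maxActiveTasks = maxActiveTasks;
    }

    // Assumption: every request, including a status query, needs a free slot.
    UUID register() {
        if (activeTasks.size() >= maxActiveTasks) {
            throw new RuntimeException("There are already " + activeTasks.size()
                    + " active user tasks, which has reached the servlet capacity.");
        }
        UUID id = UUID.randomUUID();
        activeTasks.add(id);
        return id;
    }

    // A completed task is removed from the active list, which frees its slot.
    void complete(UUID id) {
        activeTasks.remove(id);
    }

    int activeCount() {
        return activeTasks.size();
    }
}
```

In this model the rejection says nothing about the tasks already running. It only means that no slot was free at the moment of the request, and the same request succeeds once any active task completes.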

Once an existing task has completed and been removed, we should be able to send a request to /user_tasks without hitting a 500. Since this failure does not reflect the actual status of the rebalance task we are querying, I don't think it makes sense for it to result in "NotReady" for the KafkaRebalance. We should probably retry the endpoint in the next reconciliation instead.

Steps to reproduce

Create a KafkaRebalance CR for removing/adding brokers with auto-approval set, and then immediately apply the refresh annotation to create a new rebalance task. The failure is intermittent and depends on how quickly tasks complete.

Expected behavior

No response

Strimzi version

main

Kubernetes version

1.29

Installation method

No response

Infrastructure

No response

Configuration files and logs

No response

Additional context

No response

@ppatierno
Member

That's interesting because, as stated in #10701, we see errors coming from CC being ignored rather than reported by moving the KR to the NotReady state.

@ppatierno
Member

Ignore last comment ;-) We were wrong.

That said, good catch @tinaselenge. I think it could be made easily reproducible by lowering max.active.user.tasks when configuring Cruise Control in the Kafka custom resource. Its default value is 5, which is exactly what you have.

@ppatierno
Member

Triaged on 17/10/2024: agreed to fix this, at least by not moving the KafkaRebalance straight to the NotReady state when it happens, but instead treating the next reconciliation(s) as retries. @tinaselenge is going to take a look at it. Thanks Tina!
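
The agreed direction could be sketched roughly as follows. This is not the actual fix and not Strimzi code: UserTasksRetryPolicy, its Outcome values, the idea of matching on the error message text, and the retry budget are all assumptions made for illustration.

```java
// Hypothetical sketch: none of these names exist in Strimzi or Cruise Control.
final class UserTasksRetryPolicy {
    enum Outcome { USE_RESPONSE, RETRY_NEXT_RECONCILIATION, NOT_READY }

    // Assumption: the capacity error is recognised by the message seen in the log above.
    private static final String CAPACITY_MARKER = "reached the servlet capacity";

    // Assumed retry budget, counted in consecutive reconciliations.
    private final int maxRetries;
    private int consecutiveFailures = 0;

    UserTasksRetryPolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Decides what a reconciliation should do with a /user_tasks response.
    Outcome onResponse(int httpStatus, String errorMessage) {
        if (httpStatus == 200) {
            consecutiveFailures = 0;
            return Outcome.USE_RESPONSE;
        }
        boolean capacityError = httpStatus == 500
                && errorMessage != null
                && errorMessage.contains(CAPACITY_MARKER);
        if (capacityError && consecutiveFailures < maxRetries) {
            // Says nothing about the rebalance itself, so wait for the next reconciliation.
            consecutiveFailures++;
            return Outcome.RETRY_NEXT_RECONCILIATION;
        }
        // Any other error, or an exhausted budget, is still surfaced as NotReady.
        return Outcome.NOT_READY;
    }
}
```

The budget is there so that a Cruise Control instance that stays saturated is eventually reported instead of being retried forever. Whether the real fix needs such a limit is an open design choice.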
