[Bug]: Error getting status of rebalance task via /user_tasks endpoint results in "NotReady" state #10704

tinaselenge · 2024-10-11T16:07:47Z

Bug Description

When getting the status of a rebalance task via /user_tasks, the request can return a 500 with an error such as:

2024-10-09 15:44:34 ERROR KafkaCruiseControlRequestHandler:88 - Error processing GET request '/user_tasks' due to: 'There are already 5 active user tasks, which has reached the servlet capacity.'.
java.lang.RuntimeException: There are already 5 active user tasks, which has reached the servlet capacity.

This has nothing to do with the actual rebalance task, which may still be in progress. It appears to be a failure to create a new user task for the status request. When one of the existing user tasks completes, it is removed from the active user task list, e.g.:

2024-10-09 15:44:36 INFO  UserTaskManager:349 - UserTask 7e280130-47d2-4940-99da-f57f117c3f26 is completed and removed from active tasks list

Once an existing task has completed and been removed, we should be able to send a request to /user_tasks without hitting a 500. Since this failure does not reflect the actual status of the rebalance task we are querying, I don't think it makes sense for the KafkaRebalance to move to "NotReady". We should probably retry the endpoint in the next reconciliation.
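To illustrate the distinction described above, here is a minimal sketch of how a caller might tell this capacity error apart from a real rebalance failure. The class and method names are hypothetical and are not part of Strimzi or Cruise Control; the only thing taken from the report is the message fragment in the log line. Matching on message text is an assumption made for illustration and may be brittle if the wording changes.

```java
/** Hypothetical helper for illustration only; not a Strimzi or Cruise Control class. */
final class UserTasksErrorClassifier {
    // Message fragment copied from the Cruise Control log line quoted in this issue.
    private static final String CAPACITY_MARKER = "which has reached the servlet capacity";

    private UserTasksErrorClassifier() { }

    /**
     * Returns true when a failed /user_tasks call looks like the transient
     * "too many active user tasks" condition rather than a problem with the
     * rebalance task itself. Assumes the caller has the HTTP status code and
     * the error message from the response.
     */
    static boolean isTransientCapacityError(int httpStatus, String errorMessage) {
        return httpStatus == 500
                && errorMessage != null
                && errorMessage.contains(CAPACITY_MARKER);
    }

    public static void main(String[] args) {
        String capacityMessage =
                "There are already 5 active user tasks, which has reached the servlet capacity.";
        // A capacity error: a candidate for retrying in the next reconciliation.
        System.out.println(isTransientCapacityError(500, capacityMessage));
        // Any other 500: not treated as transient by this sketch.
        System.out.println(isTransientCapacityError(500, "Unrelated failure"));
    }
}
```

With a check like this, a status poll that fails for capacity reasons could leave the KafkaRebalance state unchanged and simply try again later, while other errors keep their current handling.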

Steps to reproduce

Create a KafkaRebalance CR for removing/adding brokers with auto-approval set, and then immediately apply the refresh annotation to create a new rebalance task. The failure is intermittent and depends on how quickly tasks complete.

Expected behavior

No response

Strimzi version

main

Kubernetes version

1.29

Installation method

No response

Infrastructure

No response

Configuration files and logs

No response

Additional context

No response


ppatierno · 2024-10-14T07:46:38Z

That's interesting because, as stated in #10701, we see errors coming from CC being ignored and not reported as the KR moving to the NotReady state.

ppatierno · 2024-10-15T10:40:38Z

Ignore last comment ;-) We were wrong.

That said, good catch @tinaselenge. I think it could be made easily reproducible by lowering max.active.user.tasks when configuring Cruise Control in the Kafka custom resource. Its value is 5, which is exactly what you have.

ppatierno · 2024-10-17T08:44:34Z

Triaged on 17/10/2024: agreed to fix this, at least by not moving the KafkaRebalance straight to the NotReady state when it happens, and instead treating the next reconciliation(s) as retries. @tinaselenge is going to take a look at it. Thanks Tina!
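As a rough sketch of the "next reconciliation(s) as retries" idea from the triage note, the decision could be modeled as below. Everything here is hypothetical: the class name, the outcome names, and in particular the bounded retry limit are assumptions for illustration and do not describe what the actual fix does.

```java
/** Hypothetical sketch of a per-rebalance retry decision; names and limit are illustrative. */
final class StatusPollDecision {
    enum Outcome { KEEP_STATE_AND_RETRY, MOVE_TO_NOT_READY }

    private final int maxRetries;
    private int consecutiveTransientFailures = 0;

    StatusPollDecision(int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        this.maxRetries = maxRetries;
    }

    /**
     * Called when a status poll fails during a reconciliation.
     * Non-transient errors fail immediately; transient ones are tolerated
     * for up to maxRetries consecutive reconciliations.
     */
    Outcome onFailure(boolean transientError) {
        if (!transientError) {
            return Outcome.MOVE_TO_NOT_READY;
        }
        consecutiveTransientFailures++;
        return consecutiveTransientFailures <= maxRetries
                ? Outcome.KEEP_STATE_AND_RETRY
                : Outcome.MOVE_TO_NOT_READY;
    }

    /** Called when a status poll succeeds; resets the transient failure counter. */
    void onSuccess() {
        consecutiveTransientFailures = 0;
    }
}
```

Bounding the retries is one possible design choice: it keeps a Cruise Control instance that is permanently at capacity from hiding the problem forever, while still absorbing the short-lived failures described in this issue.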

tinaselenge added bug needs-triage labels Oct 11, 2024

ppatierno removed the needs-triage label Oct 17, 2024

ppatierno assigned tinaselenge Oct 17, 2024

tinaselenge linked a pull request Oct 28, 2024 that will close this issue

Handle the failure due to reaching the servlet capacity when getting user tasks #10768

Open

8 tasks
