
Improve task manager warning log messaging when all workers are used #54920

Closed
peterschretlen opened this issue Jan 15, 2020 · 13 comments · Fixed by #62083
Labels: Feature:Alerting, Team:ResponseOps

Comments

@peterschretlen
Contributor

From #54697 (comment)

The message from TM that gets logged when all the workers are in use seems ... not great.

[Task Ownership]: Task Manager has skipped Claiming Ownership \
of available tasks at it has ran out Available Workers. \
If this happens often, consider adjusting the \
"xpack.task_manager.max_workers" configuration
  • it's very long
  • it's annoying, and it will lead customers to update max_workers, which can cause a thundering herd issue, as SIEM has seen

I don't really have a great list of concrete alternatives, but here are some ideas:

  • have TM print some basic stats in one line, every minute, if any tasks have run in that minute; start basic - total number of executions, failures, timeouts, etc
  • have TM print some basic stats in one line, every threshold number of executions (100, 1000, not sure)

I think I like the first. The default might be 10 minutes; for dev I'd want to set it to 1 minute, or maybe even 30 seconds.
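For illustration, a rough sketch of that first idea in TypeScript. The counter names (executions, failures, timeouts), the getAndResetStats() hook, and the interval default are assumptions for the sketch, not anything Task Manager exposes today:

```ts
interface TaskStats {
  executions: number;
  failures: number;
  timeouts: number;
}

// Log a one-line summary on an interval, but only if any tasks ran since
// the last summary. getAndResetStats() is a hypothetical hook into TM.
function startStatsLogger(
  logger: { info(message: string): void },
  getAndResetStats: () => TaskStats,
  intervalMs = 10 * 60 * 1000 // default 10 minutes; 30s-1m might suit dev
) {
  return setInterval(() => {
    const stats = getAndResetStats();
    if (stats.executions === 0) {
      return; // stay quiet when nothing ran in this window
    }
    logger.info(
      `Task Manager stats: executions=${stats.executions} failures=${stats.failures} timeouts=${stats.timeouts}`
    );
  }, intervalMs);
}
```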

@peterschretlen added the Feature:Alerting and Team:ResponseOps labels Jan 15, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Jan 16, 2020

Beyond some counts and task execution time stats, I'd like to see the number of "queued" tasks. When running alerting under stress with just 10 workers, where every worker was running an action (server-log), I think the actions were getting "backed up" in the queue. I was tracking when the actions were queued to run (alert executor time) and when the action was actually run (action executor time), and was seeing delays on the scale of minutes.

So, just a count of the number of docs in the task manager saved-object store would be useful.

@pmuellr
Member

pmuellr commented Jan 16, 2020

I was thinking a route to get stats would be useful as well. At least in dev mode. :-)
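As a sketch only, a dev-mode stats route might look something like this; the path, the shape of the stats object, and the getStats() helper are assumptions, not an API that exists:

```ts
import { IRouter } from 'src/core/server'; // import path is illustrative

interface TaskManagerStats {
  executions: number;
  failures: number;
  timeouts: number;
}

// Hypothetical dev-only stats route for ad-hoc inspection.
export function registerStatsRoute(
  router: IRouter,
  getStats: () => TaskManagerStats
) {
  router.get(
    { path: '/api/task_manager/_stats', validate: false },
    async (context, request, response) => {
      return response.ok({ body: getStats() });
    }
  );
}
```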

@pmuellr
Member

pmuellr commented Jan 16, 2020

I'm guessing folks will be annoyed with messages from TM in the console on a regular interval. Seems like something we could turn into a config option, logStats: boolean, easy for devs to set in the dev config or for a customer to set for diagnostic purposes. We'd want to whitelist it for cloud - someday we'll have Kibana logs available there!
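A minimal sketch of what that config toggle could look like, assuming @kbn/config-schema; the logStats and logStatsIntervalMs names and defaults are made up for illustration and do not exist in Kibana:

```ts
import { schema } from '@kbn/config-schema';

// Sketch of a logStats toggle in the task manager config; the option names
// and defaults are assumptions, not settings that exist in Kibana.
export const configSchema = schema.object({
  max_workers: schema.number({ defaultValue: 10 }),
  logStats: schema.boolean({ defaultValue: false }),
  // how often to emit the one-line stats summary when logStats is enabled
  logStatsIntervalMs: schema.number({ defaultValue: 10 * 60 * 1000 }),
});
```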

@gmmorris
Contributor

Hmm, Task Manager doesn't have a concept of queued tasks: it asks for as many tasks as it has capacity to handle, rather than counting how many there are that it doesn't have capacity to handle.
That would require a separate query.

We could track the lead time (we do this in the perf tests), showing the average/p90 time between the scheduled runAt and the actual execution time, so you'd get a sense of how backed up tasks are.
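A rough sketch of that lead-time tracking, purely for illustration; where the samples would be recorded inside Task Manager is an assumption:

```ts
// Track "lead time": how long after its scheduled runAt a task actually starts.
const leadTimesMs: number[] = [];

function recordTaskStart(scheduledRunAt: Date, startedAt: Date = new Date()) {
  leadTimesMs.push(Math.max(0, startedAt.getTime() - scheduledRunAt.getTime()));
}

function leadTimeSummary(): { avgMs: number; p90Ms: number } {
  if (leadTimesMs.length === 0) {
    return { avgMs: 0, p90Ms: 0 };
  }
  const sorted = [...leadTimesMs].sort((a, b) => a - b);
  const avgMs = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  const p90Ms = sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.9))];
  return { avgMs, p90Ms };
}
```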

@gmmorris
Contributor

We could provide a separate CLI that displays this rather than spamming the log, like top for the Kibana Task Manager.

@pmuellr
Member

pmuellr commented Jan 16, 2020

That would require a separate query (to get all the "queued" tasks)

Ya, figured, since we don't really need that count for anything today. It's a fairly important number though; I think it would be fine to pay for a separate query for ad-hoc purposes (HTTP API), or for an hourly-or-so stats summary for logging/telemetry.
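For illustration, that separate query could be a simple count of idle task documents whose runAt is already due. This sketch assumes the default .kibana_task_manager index and the task.status / task.runAt fields of the task saved object; treat the whole query as an assumption, not something Task Manager runs today:

```ts
import { Client } from '@elastic/elasticsearch';

// Count "queued" tasks: idle task documents whose runAt is already due.
// The index name and field paths are assumptions based on the default
// task manager saved objects.
async function countQueuedTasks(client: Client): Promise<number> {
  const { body } = await client.count({
    index: '.kibana_task_manager',
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'idle' } },
            { range: { 'task.runAt': { lte: 'now' } } },
          ],
        },
      },
    },
  });
  return body.count;
}
```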

@peterschretlen
Contributor Author

++ to this, as an API to be called ad hoc, and/or an optional monitor you can turn on in config. Agree that the "queued" number lets you make an informed decision about whether max_workers is set appropriately.

If the queue spikes but then goes down, you know there are enough workers; the problem is that the distribution of the load is too spiky, so there will be delayed execution (which may be fine).

But if the queue keeps growing to the point where it approaches the number of tasks in the system, you know you need more workers.

@pmuellr
Member

pmuellr commented Jan 20, 2020

I feel like we need to do something for 7.6 here - SIEM users that use the alerts/signals feature will likely end up seeing this message, and will then update their max_workers to numbers that will cause problems. The easiest thing to do for 7.6 is to just not log the message.

@gmmorris
Contributor

Yeah, that's simple enough.
I'll do that now

@gmmorris
Contributor

PR: #55334

@securigeek

Hello,

Looking at the comments above, my understanding is that the error message "Task Manager has skipped Claiming Ownership of available tasks at it has ran out Available Workers." is just a warning and will be removed, since the suggested solution (increasing the workers) could cause problems. I have a similar error message on my Kibana; could you please advise what should be done?

Thanks in advance.
Securigeek

@pmuellr
Member

pmuellr commented Mar 31, 2020

@securigeek - this is just a diagnostic message from the Task Manager plugin. It's an indication that the Task Manager is busy running tasks, up to its max worker size, so it didn't ask for more tasks to execute. It asks for tasks on an interval, so as the tasks it's working on complete, it will start asking for more tasks to execute on subsequent rounds of the interval.

You are correct that you might be able to avoid seeing this message by increasing the max worker size, but also as you suggest ... this could cause problems.

For now, ignore the message.

I'll create a PR to disable the message.

Thanks for reporting!

@pmuellr pmuellr self-assigned this Mar 31, 2020
pmuellr added a commit to pmuellr/kibana that referenced this issue Mar 31, 2020
resolves elastic#54920

The scenario this message warns about is actually something that we should
expect to happen in a busy system - and we back off as we should.  But
there's no reason it needs to be info-logged, especially as it seems
somewhat frightening.

Changing to debug, so in case we do need this for diagnostic purposes, we
can still get it via logging config.
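For illustration, the kind of change the commit describes might look like the sketch below; the function name, message text, and call site are assumptions, not the actual Task Manager code:

```ts
// Sketch only: demote the "out of workers" message from info to debug.
type MinimalLogger = { debug(message: string): void };

function noteNoAvailableWorkers(logger: MinimalLogger, availableWorkers: number) {
  if (availableWorkers > 0) {
    return;
  }
  // Previously info-logged; debug keeps it available via logging config
  // without alarming users during normal busy-system behavior.
  logger.debug(
    'Task Manager has skipped claiming ownership of available tasks because no workers are available'
  );
}
```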
pmuellr added a commit that referenced this issue Apr 1, 2020
…ug (#62083)

pmuellr added a commit to pmuellr/kibana that referenced this issue Apr 1, 2020
…ug (elastic#62083)

pmuellr added a commit to pmuellr/kibana that referenced this issue Apr 1, 2020
…ug (elastic#62083)

pmuellr added a commit that referenced this issue Apr 1, 2020
…ug (#62083) (#62161)

pmuellr added a commit that referenced this issue Apr 1, 2020
…ug (#62083) (#62160)
@kobelb added the needs-team label Jan 31, 2022
@botelastic (bot) removed the needs-team label Jan 31, 2022