
Improve task manager warning log messaging when all workers are used #54920

Closed
peterschretlen opened this issue Jan 15, 2020 · 13 comments · Fixed by #62083
Labels: Feature:Alerting, Team:ResponseOps

Comments

@peterschretlen
Contributor

From #54697 (comment)

The message from TM that gets logged when all the workers are in use seems ... not great.

[Task Ownership]: Task Manager has skipped Claiming Ownership \
of available tasks at it has ran out Available Workers. \
If this happens often, consider adjusting the \
"xpack.task_manager.max_workers" configuration
  • it's very long
  • it's annoying, and it will lead customers to update max_workers, which can cause a thundering herd issue, as SIEM has seen

I don't really have a great list of concrete alternatives, but here are some ideas:

  • have TM print some basic stats in one line, every minute, if any tasks have run in that minute; start basic - total number of executions, failures, timeouts, etc
  • have TM print some basic stats in one line, every threshold number of executions (100, 1000, not sure)

I think I like the first. The default might be 10 minutes; for dev I'd want to set it to 1 minute, or maybe even 30 seconds.
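For illustration, a rough sketch of that first idea in TypeScript. The counter names (executions, failures, timeouts), the getAndResetStats() hook, and the interval default are assumptions for the sketch, not anything Task Manager exposes today:

```ts
interface TaskStats {
  executions: number;
  failures: number;
  timeouts: number;
}

// Log a one-line summary on an interval, but only if any tasks ran since
// the last summary. getAndResetStats() is a hypothetical hook into TM.
function startStatsLogger(
  logger: { info(message: string): void },
  getAndResetStats: () => TaskStats,
  intervalMs = 10 * 60 * 1000 // default 10 minutes; 30s-1m might suit dev
) {
  return setInterval(() => {
    const stats = getAndResetStats();
    if (stats.executions === 0) {
      return; // stay quiet when nothing ran in this window
    }
    logger.info(
      `Task Manager stats: executions=${stats.executions} failures=${stats.failures} timeouts=${stats.timeouts}`
    );
  }, intervalMs);
}
```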

@peterschretlen added the Feature:Alerting and Team:ResponseOps labels Jan 15, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Jan 16, 2020

Beyond some counts and task execution time stats, I'd like to see the number of "queued" tasks. When running alerting under stress with just 10 workers, where every worker was running an action (server-log), I think the actions were getting "backed up" in the queue. I was tracking when the actions were queued to run (alert executor time) and when the action was actually run (action executor time), and was seeing delays on the scale of minutes.

So, just a count of the number of docs in the task manager saved-object store would be useful.

@pmuellr
Member

pmuellr commented Jan 16, 2020

I was thinking a route to get stats would be useful as well. At least in dev mode. :-)
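As a sketch only, a dev-mode stats route might look something like this; the path, the shape of the stats object, and the getStats() helper are assumptions, not an API that exists:

```ts
import { IRouter } from 'src/core/server'; // import path is illustrative

interface TaskManagerStats {
  executions: number;
  failures: number;
  timeouts: number;
}

// Hypothetical dev-only stats route for ad-hoc inspection.
export function registerStatsRoute(
  router: IRouter,
  getStats: () => TaskManagerStats
) {
  router.get(
    { path: '/api/task_manager/_stats', validate: false },
    async (context, request, response) => {
      return response.ok({ body: getStats() });
    }
  );
}
```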

@pmuellr
Member

pmuellr commented Jan 16, 2020

I'm guessing folks will be annoyed with messages from TM in the console on a regular interval. Seems like something we could turn into a config option, logStats: boolean, easy for devs to set in the dev config or for a customer to set for diagnostic purposes. We'd want to whitelist it for cloud - someday we'll have Kibana logs available there!
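A minimal sketch of what that config toggle could look like, assuming @kbn/config-schema; the logStats and logStatsIntervalMs names and defaults are made up for illustration and do not exist in Kibana:

```ts
import { schema } from '@kbn/config-schema';

// Sketch of a logStats toggle in the task manager config; the option names
// and defaults are assumptions, not settings that exist in Kibana.
export const configSchema = schema.object({
  max_workers: schema.number({ defaultValue: 10 }),
  logStats: schema.boolean({ defaultValue: false }),
  // how often to emit the one-line stats summary when logStats is enabled
  logStatsIntervalMs: schema.number({ defaultValue: 10 * 60 * 1000 }),
});
```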

@gmmorris
Contributor

Hmm, Task Manager doesn't have a concept of queued tasks: it asks for as many tasks as it has capacity to handle, rather than counting how many there are that it doesn't have capacity to handle.
That would require a separate query.

We could track the lead time (we do this in the perf tests), showing the average/p90 time between the scheduled runAt and the actual execution time, so you'd get a sense of how backed up tasks are.
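A rough sketch of that lead-time tracking, purely for illustration; where the samples would be recorded inside Task Manager is an assumption:

```ts
// Track "lead time": how long after its scheduled runAt a task actually starts.
const leadTimesMs: number[] = [];

function recordTaskStart(scheduledRunAt: Date, startedAt: Date = new Date()) {
  leadTimesMs.push(Math.max(0, startedAt.getTime() - scheduledRunAt.getTime()));
}

function leadTimeSummary(): { avgMs: number; p90Ms: number } {
  if (leadTimesMs.length === 0) {
    return { avgMs: 0, p90Ms: 0 };
  }
  const sorted = [...leadTimesMs].sort((a, b) => a - b);
  const avgMs = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  const p90Ms = sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.9))];
  return { avgMs, p90Ms };
}
```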

@gmmorris
Contributor

We could provide a separate CLI that displays this rather than spamming the log, like top for the Kibana Task Manager.

@pmuellr
Member

pmuellr commented Jan 16, 2020

That would require a separate query (to get all the "queued" tasks)

Ya, figured, since we don't really need that count for anything today. It's a fairly important number though; I think it would be fine to pay for a separate query for ad-hoc purposes (HTTP API), or for an hourly-or-so stats summary for logging/telemetry.
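For illustration, that separate query could be a simple count of idle task documents whose runAt is already due. This sketch assumes the default .kibana_task_manager index and the task.status / task.runAt fields of the task saved object; treat the whole query as an assumption, not something Task Manager runs today:

```ts
import { Client } from '@elastic/elasticsearch';

// Count "queued" tasks: idle task documents whose runAt is already due.
// The index name and field paths are assumptions based on the default
// task manager saved objects.
async function countQueuedTasks(client: Client): Promise<number> {
  const { body } = await client.count({
    index: '.kibana_task_manager',
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'idle' } },
            { range: { 'task.runAt': { lte: 'now' } } },
          ],
        },
      },
    },
  });
  return body.count;
}
```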

@peterschretlen
Contributor Author

++ to this, as an API to be called ad hoc, and/or an optional monitor you can turn on in config. Agree that the "queued" number lets you make an informed decision about whether max_workers is set appropriately.

If the queue spikes but then goes down, you know there are enough workers; the problem is that the distribution of the load is too spiky, so there will be delayed execution (which may be fine).

But if the queue keeps growing to the point where it approaches the number of tasks in the system, you know you need more workers.

@pmuellr
Member

pmuellr commented Jan 20, 2020

I feel like we need to do something for 7.6 here - SIEM users that use the alerts/signals feature will likely end up seeing this message, and will then update their max_workers to numbers that will cause problems. The easiest thing to do for 7.6 is to just not log the message.

@gmmorris
Contributor

Yeah, that's simple enough.
I'll do that now

@gmmorris
Contributor

PR: #55334

@securigeek

Hello,

Looking at the comments above, my understanding is that the error message "Task Manager has skipped Claiming Ownership of available tasks at it has ran out Available Workers." is just a warning and will be removed, since the suggested solution (increasing the workers) could cause problems. I have a similar error message on my Kibana; could you please advise what should be done?

Thanks in advance.
Securigeek

@pmuellr
Member

pmuellr commented Mar 31, 2020

@securigeek - this is just a diagnostic message from the Task Manager plugin. It's an indication that the Task Manager is busy running tasks, up to its max worker size, so it didn't ask for more tasks to execute. It asks for tasks on an interval, so as the tasks it's working on complete, it will start asking for more tasks to execute on subsequent rounds of the interval.

You are correct that you might be able to avoid seeing this message by increasing the max worker size, but also as you suggest ... this could cause problems.

For now, ignore the message.

I'll create a PR to disable the message.

Thanks for reporting!

@pmuellr pmuellr self-assigned this Mar 31, 2020
pmuellr added a commit to pmuellr/kibana that referenced this issue Mar 31, 2020
resolves elastic#54920

The scenario this message warns about is actually something that we should
expect to happen in a busy system - and we back off as we should.  But
there's no reason it needs to be info-logged, especially as it seems
somewhat frightening.

Changing to debug, so in case we do need this for diagnostic purposes, we
can still get it via logging config.
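For illustration, the kind of change the commit describes might look like the sketch below; the function name, message text, and call site are assumptions, not the actual Task Manager code:

```ts
// Sketch only: demote the "out of workers" message from info to debug.
type MinimalLogger = { debug(message: string): void };

function noteNoAvailableWorkers(logger: MinimalLogger, availableWorkers: number) {
  if (availableWorkers > 0) {
    return;
  }
  // Previously info-logged; debug keeps it available via logging config
  // without alarming users during normal busy-system behavior.
  logger.debug(
    'Task Manager has skipped claiming ownership of available tasks because no workers are available'
  );
}
```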
pmuellr added a commit that referenced this issue Apr 1, 2020
…ug (#62083)

pmuellr added a commit to pmuellr/kibana that referenced this issue Apr 1, 2020
…ug (elastic#62083)

pmuellr added a commit to pmuellr/kibana that referenced this issue Apr 1, 2020
…ug (elastic#62083)

pmuellr added a commit that referenced this issue Apr 1, 2020
…ug (#62083) (#62161)

pmuellr added a commit that referenced this issue Apr 1, 2020
…ug (#62083) (#62160)
@kobelb added the needs-team label Jan 31, 2022
@botelastic (bot) removed the needs-team label Jan 31, 2022