
[Task Manager] Support for limited concurrency Task Types #54916

Closed
Tracked by #64853
gmmorris opened this issue Jan 15, 2020 · 11 comments · Fixed by #90365
Assignees: gmmorris
Labels: (Deprecated) Feature:Reporting, enhancement, Feature:Alerting, Feature:Task Manager, Team:ResponseOps

Comments

@gmmorris
Contributor

gmmorris commented Jan 15, 2020

Describe the feature:
Task Manager used to be able to limit how many concurrent instances of a specific task type run on a single Kibana instance.
We have also identified that there might be a need to limit the concurrency of specific tasks (or groups of tasks), as alert types also want to dynamically limit how many instances of a certain type can run concurrently.

Describe a specific use case for the feature:
We need to bring this feature back for Scheduled tasks and possibly others such as SIEM.

Edit / Note: There isn't currently a need to support this at the alert type level, but there definitely is at the task manager level for Reporting purposes.

@gmmorris gmmorris added the Feature:Alerting and Team:ResponseOps labels on Jan 15, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Jan 16, 2020

For actions, and I think alerts, we create a new task type per action type. It may make sense to be able to set max_workers across all actions, by being able to say "only run 10 tasks of type action:.*" or something - all the action taskTypes start with "action:". Another set of knobs and dials, but it's coarser-grained than configuring each actionType exactly, and so would be easier for customers to configure, vs having to configure every single actionType.

Alternatively, we could probably also just have one taskType for all actions, and plumb more data into it - not sure what the pros/cons are to that.
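
To illustrate the prefix idea (this is not an existing Task Manager API - the `ConcurrencyRule` shape and `limitFor` helper are hypothetical), a pattern rule like the action:.* one could resolve to a shared limit roughly like this:

```ts
// Hypothetical sketch of prefix-based concurrency limits; not Task Manager code.
interface ConcurrencyRule {
  pattern: RegExp; // e.g. /^action:/ covers every action task type
  maxWorkers: number; // shared budget for all task types the pattern matches
}

const rules: ConcurrencyRule[] = [
  // "only run 10 tasks of type action:.*"
  { pattern: /^action:/, maxWorkers: 10 },
];

// Resolve the limit for a given task type; undefined means "no specific limit".
function limitFor(taskType: string): number | undefined {
  return rules.find((rule) => rule.pattern.test(taskType))?.maxWorkers;
}

limitFor('action:.server-log'); // => 10
limitFor('alerting:example');   // => undefined
```

A single coarse rule like this would be one knob for operators, rather than one setting per actionType.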

@gmmorris gmmorris changed the title from "[Task Manager] Support for limited concurrency Task Types" to "[Discuss] [Task Manager / Alerting] Support for limited concurrency Task Types & Alert Types" on Jan 17, 2020
@gmmorris
Contributor Author

Part of the complication is in how TM claims tasks - we don't want to lose cycles where we claim 10 tasks and then drop them because we're at capacity with that specific type, but have capacity for others.
We need to see if we can find a solution that can be applied in the query within ES.
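
As a rough illustration of pushing that constraint into the claim query itself (the field names, task type, and query shape below are assumptions for the sketch, not Task Manager's actual claiming query):

```ts
// Illustrative only: exclude task types that are already at local capacity
// inside the Elasticsearch query, so we never claim documents we'd have to drop.
function buildClaimQuery(typesAtCapacity: string[]) {
  return {
    bool: {
      must: [
        { term: { 'task.status': 'idle' } },
        { range: { 'task.runAt': { lte: 'now' } } },
      ],
      must_not: typesAtCapacity.length
        ? [{ terms: { 'task.taskType': typesAtCapacity } }]
        : [],
    },
  };
}

// If this instance is already running its full quota of reporting tasks:
const query = buildClaimQuery(['report:execute']);
```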

@tsullivan
Member

Would it also be possible to use these settings to configure TM to completely disable itself from claiming a certain task type?

Maybe that could be the same as setting the allowed concurrent tasks of a type to 0.

If Reporting uses Task Manager and I have an instance that I don't want to be able to execute Reports, this setting would give me what I need.

@pmuellr
Member

pmuellr commented Jul 9, 2020

Maybe that could be the same as setting the allowed concurrent tasks of a type to 0.

That makes sense, but we will probably want an info message about this at startup, for diagnostic purposes. E.g., someone uses 0 on all instances, and then wonders why those tasks never run.
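
A minimal sketch of that startup diagnostic (the logger shape, config record, and log wording are illustrative, not a real implementation):

```ts
// Sketch: warn at startup about task types this instance will never claim.
interface Logger {
  info(message: string): void;
}

function logDisabledTaskTypes(
  logger: Logger,
  concurrencyByType: Record<string, number>
): void {
  for (const [taskType, limit] of Object.entries(concurrencyByType)) {
    if (limit === 0) {
      logger.info(
        `Task type "${taskType}" is configured with a concurrency of 0 and will never run on this Kibana instance`
      );
    }
  }
}

// e.g. an instance that should never execute reports:
logDisabledTaskTypes(console, { 'report:execute': 0 });
```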

@gmmorris
Contributor Author

Came up with a possible direction, details over here:
#71441 (comment)
#71441 (comment)

If @tsullivan & @joelgriffith feel this adequately addresses their needs and @elastic/kibana-alerting-services like the direction, then we can consider pulling this issue into the To Do list I think.

@mikecote mikecote added the enhancement label on Aug 19, 2020
@gmmorris gmmorris changed the title from "[Discuss] [Task Manager / Alerting] Support for limited concurrency Task Types & Alert Types" to "[Task Manager / Alerting] Support for limited concurrency Task Types & Alert Types" on Oct 28, 2020
@gmmorris
Contributor Author

Having discussed the issue with Alerting Services and Reporting, we've decided to go the route of adding limited support for concurrency, which will specifically support Reporting, but we won't allow other task types to utilise it for the time being, to avoid adding too many additional pollers.

We feel comfortable adding a second poller for Reporting as they'll be removing their use of the ES queue in that same version, meaning that, in effect, the same number of polls run in parallel as before.

This work will follow the path spiked over here: #74883
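
From the consumer side, registering a limited-concurrency task type could look roughly like the sketch below - the registration shape and the `maxConcurrency` field name are assumptions based on this discussion, not the final API from the spike:

```ts
// Hypothetical task definition shape with a per-instance concurrency limit.
interface TaskDefinition {
  title: string;
  maxConcurrency?: number; // omitted for ordinary, unlimited task types
  createTaskRunner: (context: { taskInstance: unknown }) => {
    run(): Promise<void>;
  };
}

const reportingTask: TaskDefinition = {
  title: 'Reporting: execute job',
  // Only one report renders at a time per Kibana instance.
  maxConcurrency: 1,
  createTaskRunner: (context) => ({
    async run() {
      // generate the report described by context.taskInstance
    },
  }),
};
```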

@pmuellr
Member

pmuellr commented Oct 28, 2020

I'm wondering if we would want to reframe this as a "one concurrent task poller", rather than something Reporting-specific. It would be for "large/expensive" tasks - Reporting today, probably more tomorrow ...

@tsullivan
Member

"one concurrent task poller"

That makes sense to me. Allow any app or service to register a "large/expensive" task definition, and the secondary poller could search for these tasks with a size of 1. Whichever large task has been waiting the longest would get singularly claimed with each poll interval. Scaling up with multiple instances of Kibana would help with keeping a backlog down. Perhaps the interval duration could be configurable if the machine has the hardware to do more work on the backlog.
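
A sketch of that secondary poller (the claim/run signatures and the 'report:execute' type are placeholders; the real design would live inside Task Manager):

```ts
// Sketch: every interval, claim at most one "large/expensive" task,
// oldest-waiting first, and run it to completion.
interface ClaimedTask {
  id: string;
  taskType: string;
}

type ClaimOldest = (taskTypes: string[]) => Promise<ClaimedTask | null>;
type RunTask = (task: ClaimedTask) => Promise<void>;

function startExpensiveTaskPoller(
  claimOldest: ClaimOldest,
  run: RunTask,
  intervalMs = 10_000 // could be configurable for machines with headroom
) {
  return setInterval(async () => {
    // "size: 1" semantics: a single claim per poll interval
    const task = await claimOldest(['report:execute']);
    if (task) {
      await run(task);
    }
  }, intervalMs);
}
```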

@tsullivan tsullivan added the (Deprecated) Feature:Reporting label on Dec 17, 2020
@mikecote mikecote changed the title from "[Task Manager / Alerting] Support for limited concurrency Task Types & Alert Types" to "[Task Manager / Alerting] Support for limited concurrency Task Types" on Jan 6, 2021
@mikecote mikecote changed the title from "[Task Manager / Alerting] Support for limited concurrency Task Types" to "[Task Manager] Support for limited concurrency Task Types" on Jan 6, 2021
@gmmorris gmmorris self-assigned this Jan 21, 2021
@gmmorris
Contributor Author

gmmorris commented Feb 1, 2021

I spent a couple of days on this last week, and came to the conclusion that the direction the POC from a few months ago took was right, but the "fork" point, where we duplicate the mechanism, was quite a bit off.

In the POC we forked at the root of the Poller - which means the entire interval mechanism was duplicated.
At the time that seemed sufficient, but I hadn't taken into account a couple of factors:

  1. The interval mechanism reactively consumes a queue of ClaimByID calls, used to support the RunNow functionality. Forking at such an early point means that this queue gets duplicated and each poller tries to fetch that task independently. (Even if each poller can only fetch tasks of its own type, this becomes problematic due to an inability to know why we didn't pick it up in that poller without adding an additional get call on that task... and a bunch of other complications...)
  2. There's a race condition you can easily hit where different pollers are competing for the same TaskPool, and syncing these pollers so that they don't clash becomes kind of spaghetti.

Since then, we've also added a variety of other mechanisms into that stream that get duplicated as a result:

  1. The reactive change of the polling interval and max workers when Elasticsearch gets overloaded
  2. The monitoring and event hooks throughout the pipeline
  3. The reactive shifting in response to version conflicts
  4. Marking unknown Task Types as "Unrecognized" - this doesn't get duplicated; rather, it breaks when a query marks a known task as "Unrecognized" because that query doesn't support that type. (This has now been addressed, but it shows another class of complexities introduced by this issue ;))

Duplicating these processes makes it much harder to reason about what's happening in TM, and overcomplicates our monitoring solution.

Once I realized that our original approach was no longer suitable, I spiked another POC and found a much better place to fork the process.
I'm now going for a "back-to-back updateByQuery stage" where we execute these queries sequentially, emitting the tasks we claim as we go. This seems to work far better, and addresses some of the coordination challenges that multiple pollers introduce, by making the surface area of the "duplicated" processes smaller (this should also make maintenance easier).
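
As a simplified sketch of that back-to-back stage (the partition and claim shapes are invented for illustration; the spike's real code differs):

```ts
// Sketch: run one claim cycle per partition of task types, strictly one after
// the other, emitting claimed tasks as each updateByQuery-style call returns.
interface ClaimResult {
  taskTypes: string[];
  docs: unknown[];
}

type ClaimPartition = (
  taskTypes: string[],
  capacity: number
) => Promise<ClaimResult>;

async function* claimBackToBack(
  partitions: Array<{ taskTypes: string[]; capacity: number }>,
  claim: ClaimPartition
): AsyncGenerator<ClaimResult> {
  for (const { taskTypes, capacity } of partitions) {
    if (capacity <= 0) continue; // a partition that's already full is skipped
    // Sequential on purpose: the next query only starts once the previous one
    // has emitted, so there's a single claiming pipeline to reason about.
    yield await claim(taskTypes, capacity);
  }
}
```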

In addition, a list of other complications we hadn't quite considered revealed themselves (these are true no matter the forking point):

  1. Running separate queries for different task types makes it possible for one task type to starve another, meaning we could end up with old tasks of one type not being picked up while newer tasks of another type get picked up simply because their query ran first. To address this I shuffle the order in which we execute these queries on each cycle (see the sketch after this list). This does mean we can sometimes claim a slightly newer task over an older one in one cycle, but that should correct itself on the next.
  2. We designed RunNow so that its calls always take precedence over scheduled tasks. Limiting the type of task a poller can pick up can break this requirement, causing our RunNow functionality (and luckily, its tests too) to become flaky. I've addressed this by allowing the ClaimById to ignore type, so the first query (no matter the type) should pick these up by their ID.
  3. Calling runNow could cause a task to be picked up even when there is no capacity in the Kibana instance. To get around this we could simply check for capacity after claiming the task, but then we'd return an error even if another Kibana could have capacity to pick it up. To address this I've changed the mechanism so that if runNow claims a task we have no capacity to run, we update it back to idle with a runAt of epoch so that it'll get picked up by the next available Kibana cycle. There is another option, which is to fetch the task before the claiming cycle to get its type, which would avoid picking up the task in the first place, but that would mean we still reject the runNow despite the possibility that another Kibana could pick it up. Additionally, it would slow down the runNow functionality, which would impact all usage of runNow, no matter the type (unnecessarily in my opinion).
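
For concreteness, here's a minimal sketch of the shuffle from item 1 and the "runAt of epoch" release from item 3 (the task shape below is an assumption, not the real task document):

```ts
// Item 1: randomize partition order each cycle so no task type's query always
// runs (and therefore claims) first.
function shuffled<T>(items: T[]): T[] {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
}

// Item 3: if runNow claimed a task this instance has no capacity to run,
// release it so any available Kibana picks it up on its very next cycle.
function releaseForImmediatePickup(task: { status: string; runAt: Date }) {
  task.status = 'idle';
  task.runAt = new Date(0); // epoch sorts before every scheduled task
  return task;
}
```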

I think we're now on the right path. I've been able to get the entire e2e test suite green on my spike (but it's hacked together so all unit tests are red 😆 ), and need to add some additional ones to test for edge cases, but I'm feeling confident about this direction.
Hopefully we'll have a PR up by end of week 🤞

@tsullivan
Member

Great read, Gidi! Thank you for the hard work going into this.

@kobelb kobelb added the needs-team label on Jan 31, 2022
@botelastic botelastic bot removed the needs-team label on Jan 31, 2022