
[Task Manager] Support for limited concurrency Task Types #54916

Closed
Tracked by #64853
gmmorris opened this issue Jan 15, 2020 · 11 comments · Fixed by #90365
Assignees: gmmorris
Labels: (Deprecated) Feature:Reporting, enhancement, Feature:Alerting, Feature:Task Manager, Team:ResponseOps

Comments

@gmmorris
Contributor

gmmorris commented Jan 15, 2020

Describe the feature:
Task Manager used to be able to limit how many concurrent instances of a specific task type run on a single Kibana instance.
We have also identified that there might be a need to limit the concurrency of specific tasks (or groups of tasks), as alert types also want to dynamically limit how many instances of a certain type can run concurrently.

Describe a specific use case for the feature:
We need to bring this feature back for Scheduled tasks and possibly others such as SIEM.

Edit / Note: There isn't currently a need to support this at the alert type level, but there definitely is at the task manager level for Reporting purposes.

@gmmorris gmmorris added the Feature:Alerting and Team:ResponseOps labels on Jan 15, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Jan 16, 2020

For actions, and I think alerts, we create a new task type per action type. It may make sense to be able to set max_workers across all actions, by being able to say "only run 10 tasks of type action:.*" or something - all the action taskTypes start with "action:". Another set of knobs and dials, but it's coarser-grained than configuring each actionType exactly, and so would be easier for customers to configure, vs having to configure every single actionType.

Alternatively, we could probably also just have one taskType for all actions, and plumb more data into it - not sure what the pros/cons are to that.
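
To illustrate the prefix idea (this is not an existing Task Manager API - the `ConcurrencyRule` shape and `limitFor` helper are hypothetical), a pattern rule like the action:.* one could resolve to a shared limit roughly like this:

```ts
// Hypothetical sketch of prefix-based concurrency limits; not Task Manager code.
interface ConcurrencyRule {
  pattern: RegExp; // e.g. /^action:/ covers every action task type
  maxWorkers: number; // shared budget for all task types the pattern matches
}

const rules: ConcurrencyRule[] = [
  // "only run 10 tasks of type action:.*"
  { pattern: /^action:/, maxWorkers: 10 },
];

// Resolve the limit for a given task type; undefined means "no specific limit".
function limitFor(taskType: string): number | undefined {
  return rules.find((rule) => rule.pattern.test(taskType))?.maxWorkers;
}

limitFor('action:.server-log'); // => 10
limitFor('alerting:example');   // => undefined
```

A single coarse rule like this would be one knob for operators, rather than one setting per actionType.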

@gmmorris gmmorris changed the title from "[Task Manager] Support for limited concurrency Task Types" to "[Discuss] [Task Manager / Alerting] Support for limited concurrency Task Types & Alert Types" on Jan 17, 2020
@gmmorris
Contributor Author

Part of the complication is in how TM claims tasks - we don't want to lose cycles where we claim 10 tasks and then drop them because we're at capacity with that specific type, but have capacity for others.
We need to see if we can find a solution that can be applied in the query within ES.
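
As a rough illustration of pushing that constraint into the claim query itself (the field names, task type, and query shape below are assumptions for the sketch, not Task Manager's actual claiming query):

```ts
// Illustrative only: exclude task types that are already at local capacity
// inside the Elasticsearch query, so we never claim documents we'd have to drop.
function buildClaimQuery(typesAtCapacity: string[]) {
  return {
    bool: {
      must: [
        { term: { 'task.status': 'idle' } },
        { range: { 'task.runAt': { lte: 'now' } } },
      ],
      must_not: typesAtCapacity.length
        ? [{ terms: { 'task.taskType': typesAtCapacity } }]
        : [],
    },
  };
}

// If this instance is already running its full quota of reporting tasks:
const query = buildClaimQuery(['report:execute']);
```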

@tsullivan
Member

Would it also be possible to use these settings to configure TM to completely disable itself from claiming a certain task type?

Maybe that could be the same as setting the allowed concurrent tasks of a type to 0.

If Reporting uses Task Manager and I have an instance that I don't want to be able to execute Reports, this setting would give me what I need.

@pmuellr
Member

pmuellr commented Jul 9, 2020

Maybe that could be the same as setting the allowed concurrent tasks of a type to 0.

That makes sense, but we will probably want an info message about this at startup, for diagnostic purposes. E.g., someone uses 0 on all instances, and then wonders why those tasks never run.
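
A minimal sketch of that startup diagnostic (the logger shape, config record, and log wording are illustrative, not a real implementation):

```ts
// Sketch: warn at startup about task types this instance will never claim.
interface Logger {
  info(message: string): void;
}

function logDisabledTaskTypes(
  logger: Logger,
  concurrencyByType: Record<string, number>
): void {
  for (const [taskType, limit] of Object.entries(concurrencyByType)) {
    if (limit === 0) {
      logger.info(
        `Task type "${taskType}" is configured with a concurrency of 0 and will never run on this Kibana instance`
      );
    }
  }
}

// e.g. an instance that should never execute reports:
logDisabledTaskTypes(console, { 'report:execute': 0 });
```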

@gmmorris
Contributor Author

Came up with a possible direction, details over here:
#71441 (comment)
#71441 (comment)

If @tsullivan & @joelgriffith feel this adequately addresses their needs and @elastic/kibana-alerting-services like the direction, then we can consider pulling this issue into the To Do list I think.

@mikecote mikecote added the enhancement label on Aug 19, 2020
@gmmorris gmmorris changed the title from "[Discuss] [Task Manager / Alerting] Support for limited concurrency Task Types & Alert Types" to "[Task Manager / Alerting] Support for limited concurrency Task Types & Alert Types" on Oct 28, 2020
@gmmorris
Contributor Author

Having discussed the issue with Alerting Services and Reporting, we've decided to go the route of adding limited support for concurrency, which will specifically support Reporting, but we won't allow other task types to utilise it for the time being, to avoid adding too many additional pollers.

We feel comfortable adding a second poller for Reporting as they'll be removing their use of the ES queue in that same version, meaning that, in effect, the same number of polls run in parallel as before.

This work will follow the path spiked over here: #74883
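
From the consumer side, registering a limited-concurrency task type could look roughly like the sketch below - the registration shape and the `maxConcurrency` field name are assumptions based on this discussion, not the final API from the spike:

```ts
// Hypothetical task definition shape with a per-instance concurrency limit.
interface TaskDefinition {
  title: string;
  maxConcurrency?: number; // omitted for ordinary, unlimited task types
  createTaskRunner: (context: { taskInstance: unknown }) => {
    run(): Promise<void>;
  };
}

const reportingTask: TaskDefinition = {
  title: 'Reporting: execute job',
  // Only one report renders at a time per Kibana instance.
  maxConcurrency: 1,
  createTaskRunner: (context) => ({
    async run() {
      // generate the report described by context.taskInstance
    },
  }),
};
```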

@pmuellr
Member

pmuellr commented Oct 28, 2020

I'm wondering if we would want to reframe this as a "one concurrent task poller", rather than something Reporting-specific. It would be for "large/expensive" tasks - Reporting today, probably more tomorrow ...

@tsullivan
Member

"one concurrent task poller"

That makes sense to me. Allow any app or service to register a "large/expensive" task definition, and the secondary poller could search for these tasks with a size of 1. Whichever large task has been waiting the longest would get singularly claimed with each poll interval. Scaling up with multiple instances of Kibana would help with keeping a backlog down. Perhaps the interval duration could be configurable if the machine has the hardware to do more work on the backlog.
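
A sketch of that secondary poller (the claim/run signatures and the 'report:execute' type are placeholders; the real design would live inside Task Manager):

```ts
// Sketch: every interval, claim at most one "large/expensive" task,
// oldest-waiting first, and run it to completion.
interface ClaimedTask {
  id: string;
  taskType: string;
}

type ClaimOldest = (taskTypes: string[]) => Promise<ClaimedTask | null>;
type RunTask = (task: ClaimedTask) => Promise<void>;

function startExpensiveTaskPoller(
  claimOldest: ClaimOldest,
  run: RunTask,
  intervalMs = 10_000 // could be configurable for machines with headroom
) {
  return setInterval(async () => {
    // "size: 1" semantics: a single claim per poll interval
    const task = await claimOldest(['report:execute']);
    if (task) {
      await run(task);
    }
  }, intervalMs);
}
```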

@tsullivan tsullivan added the (Deprecated) Feature:Reporting label on Dec 17, 2020
@mikecote mikecote changed the title from "[Task Manager / Alerting] Support for limited concurrency Task Types & Alert Types" to "[Task Manager / Alerting] Support for limited concurrency Task Types" on Jan 6, 2021
@mikecote mikecote changed the title from "[Task Manager / Alerting] Support for limited concurrency Task Types" to "[Task Manager] Support for limited concurrency Task Types" on Jan 6, 2021
@gmmorris gmmorris self-assigned this Jan 21, 2021
@gmmorris
Contributor Author

gmmorris commented Feb 1, 2021

I spent a couple of days on this last week, and came to the conclusion that the direction the POC from a few months ago took was right, but the "fork" point, where we duplicate the mechanism, was quite a bit off.

In the POC we forked at the root of the Poller - which means the entire interval mechanism was duplicated.
At the time that seemed sufficient, but I hadn't taken into account a couple of factors:

  1. The interval mechanism reactively consumes a queue of ClaimByID calls, used to support the RunNow functionality. Forking at such an early point means that this queue gets duplicated and each poller tries to fetch that task independently. (Even if each poller can only fetch tasks of its own type, this becomes problematic due to an inability to know why we didn't pick it up in that poller without adding an additional get call on that task... and a bunch of other complications...)
  2. There's a race condition you can easily hit where different pollers are competing for the same TaskPool, and syncing these pollers so that they don't clash becomes kind of spaghetti.

Since then, we've also added a variety of other mechanisms into that stream that get duplicated as a result:

  1. The reactive change of the polling interval and max workers when Elasticsearch gets overloaded
  2. The monitoring and event hooks throughout the pipeline
  3. The reactive shifting in response to version conflicts
  4. Marking unknown Task Types as "Unrecognized" - this doesn't get duplicated; rather, it breaks when a query marks a known task as "Unrecognized" because that query doesn't support that type. (This has now been addressed, but it shows another class of complexities introduced by this issue ;))

Duplicating these processes makes it much harder to reason about what's happening in TM, and overcomplicates our monitoring solution.

Once I realized that our original approach was no longer suitable, I spiked another POC and found a much better place to fork the process.
I'm now going for a "back-to-back updateByQuery stage" where we execute these queries sequentially, emitting the tasks we claim as we go. This seems to work far better, and addresses some of the coordination challenges that multiple pollers introduce, by making the surface area of the "duplicated" processes smaller (this should also make maintenance easier).
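
As a simplified sketch of that back-to-back stage (the partition and claim shapes are invented for illustration; the spike's real code differs):

```ts
// Sketch: run one claim cycle per partition of task types, strictly one after
// the other, emitting claimed tasks as each updateByQuery-style call returns.
interface ClaimResult {
  taskTypes: string[];
  docs: unknown[];
}

type ClaimPartition = (
  taskTypes: string[],
  capacity: number
) => Promise<ClaimResult>;

async function* claimBackToBack(
  partitions: Array<{ taskTypes: string[]; capacity: number }>,
  claim: ClaimPartition
): AsyncGenerator<ClaimResult> {
  for (const { taskTypes, capacity } of partitions) {
    if (capacity <= 0) continue; // a partition that's already full is skipped
    // Sequential on purpose: the next query only starts once the previous one
    // has emitted, so there's a single claiming pipeline to reason about.
    yield await claim(taskTypes, capacity);
  }
}
```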

In addition, a list of other complications we hadn't quite considered revealed themselves (these are true no matter the forking point):

  1. Running separate queries for different task types makes it possible for one task type to starve another, meaning we could end up with old tasks of one type not being picked up while newer tasks of another type get picked up simply because their query ran first. To address this I shuffle the order in which we execute these queries on each cycle (see the sketch after this list). This does mean we can sometimes claim a slightly newer task over an older one in one cycle, but that should correct itself on the next.
  2. We designed RunNow so that its calls always take precedence over scheduled tasks. Limiting the type of task a poller can pick up can break this requirement, causing our RunNow functionality (and luckily, its tests too) to become flaky. I've addressed this by allowing the ClaimById to ignore type, so the first query (no matter the type) should pick these up by their ID.
  3. Calling runNow could cause a task to be picked up even when there is no capacity in the Kibana instance. To get around this we could simply check for capacity after claiming the task, but then we'd return an error even if another Kibana could have capacity to pick it up. To address this I've changed the mechanism so that if runNow claims a task we have no capacity to run, we update it back to idle with a runAt of epoch so that it'll get picked up by the next available Kibana cycle. There is another option, which is to fetch the task before the claiming cycle to get its type, which would avoid picking up the task in the first place, but that would mean we still reject the runNow despite the possibility that another Kibana could pick it up. Additionally, it would slow down the runNow functionality, which would impact all usage of runNow, no matter the type (unnecessarily in my opinion).
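
For concreteness, here's a minimal sketch of the shuffle from item 1 and the "runAt of epoch" release from item 3 (the task shape below is an assumption, not the real task document):

```ts
// Item 1: randomize partition order each cycle so no task type's query always
// runs (and therefore claims) first.
function shuffled<T>(items: T[]): T[] {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
}

// Item 3: if runNow claimed a task this instance has no capacity to run,
// release it so any available Kibana picks it up on its very next cycle.
function releaseForImmediatePickup(task: { status: string; runAt: Date }) {
  task.status = 'idle';
  task.runAt = new Date(0); // epoch sorts before every scheduled task
  return task;
}
```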

I think we're now on the right path. I've been able to get the entire e2e test suite green on my spike (but it's hacked together so all unit tests are red 😆 ), and need to add some additional ones to test for edge cases, but I'm feeling confident about this direction.
Hopefully we'll have a PR up by end of week 🤞

@tsullivan
Member

Great read, Gidi! Thank you for the hard work going into this.

@kobelb kobelb added the needs-team label on Jan 31, 2022
@botelastic botelastic bot removed the needs-team label on Jan 31, 2022