
[Discuss] Should we stagger requests to Elasticsearch when Alerts clump up? #54697

Closed
gmmorris opened this issue Jan 14, 2020 · 23 comments
Labels
Feature:Alerting, Team:ResponseOps, Team: SecuritySolution, Team:SIEM, technical debt

Comments

@gmmorris
Contributor

gmmorris commented Jan 14, 2020

Following some testing by SIEM (@rylnd and @FrankHassanabad) we've encountered an issue where ES couldn't keep up with the requests made by the Alerting service.

@FrankHassanabad's description of the situation:

The interaction between alerting and task manager that we experienced was: if we bulk-enable our 277 rules immediately and set max_workers to a value of, say, 300, then the rules all run immediately and we get 300 searches racing towards ES, which can then cause 429 Too Many Requests errors or miscellaneous issues such as timeouts.
We can stagger turning on the rules on our end when we get a bulk request, if we need to, to avoid the 429 issues. However, the other effect we experienced was that if you turn on all 277 rules slowly over time, everything runs fine because their executions are slightly spread out. But if you shut down your single Kibana instance for longer than the interval (5 minutes in our case), then the next time you restart Kibana with max_workers set to a high number like 300, it will execute all 277 as fast as it can and cause 429 Too Many Requests errors, timeouts, etc. It will then repeat this pattern every 5 minutes on the dot, since they're all scheduled on a 5-minute interval, and cause repeated 429s or other miscellaneous issues because everything surges at the same time.
Basically, the rules all "clump" together if you shut down for more than 5 minutes, rather than being spread out a bit on restart after X amount of time.

Questioning the validity of setting a max_workers that's higher than what the ES cluster can handle, @FrankHassanabad mentions that a 429 Too Many Requests response might not actually mean that the cluster can't handle that many open requests, but rather that they have been throttled by ES to prevent a spike of calls flooding the system all at once (rapid fire). If that's the case, then staggering the requests made by Kibana could address this.
I'm not familiar enough with ES to advise on that, perhaps someone else knows?

Open questions:

  1. Could 429s be reduced by throttling how many concurrent requests Alerting allows Alert Executors to make?
  2. Usually you'd address this kind of thing using circuit breakers on the Kibana side, essentially just marking the alert as failed so it retries when capacity is restored. Is this a direction we should take? Should this be handled at the Kibana level, or in Alerting specifically?
  3. How would we want to document this in order to address it at a system level? This is the kind of problem we should expect to come up in many large deployments, so we should probably document suggested deployment patterns that can handle it (both for customers and for Cloud).
@gmmorris gmmorris added the Feature:Alerting, Team:ResponseOps, and Team:SIEM labels Jan 14, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@elasticmachine
Contributor

Pinging @elastic/siem (Team:SIEM)

@peterschretlen
Contributor

peterschretlen commented Jan 14, 2020

max_workers is effectively a throttle, no? It determines how many tasks can run concurrently.

if the bulk of the task load is (mostly) querying Elasticsearch and (occasionally) indexing signal data, then I think a good starting point would be to set max_workers to match the throughput of the cluster.

A rough measure for something like query throughput is the default search thread pool size for nodes on a cluster, which is

int((# of cpu on a node * 3) / 2) + 1

So if you have 32-CPU nodes, you'll have 49 threads for searches.
A value like 40 might be good for max_workers in that case, leaving some room for other queries. That assumes only queries are being done; I believe indexing is happening as well.
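For illustration, here is a minimal sketch of that sizing math in TypeScript (the helper names and the 0.8 headroom factor are assumptions for the sketch, not anything Kibana or Task Manager exposes):

// Back-of-the-envelope sizing sketch mirroring the formula quoted above.
function defaultSearchThreads(cpusPerNode: number): number {
  // ES default search thread pool size: int((allocated processors * 3) / 2) + 1
  return Math.floor((cpusPerNode * 3) / 2) + 1;
}

function suggestedMaxWorkers(cpusPerNode: number, headroom = 0.8): number {
  // Leave some threads free for other queries and for indexing.
  return Math.floor(defaultSearchThreads(cpusPerNode) * headroom);
}

console.log(defaultSearchThreads(32)); // 49
console.log(suggestedMaxWorkers(32));  // 39 -- i.e. "a value like 40"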

429s/503s will be due to the throughput of the cluster: Once all the threads in a pool are used on elasticsearch, the requests are queued. If the active requests take a long time, the queue will fill up and start rejecting, which you'll see as 429 or 503 depending on the API.

This assumes all tasks put load on Elasticsearch, which is not true (actions, for instance, won't place much load, with the exception of indexing actions). At some point it may be worth having a different pool of workers for tasks that hit Elasticsearch.

@pmuellr
Member

pmuellr commented Jan 14, 2020

Some unorganized notes/thoughts:

  • max_workers is the maximum number of concurrent workers for a single Kibana instance to handle alert and action executors (and in the future other things like reporting). When scaling Kibana horizontally (more server instances), you'll get (# of Kibana instances) * max_workers total workers across the Kibana "cluster".

  • The README.md talks about a numWorkers property that you could perhaps use to do some per-type throttling of picking workers to run, but it appears to be vestigial - IIRC we removed support for it at some point in time. That's fine, because it didn't really make sense the way it was described; it would be better to have a per-type max_workers (per Kibana instance)

  • Node.js itself should be fine handling the details of running 1000's of concurrent workers, but it appears obvious (now) that if those workers are handling 1000's of ES calls, we're going to kill ES.

  • ES returning a 429 is fantastic, but let's find out what it really means. I'm guessing ES has determined it's under a high load and shouldn't take on more queries to run. We obviously want to be running just under the "429 limit", figuring out how to do that will be interesting.

  • Applying throttling in the callCluster we provide to executors seems like it makes sense

  • There's also poll_interval, which indicates how often TM looks for new work, set to 3 seconds. IIRC, there is a limit to how low you can get this to run today with the current TM architecture. Some of the alternative architectures (in PRs we never merged) let you get the interval lower. I don't remember the numbers, but I think I've seen sub-second intervals with the alt-architecture PRs, while the current architecture can only get to about 2 seconds.

  • Will cron scheduling help with the staggering? You could imagine SIEM "managing" that by perhaps picking random offsets for the seconds value in a cron expression. Or perhaps we could even provide or force a random second offset for minute- and hourly- based cron expressions.
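To illustrate the random-offset idea above, a hypothetical helper (not an existing alerting API; assumes a 6-field cron syntax that includes a seconds column):

// Build a minute-based cron expression with a random seconds offset so
// rules enabled in bulk don't all fire at second :00.
function staggeredEveryNMinutes(minutes: number): string {
  const second = Math.floor(Math.random() * 60);
  // field order: seconds minutes hours day-of-month month day-of-week
  return `${second} */${minutes} * * * *`;
}

// e.g. "37 */5 * * * *" -- every 5 minutes, at second 37
console.log(staggeredEveryNMinutes(5));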

@gmmorris
Contributor Author

  • The README.md talks about a numWorkers property that you could perhaps use to do some per-type throttling of picking workers to run, but it appears to be vestigial - IIRC we removed support for it at some point in time. That's fine, because it didn't really make sense the way it was described; it would be better to have a per-type max_workers (per Kibana instance)

This is indeed vestigial, but it's coming back in 7.7 :)

@pmuellr
Member

pmuellr commented Jan 14, 2020

The message from TM that gets logged when all the workers are in use seems ... not great.

[Task Ownership]: Task Manager has skipped Claiming Ownership \
of available tasks at it has ran out Available Workers. \
If this happens often, consider adjusting the \
"xpack.task_manager.max_workers" configuration
  • it's very long
  • it's annoying, and so that will lead the customer to update max_workers, which will cause a thundering herd issue as SIEM has seen
  • I don't really have a great list of concrete alternatives, but will discuss some ideas below

ideas:

  • have TM print some basic stats in one line, every minute, if any tasks have run in that minute; start basic - total number of executions, failures, timeouts, etc
  • have TM print some basic stats in one line, every threshold number of executions (100, 1000, not sure)

I think I like the first. Default might be 10 minutes, for dev I'd want to set it to 1 minute, or maybe even 30 seconds.
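A minimal sketch of what that one-line-per-interval stats logging could look like (the counter names, interval, and logger shape are all assumptions, not actual Task Manager internals):

// Accumulate counters and flush a single summary line per interval,
// only if any tasks ran during that interval.
type TaskStats = { executions: number; failures: number; timeouts: number };

class TaskStatsLogger {
  private stats: TaskStats = { executions: 0, failures: 0, timeouts: 0 };

  constructor(private log: (msg: string) => void, intervalMs = 60_000) {
    setInterval(() => this.flush(), intervalMs);
  }

  record(outcome: keyof TaskStats) {
    this.stats[outcome]++;
  }

  private flush() {
    const { executions, failures, timeouts } = this.stats;
    if (executions === 0) return; // nothing ran, stay quiet
    this.log(`[TaskManager] last interval: ${executions} executions, ${failures} failures, ${timeouts} timeouts`);
    this.stats = { executions: 0, failures: 0, timeouts: 0 };
  }
}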

@pmuellr
Member

pmuellr commented Jan 14, 2020

I have a stress test for alerts that uses some other plugins/tools I've built for heartbeats that makes essql calls - in a gist here: https://gist.github.com/pmuellr/29042ba4fd58f8e4c088d3b0a703da2e

One of the interesting things about running this with 100 alerts at a 1-second interval with 10 workers is that the action executes about 30 seconds after it is scheduled to run by the alert. With 1000 alerts at a 1-second interval, the delay gets up to 2.5 minutes.

I imagine that's because it basically gets put at the "end of the queue". That seems not great to me; I'm wondering if action tasks should always have priority over alert tasks. That doesn't seem perfect either, for expensive actions. It sounds nice to maybe have a "priority" system here, where I could set my alerts at a lower priority than my actions, but it's complicated ™️

After catching up to master so I could change max_workers, I ran 100 alerts at a 1-second interval with 100 workers, and it's a lot more lively: there's only about 3 seconds of latency from when actions are scheduled until they are executed. Kibana is running at 150-200% CPU and 1GB RAM, ES at 80-150% CPU and 1GB RAM.

@peterschretlen
Contributor

It seems like there are 2 throughput concerns to consider:

  • Task manager throughput. max_workers creates a throughput limit, which may be artificial, as SIEM found in their tests: throttling would cause gaps to appear in detection, even though the system was capable of scaling beyond that limit.
  • Downstream system throughput. Tasks may hit systems that also have their own throughput limits (API rate limits, or load limits like the Elasticsearch 429 errors). SIEM hit this issue in testing by setting a very high max_workers, which moved the source of back-pressure from task manager to Elasticsearch. The throughput may even vary with the part of the system being used - are we getting errors from Elasticsearch indexing or from Elasticsearch queries? We may be hitting limits on one but not the other.

This creates a few problems today:

  1. The max_workers default of 10 as a global throttle is going to be too low for many SIEM systems and likely other deployments. What's the right value? Set too high, it could cripple the Elasticsearch cluster. Set too low, it will leave gaps in detection, which will only get worse as other apps add alerts.

  2. There's no way to deal with downstream system throughput limits. numWorkers will work at a task type level, but there might be many task types that use Elasticsearch, and they may use it in different ways (indexing vs search, or both in the case of SIEM).

@pmuellr
Member

pmuellr commented Jan 14, 2020

I feel like 10 is way too low. 100 seems about right to me. May be too big for some use cases (CPU /resource expensive alerts | actions).

numWorkers doesn't seem great since it doesn't "scale" with max_workers. if max_workers is 10, you could set reporting tasks to numWorkers 5 to limit running 2 reports at a time, but then if max_workers gets reset to 100, now you can run 20, which will probably kill your system. That's why "max_workers per type" makes a bit more sense to me.

@gmmorris
Contributor Author

Regarding numWorkers: my direction of thought was to scrap the old model and shift to a model where Task Type Definitions specify how many concurrent tasks can be executed - so it would have nothing to do with max_workers.
The reason for this is that numWorkers, as Patrick pointed out, is scaled by max_workers, and logically that means you can't say "I want only 2 reporting tasks, but as many Actions as possible".
Another reason is that it makes it hard to claim only as many tasks as you have capacity for - you'd often end up claiming too many and then locking them off from other Kibana instances until you finish processing the preceding tasks. This harmed performance.
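A hypothetical shape for that kind of definition (field names are illustrative only, not the shipped Task Manager API):

// Hypothetical: a task type declares its own concurrency cap,
// decoupled from xpack.task_manager.max_workers.
interface TaskTypeDefinition {
  type: string;
  title: string;
  maxConcurrency?: number; // max instances of this type running at once, per Kibana
  createTaskRunner: (context: { taskInstance: unknown }) => { run: () => Promise<unknown> };
}

const reportingTaskType: TaskTypeDefinition = {
  type: 'reporting:execute',
  title: 'Reporting',
  maxConcurrency: 2, // "only 2 reporting tasks", no matter how high max_workers is set
  createTaskRunner: () => ({
    run: async () => undefined, // the actual work would go here
  }),
};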

@gmmorris
Contributor Author

I feel like 10 is way too low. 100 seems about right to me. May be too big for some use cases (CPU /resource expensive alerts | actions).

Agreed, and I don't think we have any reason to keep it at 10, as I think it was only there in the first place to prevent multiple Reporting tasks from running on the same Kibana, and that doesn't actually work anymore (Reporting is aware and okayed it, agreeing I'd work on it in 7.7).

@gmmorris
Contributor Author

gmmorris commented Jan 15, 2020

The message from TM that gets logged when all the workers are in use seems ... not great.

[Task Ownership]: Task Manager has skipped Claiming Ownership \
of available tasks at it has ran out Available Workers. \
If this happens often, consider adjusting the \
"xpack.task_manager.max_workers" configuration
  • it's very long
  • it's annoying, and so that will lead the customer to update max_workers, which will cause a thundering herd issue as SIEM has seen
  • I don't really have a great list of concrete alternatives, but will discuss some ideas below

ideas:

  • have TM print some basic stats in one line, every minute, if any tasks have run in that minute; start basic - total number of executions, failures, timeouts, etc
  • have TM print some basic stats in one line, every threshold number of executions (100, 1000, not sure)

I think I like the first. Default might be 10 minutes, for dev I'd want to set it to 1 minute, or maybe even 30 seconds.

We could definitely collect this and flush it on a configurable interval, and in fact much of the work to do that has already been done, as we now track these events internally for the runNow API.
How would you envision describing the case where Task Manager skips polling for tasks when it runs out of available workers? A count of how many cycles were skipped? Not sure that would be helpful...

@gmmorris
Contributor Author

Some unorganized notes/thoughts:

  • Applying throttling in the callCluster we provide to executors seems like it makes sense

This shouldn't be too difficult, but would you envision a lossy throttle (callCluster rejects calls that are over capacity) or backpressure (accumulate requests until you have capacity, slowing down the calling executor)?

  • There's also poll_interval, which indicates how often TM looks for new work, set to 3 seconds. IIRC, there is a limit to how low you can get this to run today with the current TM architecture. Some of the alternative architectures (in PRs we never merged) let you get the interval lower. I don't remember the numbers, but I think I've seen sub-second intervals with the alt-architecture PRs, while the current architecture can only get to about 2 seconds.

In my perf tests I was polling every few hundred milliseconds, but the wall I kept hitting was that my tasks were waiting on ES, so TM polled more often than it freed up workers.

@gmmorris
Contributor Author

On Slack @peterschretlen asks:

I wonder could we predict (or adjust) a good max_workers value. ~autoscaling for task manager.

I have thought about this in the past, so brain dumping some questions we'd need ways of answering to achieve this:

  1. How can Kibana know what the overall ES cluster can handle?
  2. How can Task Manager know how "heavy" a certain Task is in order to decide if it needs to scale up/down?
  3. How do we prevent autoscaling dominoes, where Kibana thinks things are slow, so it bumps up max_workers, causing more load that further slows the system, causing it to spin up even more workers?
  4. How do we test this system? Our FT framework would not support anything like this. Should this even live in Kibana? 😆

@pmuellr
Member

pmuellr commented Jan 15, 2020

  • Applying throttling in the callCluster we provide to executors seems like it makes sense

This shouldn't be too difficult, but would you envision a lossy throttle (callCluster rejects calls that are over capacity) or backpressure (accumulate requests until you have capacity, slowing down the calling executor)?

backpressure
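A minimal sketch of a backpressure wrapper around callCluster (a plain promise-based semaphore; the callCluster signature and the concurrency limit shown here are assumptions, not the actual alerting implementation):

// Allow at most `limit` in-flight ES calls; extra calls wait for a free
// slot instead of being rejected (backpressure rather than a lossy throttle).
type CallCluster = (endpoint: string, params?: Record<string, unknown>) => Promise<unknown>;

function withBackpressure(callCluster: CallCluster, limit: number): CallCluster {
  let active = 0;
  const waiting: Array<() => void> = [];

  const acquire = () =>
    new Promise<void>((resolve) => {
      if (active < limit) {
        active++;
        resolve();
      } else {
        waiting.push(resolve);
      }
    });

  const release = () => {
    const next = waiting.shift();
    if (next) {
      next(); // hand the slot straight to the next waiter; `active` stays the same
    } else {
      active--;
    }
  };

  return async (endpoint, params) => {
    await acquire();
    try {
      return await callCluster(endpoint, params);
    } finally {
      release();
    }
  };
}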

@peterschretlen
Contributor

The message from TM that gets logged when all the workers are in use seems ... not great.

[Task Ownership]: Task Manager has skipped Claiming Ownership \
of available tasks at it has ran out Available Workers. \
If this happens often, consider adjusting the \
"xpack.task_manager.max_workers" configuration
  • it's very long
  • it's annoying, and so that will lead the customer to update max_workers, which will cause a thundering herd issue as SIEM has seen
  • I don't really have a great list of concrete alternatives, but will discuss some ideas below

ideas:

  • have TM print some basic stats in one line, every minute, if any tasks have run in that minute; start basic - total number of executions, failures, timeouts, etc
  • have TM print some basic stats in one line, every threshold number of executions (100, 1000, not sure)

I think I like the first. Default might be 10 minutes, for dev I'd want to set it to 1 minute, or maybe even 30 seconds.

I think this is a good measure we can take now - regardless of other measures we take, I think we can improve logging and monitoring.
Opened as #54920

@gmmorris
Contributor Author

Following a call between Alerting & SIEM, we've decided to move forward on:

  1. Investigate an error encountered by SIEM with API key usage on creation of new rules. @mikecote will investigate.
  2. SIEM will investigate the source of the 429 errors they've been seeing, owned by @FrankHassanabad
  3. Alerting will look into the possibility that the heavy usage of SavedObjects might play a role in the stress on ES in SIEM's testing. @pmuellr will schedule a session to run a stress test over the alerting framework.

We discussed the need for a circuit breaker on the callCluster API provided to Alert Executors, but first we'll investigate the origin of the stress on ES.
We also discussed the need for the ability to limit the number of concurrent instances of an AlertType in a single Kibana (related to #54916, but at the AlertType level, rather than just the TaskTypeDefinition level); @gmmorris will look into this further.

@pmuellr
Member

pmuellr commented Jan 16, 2020

We also discussed the need for the ability to limit the number of concurrent instances of an AlertType in a single Kibana

If alerts get their own per-alertType max_workers, we may want the same for actions as well. Is there already a discuss issue on this? It seems complicated; there are a bunch of things we'd max_worker-ize, some will be easier, others harder ... both our own implementation and users tweaking the knobs and dials.

@peterschretlen
Contributor

I think the following test will re-create the problem, and it suggests that the issue is the saved object write load from alerting/task manager.

  1. in kibana.yml, set xpack.task_manager.max_workers : 1000
  2. download and set up auditbeat: ./auditbeat setup (note you don't actually have to run auditbeat, just run the setup to create the index and ILM policies)

If you are using a dev environment with ssl, use the following auditbeat.yml

auditbeat.modules:

- module: file_integrity
  paths:
  - /bin
  - /usr/bin
  - /usr/local/bin
  - /sbin
  - /usr/sbin
  - /usr/local/sbin

- module: system
  datasets:
    - host    # General host information, e.g. uptime, IPs
    - package # Installed, updated, and removed packages
    - process # Started and stopped processes
  state.period: 12h

setup.template.settings:
  index.number_of_shards: 1

setup.kibana:
  host: "https://localhost:5601"
  ssl:
    verification_mode: "none"

output.elasticsearch:
  hosts: ["localhost:9200"]
  protocol: "https"
  username: "elastic"
  password: "changeme"
  ssl:
    verification_mode: "none"

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  3. start Kibana and go to the detection engine page: https://localhost:5601/app/siem#/detection-engine/rules
  4. activate one of the built-in rules (to create the signals index), then deactivate it.
  5. import the following ruleset: https://gist.github.com/peterschretlen/01dc466bc311cc576658031f57112d44
    (this imports 1000 rules, each making an API call from the browser, so it takes a while)
  6. After the rules have imported, shut down Kibana and wait 2 minutes so all the rules are overdue.
  7. Restart Kibana.

Result:

You'll see a lot of entries like the following in the kibana logs about failed task updates:

server    log   [17:45:51.208] [error][plugins][taskManager][taskManager] Failed to mark Task alerting:siem.signals "75355de0-38b1-11ea-be6b-3999ff8c2ef6" as running: rejected execution of processing of [101718][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[.kibana_task_manager_1][0]] containing [index {[.kibana_task_manager][task:75355de0-38b1-11ea-be6b-3999ff8c2ef6], source[{"migrationVersion":{"task":"7.6.0"},"task":{"taskType":"alerting:siem.signals","retryAt":"2020-01-16T22:55:49.107Z","runAt":"2020-01-16T22:43:23.758Z","scope":["alerting"],"startedAt":"2020-01-16T22:45:49.001Z","state":"{\"alertInstances\":{},\"previousStartedAt\":\"2020-01-16T22:42:23.758Z\"}","params":"{\"alertId\":\"e233cf7a-efb7-4647-b8dc-a9de8934fad3\",\"spaceId\":\"default\"}","ownerId":"kibana:5b2de169-2785-441b-ae8c-186a1936b17d","scheduledAt":"2020-01-16T22:42:21.502Z","attempts":1,"status":"running"},"references":[],"updated_at":"2020-01-16T22:45:49.107Z","type":"task"}]}], target allocation id: pYZI4R-fR0-vEFIyMcw-VQ, primary term: 1 on EsThreadPoolExecutor[name = Peters-MacBook-Pro.local/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1dfe5c6b[Running, pool size = 12, active threads = 12, queued tasks = 200, completed tasks = 25129]]: [remote_transport_exception] [Peters-MacBook-Pro.local][127.0.0.1:9300][indices:data/write/update[s]]
server    log   [17:45:51.234] [error][plugins][taskManager][taskManager] Failed to mark Task alerting:siem.signals "7534e8b0-38b1-11ea-be6b-3999ff8c2ef6" as running: rejected execution of processing of [101721][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[.kibana_task_manager_1][0]] containing [index {[.kibana_task_manager][task:7534e8b0-38b1-11ea-be6b-3999ff8c2ef6], source[{"migrationVersion":{"task":"7.6.0"},"task":{"taskType":"alerting:siem.signals","retryAt":"2020-01-16T22:55:49.107Z","runAt":"2020-01-16T22:43:23.758Z","scope":["alerting"],"startedAt":"2020-01-16T22:45:49.001Z","state":"{\"alertInstances\":{},\"previousStartedAt\":\"2020-01-16T22:42:23.758Z\"}","params":"{\"alertId\":\"f467691f-562c-4585-83cb-403f020a8b91\",\"spaceId\":\"default\"}","ownerId":"kibana:5b2de169-2785-441b-ae8c-186a1936b17d","scheduledAt":"2020-01-16T22:42:21.499Z","attempts":1,"status":"running"},"references":[],"updated_at":"2020-01-16T22:45:49.107Z","type":"task"}]}], target allocation id: pYZI4R-fR0-vEFIyMcw-VQ, primary term: 1 on EsThreadPoolExecutor[name = Peters-MacBook-Pro.local/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1dfe5c6b[Running, pool size = 12, active threads = 12, queued tasks = 200, completed tasks = 25031]]: [remote_transport_exception] [Peters-MacBook-Pro.local][127.0.0.1:9300][indices:data/write/update[s]]

The elasticsearch thread pool stats confirm that updates are being rejected:

peterschretlen ~ $ curl -k -u elastic:changeme https://localhost:9200/_cat/thread_pool/write?v
node_name                name  active queue rejected
Peters-MacBook-Pro.local write      0     0      456

@gmmorris
Contributor Author

We also discussed the need for the ability to limit the number of concurrent instances of an AlertType in a single Kibana

If alerts get their own per-alertType max_workers, we may want the same for actions as well. Is there already a discuss issue on this? It seems complicated; there are a bunch of things we'd max_worker-ize, some will be easier, others harder ... both our own implementation and users tweaking the knobs and dials.

I've changed the task-specific one to cover both Task Types and Alert Types, as I don't feel we'd want those two to be separate solutions. Best to merge these into one discussion that takes both into account.

#54916

@MindyRS MindyRS added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Oct 27, 2020
@gmmorris
Contributor Author

Given the implementation of this, can we close this issue? 🤔

@pmuellr
Member

pmuellr commented Jan 19, 2021

Seems fine to close this, but let's wait for Frank, since I know he was interested in this.

IIRC, one of the early thoughts on this was to "spread out" the alerts so they wouldn't all be scheduled at the same time. I think this will actually occur naturally given the worker count / polling interval - only so many alerts can run at a time, and alerts are scheduled to run based on their interval at the time they LAST ran. So you should see a "clump" of alerts ready to run at the beginning, but then they should be spread out as they run, and maintain the "spread" as well over time.

So I'm curious if there's still a need to somehow do multi-alert scheduling in such a way that the alerts are scheduled to run in a non-overlap kind of fashion, which seems hard - especially since it would presumably need to take the currently scheduled tasks into account as well.

@YulNaumenko YulNaumenko added the technical debt Improvement of the software architecture and operational architecture label Mar 11, 2021
@mikecote
Contributor

Backpressure to task manager when encountering 429 errors has been added as part of #77096. Closing the issue.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022