[Discuss] Should we stagger requests to Elasticsearch when Alerts clump up? #54697
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Pinging @elastic/siem (Team:SIEM)
If the bulk of the task load is (mostly) querying Elasticsearch and (occasionally) indexing signal data, then I think a good starting point would be to set the worker count based on query throughput. A rough measure for query throughput is the default search thread pool size for the nodes on a cluster, which is `int((# of allocated processors * 3) / 2) + 1`.
So if you have 32 CPU nodes, you'll have 49 threads for searches. 429s/503s will be due to the throughput of the cluster: once all the threads in a pool on Elasticsearch are in use, requests are queued. If the active requests take a long time, the queue will fill up and start rejecting, which you'll see as a 429 or 503 depending on the API. This assumes all tasks put load on Elasticsearch, which is not true (actions, for instance, won't place much load, with the exception of indexing actions). At some point it may be worth having a different pool of workers for tasks that hit Elasticsearch.
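As a quick sanity check of that formula (this snippet is illustrative only, not part of the alerting code), the 32-CPU example above works out as follows:

```ts
// Default Elasticsearch search thread pool size: int((allocated processors * 3) / 2) + 1.
// For a 32-CPU node this gives the 49 search threads mentioned above.
function searchThreadPoolSize(allocatedProcessors: number): number {
  return Math.floor((allocatedProcessors * 3) / 2) + 1;
}

console.log(searchThreadPoolSize(32)); // => 49
```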
Some unorganized notes/thoughts:

This is indeed vestigial, but it's coming back in 7.7 :)
The message from TM that gets logged when all the workers are in use seems ... not great.
ideas:
I think I like the first. Default might be 10 minutes; for dev I'd want to set it to 1 minute, or maybe even 30 seconds.
I have a stress test for alerts that uses some other plugins/tools I've built for heartbeats that makes essql calls - in a gist here: https://gist.github.com/pmuellr/29042ba4fd58f8e4c088d3b0a703da2e

One of the interesting things about running this with 100 alerts at a 1 sec interval with 10 workers is that the action execution happens about 30 seconds after the action is scheduled to run in the alert. With 1000 alerts at a 1 sec interval, the delay gets up to 2.5 minutes. I imagine that's because it basically gets put at the "end of the queue". That seems not great to me, and I'm wondering if action tasks should always have priority over alert tasks. That doesn't seem perfect either, for expensive actions. It sounds nice to maybe have a "priority" system here, where I could set my alerts at a lower priority than my actions, but it's complicated ™️

Catching up to master so I could change the max_workers, I ran 100 alerts at a 1 sec interval with 100 workers, and it's a lot more lively in that there's only ~3 seconds of latency from when actions are scheduled till when they are executed. Kibana is running at 150-200% CPU, 1GB RAM; ES at 80-150% CPU, 1GB RAM.
It seems like there are 2 throughput concerns to consider:
This creates a few problems today:
I feel like 10 is way too low. 100 seems about right to me. It may be too big for some use cases (CPU/resource-expensive alerts | actions).
Regarding |
Agreed, and I don't think we have any reason to keep it at 10; I think it was only there in the first place to prevent multiple Reporting tasks from running on the same Kibana, and that doesn't actually work anymore (Reporting are aware and okayed it, agreeing I'd work on it in 7.7).
We could definitely collect this and flush on a configurable interval, and in fact, much of the work to do that has already been done as we now track these events internally for the |
This shouldn't be too difficult, but would you envision a lossy throttle (callCluster rejects calls that are over capacity) or backpressure (accumulating requests until you have capacity, slowing down the calling executor)?
In my perf tests I was polling every few hundred ms, but the wall I kept hitting was that my tasks were waiting on ES, so TM polled more often than it was freeing up workers.
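To make the lossy-throttle vs. backpressure distinction concrete, here's a rough sketch; the `ConcurrencyLimiter` and the call wrapper are hypothetical names, not the actual Kibana/Task Manager API:

```ts
// Hypothetical limiter around Elasticsearch calls made by tasks.
// "Lossy" rejects calls that arrive while at capacity; "backpressure" parks them.
type ClusterCall = () => Promise<unknown>;

class ConcurrencyLimiter {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(
    private readonly maxConcurrent: number,
    private readonly lossy: boolean
  ) {}

  async run(call: ClusterCall): Promise<unknown> {
    while (this.active >= this.maxConcurrent) {
      if (this.lossy) {
        // Lossy throttle: the caller is told it's over capacity and must cope.
        throw new Error('over capacity');
      }
      // Backpressure: wait for a slot to free up, then re-check the limit.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    this.active++;
    try {
      return await call();
    } finally {
      this.active--;
      this.waiters.shift()?.(); // wake one queued caller, if any
    }
  }
}
```

With `new ConcurrencyLimiter(10, false)` a busy cluster just slows the calling executors down, while the lossy variant surfaces the over-capacity condition immediately and pushes the retry decision onto the caller.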
On Slack @peterschretlen asks:
I have thought about this in the past, so brain dumping some questions we'd need ways of answering to achieve this:
backpressure
I think this is a good measure we can take now - regardless of other measures we take, I think we can improve logging and monitoring.
Following a call between Alerting & SIEM, we've decided to move forward on:
We discussed the need for a circuit breaker on the |
If alerts gets their own per-alertType |
I think the following test will re-create the problem; it suggests that the issue is the saved object write load from alerting/task manager.
If you are using a dev environment with SSL, use the following auditbeat.yml:
Result: You'll see a lot of entries like the following in the Kibana logs about failed task updates:
The Elasticsearch thread pool stats confirm that updates are being rejected:
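The raw stats output isn't reproduced here, but for anyone trying to confirm the same thing, the numbers come from Elasticsearch's `_cat/thread_pool` API. A minimal sketch, assuming an unauthenticated dev cluster at `http://localhost:9200` (adjust host/credentials as needed):

```ts
// Prints active/queued/rejected counts for the write and search thread pools,
// which is where task-update rejections show up.
async function printThreadPoolStats(): Promise<void> {
  const res = await fetch(
    'http://localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'
  );
  console.log(await res.text());
}

printThreadPoolStats().catch(console.error);
```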
I've changed the task-specific one to be both Task Types and Alert Types, as I don't feel we'd like those two to be separate solutions. Best to merge these into one discussion that takes both into account.
Given the implementation of this issue, can we close this issue? 🤔
Seems fine to close this, but let's wait for Frank, since I know he was interested in this.

IIRC, one of the early thoughts on this was to "spread out" the alerts so they wouldn't all be scheduled at the same time. I think this will actually occur naturally given the worker count / polling interval - only so many alerts can run at a time, and alerts are scheduled to run based on their interval at the time they LAST ran. So you should see a "clump" of alerts ready to run at the beginning, but then they should be spread out as they run, and maintain the "spread" over time as well.

So I'm curious if there's still a need to somehow do multi-alert scheduling in such a way that the alerts are scheduled to run in a non-overlapping fashion, which seems hard - especially since it would presumably need to take the currently scheduled tasks into account as well.
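To illustrate the "natural spreading" claim, here's a toy simulation; the numbers and scheduling loop are simplified assumptions, not Task Manager's actual scheduler. A clump of alerts that all start due at t=0 drifts apart because only a limited number run per polling tick, and each alert's next run is based on the tick it actually ran on:

```ts
// Toy model: alertCount alerts on the same interval, a fixed worker limit, and a
// polling tick of tickMs. Each due alert is rescheduled from the tick it ran on.
interface Alert { id: number; nextRunAt: number; }

function simulate(alertCount: number, intervalMs: number, workers: number, tickMs: number, untilMs: number) {
  const alerts: Alert[] = Array.from({ length: alertCount }, (_, id) => ({ id, nextRunAt: 0 }));
  const runs: Array<{ t: number; id: number }> = [];
  for (let t = 0; t <= untilMs; t += tickMs) {
    const due = alerts.filter((a) => a.nextRunAt <= t).slice(0, workers); // worker limit per tick
    for (const a of due) {
      runs.push({ t, id: a.id });
      a.nextRunAt = t + intervalMs; // next run based on when it LAST ran
    }
  }
  return runs;
}

// 10 alerts, 1s interval, 3 workers, 200ms poll: the initial clump at t=0 spreads
// into staggered groups of at most 3, and stays spread on subsequent intervals.
console.log(simulate(10, 1000, 3, 200, 3000));
```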
Backpressure to task manager when encountering 429 errors has been added as part of #77096. Closing the issue.
Following some testing by SIEM (@rylnd and @FrankHassanabad), we've encountered an issue where ES couldn't keep up with the requests made by the Alerting service.
@FrankHassanabad's descriptions of the situation:
Questioning the validity of setting a `max_workers` that's higher than what the ES cluster can handle, @FrankHassanabad mentions that 429 Too Many Requests might not actually mean that the cluster can't handle that many open requests, but rather that they have been throttled by ES to prevent a spike of calls flooding the system all at once (rapid fire). If that's the case, then staggering the requests made by Kibana could address this. I'm not familiar enough with ES to advise on that; perhaps someone else knows?
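If the 429s really are about burst shape rather than total capacity, one Kibana-side mitigation would be to add jitter before calls and back off on 429s. A rough, hypothetical sketch - none of these names are actual alerting APIs, and `err.statusCode` is assumed to carry the HTTP status:

```ts
// Illustrative only: spread calls with random jitter and retry 429s with
// exponential backoff, so a clump of alerts doesn't hit ES at the same instant.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function staggeredCall<T>(
  call: () => Promise<T>,
  { maxRetries = 3, baseDelayMs = 500, maxJitterMs = 250 } = {}
): Promise<T> {
  // Random jitter up front staggers otherwise-simultaneous callers.
  await sleep(Math.random() * maxJitterMs);
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err: any) {
      if (err?.statusCode !== 429 || attempt >= maxRetries) throw err;
      // Exponential backoff between retries on 429 Too Many Requests.
      await sleep(baseDelayMs * 2 ** attempt + Math.random() * maxJitterMs);
    }
  }
}
```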
Open questions: