-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ResponseOps][task manager] log event loop delay for tasks when over configured limit #126300
Conversation
8e30beb
to
988a97e
Compare
One of the concerns I have with the current approach, is that we are using However, there's a fair bit of anecdotal evidence that says we won't:
So, have good feels about this, now. And we will have a way to "turn it off" anyway, by setting a |
cdb7251
to
eeed244
Compare
… limit resolves elastic#124366 Adds new task manager configuration keys. - `xpack.task_manager.event_loop_delay.monitor` - whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today - `xpack.task_manager.event_loop_delay.warn_on_delay` - the number of milliseconds of event loop delay before logging a warning This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect the event loop delay while a task is running. [1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs. To make it a bit easier to find these when viewing Kibana logs in Discover, tags are added to the logged messages to make it easier to find them. One tag is `event-loop-blocked`, and the other tag added is a string consisting of the task type and task id.
eeed244
to
a535177
Compare
Pinging @elastic/response-ops (Team:ResponseOps) |
wondering where we want to backport this to. Guessing the backport may not be that bad - 8.1.x? 7.x? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docker config update LGTM
If we're confident this won't introduce performance degradations, let's backport to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Verified I was able to see the warnings logged by setting the threshold very low. Nice work
@elasticmachine merge upstream |
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM! Had one nit and question 👍
`xpack.task_manager.event_loop_delay.monitor`:: | ||
Enables event loop delay monitoring, which will log a warning when a task causes an event loop delay which exceeds the `warn_on_delay` setting. Defaults to true. | ||
|
||
`xpack.task_manager.event_loop_delay.warn_on_delay`:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: warn_on_delay
made me think the value is a boolean. Would something like warn_threshold
make it clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
names! Ya, don't like the one I picked. I think I considered threshold, but seemed weird thinking about time being the threshold value. Or maybe that was back before I split the config into two keys and the resulting key name was SOOO long. Let me give it a go, I suspect I will be happy with it now ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in commit cbffd27 to warn_threshold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
this.logger.warn( | ||
`event loop blocked for at least ${eventLoopBlockMs} ms while running task ${taskLabel}`, | ||
{ | ||
tags: [taskLabel, 'event-loop-blocked'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: could we also add taskType
? This should allow us to break it down by rule type indirectly I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, taskType
as an additional tag value? Ya, that could be useful. It's possible it could be noisy. Let me give it a go, but guessing it will be fine (and useful!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was thinking just as a `tag 👍 I'm not sure of the downsides, if any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in cbffd27 to add the taskType as well.
I think we'll have to see how well this works out in practice. The problem I think is going to be the extra noise of the tasktype tags, but I suspect it won't be too bad. Already when you look at the distribution of tags
values, after filtering on tags:event-loop-blocked
, you have to mentally "ignore" the three that are always there: kibana
, logger
, and event-loop-blocked
- now we'll have the task types as well. But presumably would only be a handful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, thanks!
@elasticmachine merge upstream |
💚 Build SucceededMetrics [docs]Unknown metric groupsESLint disabled line counts
Total ESLint disabled count
History
To update your PR or re-run it, just comment with: |
…configured limit (elastic#126300) resolves elastic#124366 Adds new task manager configuration keys. - `xpack.task_manager.event_loop_delay.monitor` - whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today - `xpack.task_manager.event_loop_delay.warn_threshold` - the number of milliseconds of event loop delay before logging a warning This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect the event loop delay while a task is running. [1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs. To make it a bit easier to find these when viewing Kibana logs in Discover, tags are added to the logged messages to make it easier to find them. One tag is `event-loop-blocked`, second is the task type, and the third is a string consisting of the task type and task id. (cherry picked from commit b028cf9)
…configured limit (elastic#126300) resolves elastic#124366 Adds new task manager configuration keys. - `xpack.task_manager.event_loop_delay.monitor` - whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today - `xpack.task_manager.event_loop_delay.warn_threshold` - the number of milliseconds of event loop delay before logging a warning This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect the event loop delay while a task is running. [1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs. To make it a bit easier to find these when viewing Kibana logs in Discover, tags are added to the logged messages to make it easier to find them. One tag is `event-loop-blocked`, second is the task type, and the third is a string consisting of the task type and task id.
…configured limit (elastic#126300) resolves elastic#124366 Adds new task manager configuration keys. - `xpack.task_manager.event_loop_delay.monitor` - whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today - `xpack.task_manager.event_loop_delay.warn_threshold` - the number of milliseconds of event loop delay before logging a warning This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect the event loop delay while a task is running. [1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs. To make it a bit easier to find these when viewing Kibana logs in Discover, tags are added to the logged messages to make it easier to find them. One tag is `event-loop-blocked`, second is the task type, and the third is a string consisting of the task type and task id. (cherry picked from commit b028cf9) # Conflicts: # src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker # x-pack/plugins/task_manager/server/config.test.ts # x-pack/plugins/task_manager/server/polling_lifecycle.ts # x-pack/plugins/task_manager/server/task_running/task_runner.test.ts # x-pack/plugins/task_manager/server/task_running/task_runner.ts
…configured limit (#126300) (#128377) resolves #124366 Adds new task manager configuration keys. - `xpack.task_manager.event_loop_delay.monitor` - whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today - `xpack.task_manager.event_loop_delay.warn_threshold` - the number of milliseconds of event loop delay before logging a warning This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect the event loop delay while a task is running. [1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs. To make it a bit easier to find these when viewing Kibana logs in Discover, tags are added to the logged messages to make it easier to find them. One tag is `event-loop-blocked`, second is the task type, and the third is a string consisting of the task type and task id. (cherry picked from commit b028cf9)
…n over configured limit (#126300) (#128402) resolves #124366 Adds new task manager configuration keys. - `xpack.task_manager.event_loop_delay.monitor` - whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today - `xpack.task_manager.event_loop_delay.warn_threshold` - the number of milliseconds of event loop delay before logging a warning This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect the event loop delay while a task is running. [1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs. To make it a bit easier to find these when viewing Kibana logs in Discover, tags are added to the logged messages to make it easier to find them. One tag is `event-loop-blocked`, second is the task type, and the third is a string consisting of the task type and task id.
Summary
resolves #124366
Adds new task manager configuration keys:
xpack.task_manager.event_loop_delay.monitor
- whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today.xpack.task_manager.event_loop_delay.warn_threshold
- the number of milliseconds of event loop delay before logging a warningThis code uses the
perf_hooks.monitorEventLoopDelay()
API [1] to collect the event loop delay while a task is running.[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions
When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs.
The warning messages look like this:
To make it a bit easier to find these when viewing Kibana logs in Discover, tags are added to the logged messages to make it easier to find them. One tag is
event-loop-blocked
, and the other tag added is a string consisting of the task type and task id. For example:Viewing Kibana logs with Discover, it's easy to narrow down frequent tasks by exploring the
tags
field, after filtering ontags:event-loop-blocked
:The
alerting:.index-threshold af9fa...
task at 41% is occurring more frequently than the nextalerting:
task, so it's probably the event loop delay offender. You can click on the "Visualize" button to get a better view in Lens.The initial Lens view shows that task significantly larger than the next one:
But if we customize the top-N to show 20, it's more clear this task seems to be problematic.
Checklist
Delete any items that are not applicable to this PR.