[ResponseOps][task manager] log event loop delay for tasks when over configured limit #126300

pmuellr · 2022-02-23T22:23:17Z

Summary

Adds new task manager configuration keys:

xpack.task_manager.event_loop_delay.monitor - whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today.
xpack.task_manager.event_loop_delay.warn_threshold - the number of milliseconds of event loop delay before logging a warning
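
As a rough illustration of how these two keys might be consumed, the sketch below models them as a plain TypeScript object and applies the threshold check. The names `EventLoopDelayConfig`, `exampleConfig`, and `shouldWarn`, and the 5000 ms value, are assumptions made for illustration; the PR defines the settings through Kibana's own config schema, which is not reproduced here.

```typescript
// Hypothetical shape mirroring the xpack.task_manager.event_loop_delay.* keys.
interface EventLoopDelayConfig {
  monitor: boolean; // whether event loop delay is monitored at all
  warn_threshold: number; // milliseconds of delay before a warning is logged
}

// Illustrative values only; the threshold here is an assumed number.
const exampleConfig: EventLoopDelayConfig = { monitor: true, warn_threshold: 5000 };

// A warning is only warranted when monitoring is on and the delay exceeds the threshold.
function shouldWarn(eventLoopBlockMs: number, config: EventLoopDelayConfig): boolean {
  return config.monitor && eventLoopBlockMs > config.warn_threshold;
}
```

With these example values, the 6094 ms delay from the sample warning further down would be logged, while a 100 ms delay would not.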

This code uses the perf_hooks.monitorEventLoopDelay() API [1] to collect the event loop delay while a task is running.

[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions
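
A minimal sketch of wrapping a task run with that API follows. It is not the code from this PR: the helper name `runWithEventLoopMonitoring` and the `TimedResult` type are made up for illustration, and it only assumes the documented `enable()`, `disable()`, and `max` members of the histogram returned by `monitorEventLoopDelay()`.

```typescript
import { monitorEventLoopDelay } from 'perf_hooks';

// Hypothetical result type: the task's own result plus the worst delay observed.
interface TimedResult<T> {
  result: T;
  eventLoopBlockMs: number;
}

// Run a task while sampling event loop delay, and report the largest delay seen.
async function runWithEventLoopMonitoring<T>(task: () => Promise<T>): Promise<TimedResult<T>> {
  const histogram = monitorEventLoopDelay(); // default sampling resolution is 10 ms
  histogram.enable();
  try {
    const result = await task();
    // The histogram records nanoseconds; convert to whole milliseconds.
    const eventLoopBlockMs = Math.round(histogram.max / 1e6);
    return { result, eventLoopBlockMs };
  } finally {
    // Always stop sampling, even if the task throws.
    histogram.disable();
  }
}
```

Because the histogram is shared by everything running on the event loop, the value reported for one task also includes delay caused by other tasks running at the same time.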

When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs.
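
The "outnumber over time" reasoning can be sketched as a simple tally over the tags of the logged warnings. This is only an illustration of the idea, not something the PR ships; `rankTags` is a hypothetical helper, and in practice Discover or Lens does this counting for you.

```typescript
// Count how often each tag appears across warning entries, skipping tags that
// every entry carries, and return [tag, count] pairs with the most frequent first.
function rankTags(
  entries: string[][],
  ignore: string[] = ['event-loop-blocked']
): Array<[string, number]> {
  const ignored = new Set(ignore);
  const counts = new Map<string, number>();
  for (const tags of entries) {
    for (const tag of tags) {
      if (ignored.has(tag)) continue;
      counts.set(tag, (counts.get(tag) ?? 0) + 1);
    }
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

A task that blocks the event loop on most of its runs ends up at the top of this ranking, while peer tasks that were only occasionally caught in someone else's delay stay further down.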

The warning messages look like this:

event loop blocked for at least 6094 ms while running task alerting:.index-threshold af9facf0-9512-11ec-aaab-571072c92061

To make these easier to find when viewing Kibana logs in Discover, tags are added to the logged messages. One tag is event-loop-blocked, and the other is a string consisting of the task type and task id. For example:

alerting:.index-threshold af9facf0-9512-11ec-aaab-571072c92061
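
Putting the message and the two tags together might look roughly like this. `buildEventLoopWarning` is a hypothetical helper used only to show the shape of the output; the message text and tag values follow the examples in this description.

```typescript
// Build the warning text and tags for a task run that exceeded the threshold.
function buildEventLoopWarning(taskType: string, taskId: string, eventLoopBlockMs: number) {
  // The task label is the task type and task id separated by a space.
  const taskLabel = `${taskType} ${taskId}`;
  return {
    message: `event loop blocked for at least ${eventLoopBlockMs} ms while running task ${taskLabel}`,
    tags: [taskLabel, 'event-loop-blocked'],
  };
}
```

Filtering on the `event-loop-blocked` tag narrows the logs to these warnings, and the task label tag then identifies which task each one belongs to.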

Viewing Kibana logs with Discover, it's easy to narrow down frequent tasks by exploring the tags field, after filtering on tags:event-loop-blocked:

The alerting:.index-threshold af9fa... task at 41% is occurring more frequently than the next alerting: task, so it's probably the event loop delay offender. You can click on the "Visualize" button to get a better view in Lens.

The initial Lens view shows that task significantly larger than the next one:

But if we customize the top-N to show 20, it's more clear this task seems to be problematic.

Checklist

Delete any items that are not applicable to this PR.

Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list

pmuellr · 2022-03-04T17:53:57Z

One of the concerns I have with the current approach is that we are wrapping monitorEventLoopDelay() around every task execution, and it's not completely clear to me that this won't cause performance issues itself.

However, there's a fair bit of anecdotal evidence that says it won't:

it's using the chrome trace APIs under the covers, which means it's using the lowest-overhead way of doing monitoring available to v8, used for other similar at-scale perf monitoring metrics
a Twitter post from James Snell claiming it "can be enabled at production" - I've worked with James in the past, I know he's pretty serious about this stuff
Elastic has contributed to a Fastify plugin, under-pressure, which does this exact same thing

So, have good feels about this, now. And we will have a way to "turn it off" anyway, by setting a ~~"warning limit" to a non-positive value~~ specific config value to turn it off. (edit by pmuellr; changed from using a magic value on the warning level to a separate binary config value)

… limit

resolves elastic#124366

Adds new task manager configuration keys.

- `xpack.task_manager.event_loop_delay.monitor` - whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today
- `xpack.task_manager.event_loop_delay.warn_on_delay` - the number of milliseconds of event loop delay before logging a warning

This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect the event loop delay while a task is running.

[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions

When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs.

To make it a bit easier to find these when viewing Kibana logs in Discover, tags are added to the logged messages to make it easier to find them. One tag is `event-loop-blocked`, and the other tag added is a string consisting of the task type and task id.

elasticmachine · 2022-03-09T17:22:14Z

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr · 2022-03-09T17:22:58Z

wondering where we want to backport this to. Guessing the backport may not be that bad - 8.1.x? 7.x?

spalger

Docker config update LGTM

kobelb · 2022-03-09T17:45:07Z

> wondering where we want to backport this to. Guessing the backport may not be that bad - 8.1.x? 7.x?

If we're confident this won't introduce performance degradations, let's backport to 7.x. We continue to have SDHs for 7.x where this would be super helpful.

ymao1

Looks great! Verified I was able to see the warnings logged by setting the threshold very low. Nice work

pmuellr · 2022-03-10T22:58:42Z

@elasticmachine merge upstream

pmuellr · 2022-03-14T16:35:11Z

@elasticmachine merge upstream

mikecote

Changes LGTM! Had one nit and question 👍

mikecote · 2022-03-21T18:00:20Z

docs/settings/task-manager-settings.asciidoc

+`xpack.task_manager.event_loop_delay.monitor`::
+Enables event loop delay monitoring, which will log a warning when a task causes an event loop delay which exceeds the `warn_on_delay` setting.  Defaults to true.
+
+`xpack.task_manager.event_loop_delay.warn_on_delay`::


nit: warn_on_delay made me think the value is a boolean. Would something like warn_threshold make it clearer?

names! Ya, don't like the one I picked. I think I considered threshold, but seemed weird thinking about time being the threshold value. Or maybe that was back before I split the config into two keys and the resulting key name was SOOO long. Let me give it a go, I suspect I will be happy with it now ...

changed in commit cbffd27 to warn_threshold

mikecote · 2022-03-21T18:03:35Z

x-pack/plugins/task_manager/server/task_running/task_runner.ts

+      this.logger.warn(
+        `event loop blocked for at least ${eventLoopBlockMs} ms while running task ${taskLabel}`,
+        {
+          tags: [taskLabel, 'event-loop-blocked'],


question: could we also add taskType? This should allow us to break it down by rule type indirectly I think?

So, taskType as an additional tag value? Ya, that could be useful. It's possible it could be noisy. Let me give it a go, but guessing it will be fine (and useful!)

Yeah, I was thinking just as a `tag` 👍 I'm not sure of the downsides, if any.

changed in cbffd27 to add the taskType as well.

I think we'll have to see how well this works out in practice. The problem I think is going to be the extra noise of the task type tags, but I suspect it won't be too bad. Already when you look at the distribution of tag values, after filtering on tags:event-loop-blocked, you have to mentally "ignore" the three that are always there: kibana, logger, and event-loop-blocked - now we'll have the task types as well. But presumably there would only be a handful.

Sounds good, thanks!

pmuellr · 2022-03-22T13:28:20Z

@elasticmachine merge upstream

… PR review

kibana-ci · 2022-03-22T22:24:34Z

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id	before	after	diff
`taskManager`	19	20	+1

Total ESLint disabled count

id	before	after	diff
`taskManager`	19	20	+1

History

💚 Build #32618 succeeded 1167b41
💚 Build #30065 succeeded ce34e0d
💚 Build #29662 succeeded 3ed8031
💚 Build #28532 succeeded a535177
💔 Build #28216 failed eeed244
💔 Build #27033 failed cdb7251

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

…configured limit (elastic#126300)

resolves elastic#124366

Adds new task manager configuration keys.

- `xpack.task_manager.event_loop_delay.monitor` - whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today
- `xpack.task_manager.event_loop_delay.warn_threshold` - the number of milliseconds of event loop delay before logging a warning

This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect the event loop delay while a task is running.

[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions

When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs.

To make it a bit easier to find these when viewing Kibana logs in Discover, tags are added to the logged messages to make it easier to find them. One tag is `event-loop-blocked`, second is the task type, and the third is a string consisting of the task type and task id.

(cherry picked from commit b028cf9)

…configured limit (elastic#126300)

resolves elastic#124366

Adds new task manager configuration keys.

- `xpack.task_manager.event_loop_delay.monitor` - whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today
- `xpack.task_manager.event_loop_delay.warn_threshold` - the number of milliseconds of event loop delay before logging a warning

This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect the event loop delay while a task is running.

[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions

When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs.

To make it a bit easier to find these when viewing Kibana logs in Discover, tags are added to the logged messages to make it easier to find them. One tag is `event-loop-blocked`, second is the task type, and the third is a string consisting of the task type and task id.

(cherry picked from commit b028cf9)

# Conflicts:
# src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker
# x-pack/plugins/task_manager/server/config.test.ts
# x-pack/plugins/task_manager/server/polling_lifecycle.ts
# x-pack/plugins/task_manager/server/task_running/task_runner.test.ts
# x-pack/plugins/task_manager/server/task_running/task_runner.ts

pmuellr added the ci:deploy-cloud label Feb 23, 2022

pmuellr force-pushed the alerting/event-loop-block branch from 8e30beb to 988a97e Compare March 1, 2022 20:23

pmuellr force-pushed the alerting/event-loop-block branch from cdb7251 to eeed244 Compare March 7, 2022 22:57

pmuellr changed the title ~~[ResponseOps] collect / publish event loop delay information~~ [ResponseOps][task manager] log event loop delay for tasks when over configured limit Mar 8, 2022

pmuellr force-pushed the alerting/event-loop-block branch from eeed244 to a535177 Compare March 8, 2022 19:22

pmuellr added the Feature:Task Manager, release_note:skip, Team:ResponseOps, and v8.2.0 labels Mar 9, 2022

pmuellr marked this pull request as ready for review March 9, 2022 17:22

pmuellr requested review from a team as code owners March 9, 2022 17:22

spalger reviewed Mar 9, 2022

spalger approved these changes Mar 9, 2022

ymao1 approved these changes Mar 10, 2022

Merge branch 'main' into alerting/event-loop-block

3ed8031

Merge branch 'main' into alerting/event-loop-block

ce34e0d

mikecote approved these changes Mar 21, 2022

kibanamachine and others added 2 commits March 22, 2022 09:28

Merge branch 'main' into alerting/event-loop-block

1167b41

change warn_on_delay to warn_threshold, add taskType tag to log; from…

cbffd27

… PR review

pmuellr added the v8.1.2 label Mar 23, 2022

pmuellr added the v7.17.2 label Mar 23, 2022

pmuellr merged commit b028cf9 into elastic:main Mar 23, 2022

pmuellr mentioned this pull request Mar 23, 2022

[8.1] [ResponseOps][task manager] log event loop delay for tasks when over configured limit (#126300) #128377

Merged

pmuellr mentioned this pull request Mar 23, 2022

[7.17] [ResponseOps][task manager] log event loop delay for tasks when over configured limit (#126300) #128402

Merged

pmuellr added the backported label Mar 30, 2022

tylersmalley added the ci:cloud-deploy label and removed the ci:deploy-cloud label Aug 17, 2022
