Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ResponseOps][task manager] log event loop delay for tasks when over configured limit #126300

Merged
merged 5 commits into from
Mar 23, 2022

Conversation

pmuellr
Copy link
Member

@pmuellr pmuellr commented Feb 23, 2022

Summary

resolves #124366

Adds new task manager configuration keys:

  • xpack.task_manager.event_loop_delay.monitor - whether to monitor event loop delay or not; added in case this specific monitoring causes other issues and we'd want to disable it. We don't know of any cases where we'd need this today.

  • xpack.task_manager.event_loop_delay.warn_threshold - the number of milliseconds of event loop delay before logging a warning

This code uses the perf_hooks.monitorEventLoopDelay() API [1] to collect the event loop delay while a task is running.

[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions

When a significant event loop delay is encountered, it's very likely that other tasks running at the same time will be affected, and so will also end up having a long event loop delay value, and warnings will be logged on those. Over time, though, tasks which have consistently long event loop delays will outnumber those unfortunate peer tasks, and be obvious from the volume in the logs.

The warning messages look like this:

event loop blocked for at least 6094 ms while running task alerting:.index-threshold af9facf0-9512-11ec-aaab-571072c92061

To make it a bit easier to find these when viewing Kibana logs in Discover, tags are added to the logged messages to make it easier to find them. One tag is event-loop-blocked, and the other tag added is a string consisting of the task type and task id. For example:

alerting:.index-threshold af9facf0-9512-11ec-aaab-571072c92061

Viewing Kibana logs with Discover, it's easy to narrow down frequent tasks by exploring the tags field, after filtering on tags:event-loop-blocked:

image

The alerting:.index-threshold af9fa... task at 41% is occurring more frequently than the next alerting: task, so it's probably the event loop delay offender. You can click on the "Visualize" button to get a better view in Lens.

The initial Lens view shows that task significantly larger than the next one:

image

But if we customize the top-N to show 20, it's more clear this task seems to be problematic.

image

Checklist

Delete any items that are not applicable to this PR.

  • Documentation was added for features that require explanation or tutorials
  • Unit or functional tests were updated or added to match the most common scenarios
  • If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list

@pmuellr
Copy link
Member Author

pmuellr commented Mar 4, 2022

One of the concerns I have with the current approach, is that we are using monitorEventLoopDelay() wrapped on every task execution, and it's not completely clear to me that this won't cause performance issues itself.

However, there's a fair bit of anecdotal evidence that says we won't:

So, have good feels about this, now. And we will have a way to "turn it off" anyway, by setting a "warning limit" to a non-positive value specific config value to turn it off. (edit by pmuellr; changed from using a magic value on the warning level to a separate binary config value)

@pmuellr pmuellr changed the title [ResponseOps] collect / publish event loop delay information [ResponseOps][task manager] log event loop delay for tasks when over configured limit Mar 8, 2022
… limit

resolves elastic#124366

Adds new task manager configuration keys.

- `xpack.task_manager.event_loop_delay.monitor` - whether to monitor
  event loop delay or not; added in case this specific monitoring
  causes other issues and we'd want to disable it.  We don't know
  of any cases where we'd need this today

- `xpack.task_manager.event_loop_delay.warn_on_delay` - the number
  of milliseconds of event loop delay before logging a warning

This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect
the event loop delay while a task is running.

[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions

When a significant event loop delay is encountered, it's very likely
that other tasks running at the same time will be affected, and so
will also end up having a long event loop delay value, and warnings
will be logged on those.  Over time, though, tasks which have consistently
long event loop delays will outnumber those unfortunate peer tasks, and
be obvious from the volume in the logs.

To make it a bit easier to find these when viewing Kibana logs in Discover,
tags are added to the logged messages to make it easier to find them.  One
tag is `event-loop-blocked`, and the other tag added is a string consisting
of the task type and task id.
@pmuellr pmuellr added Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.2.0 labels Mar 9, 2022
@pmuellr pmuellr marked this pull request as ready for review March 9, 2022 17:22
@pmuellr pmuellr requested review from a team as code owners March 9, 2022 17:22
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Copy link
Member Author

pmuellr commented Mar 9, 2022

wondering where we want to backport this to. Guessing the backport may not be that bad - 8.1.x? 7.x?

Copy link
Contributor

@spalger spalger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker config update LGTM

@kobelb
Copy link
Contributor

kobelb commented Mar 9, 2022

wondering where we want to backport this to. Guessing the backport may not be that bad - 8.1.x? 7.x?

If we're confident this won't introduce performance degradations, let's backport to 7.x. We continue to have SDHs for 7.x where this would be super helpful.

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Verified I was able to see the warnings logged by setting the threshold very low. Nice work

@pmuellr
Copy link
Member Author

pmuellr commented Mar 10, 2022

@elasticmachine merge upstream

@pmuellr
Copy link
Member Author

pmuellr commented Mar 14, 2022

@elasticmachine merge upstream

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM! Had one nit and question 👍

`xpack.task_manager.event_loop_delay.monitor`::
Enables event loop delay monitoring, which will log a warning when a task causes an event loop delay which exceeds the `warn_on_delay` setting. Defaults to true.

`xpack.task_manager.event_loop_delay.warn_on_delay`::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: warn_on_delay made me think the value is a boolean. Would something like warn_threshold make it clearer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

names! Ya, don't like the one I picked. I think I considered threshold, but seemed weird thinking about time being the threshold value. Or maybe that was back before I split the config into two keys and the resulting key name was SOOO long. Let me give it a go, I suspect I will be happy with it now ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed in commit cbffd27 to warn_threshold

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

this.logger.warn(
`event loop blocked for at least ${eventLoopBlockMs} ms while running task ${taskLabel}`,
{
tags: [taskLabel, 'event-loop-blocked'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: could we also add taskType? This should allow us to break it down by rule type indirectly I think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, taskType as an additional tag value? Ya, that could be useful. It's possible it could be noisy. Let me give it a go, but guessing it will be fine (and useful!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking just as a `tag 👍 I'm not sure of the downsides, if any.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed in cbffd27 to add the taskType as well.

I think we'll have to see how well this works out in practice. The problem I think is going to be the extra noise of the tasktype tags, but I suspect it won't be too bad. Already when you look at the distribution of tags values, after filtering on tags:event-loop-blocked, you have to mentally "ignore" the three that are always there: kibana, logger, and event-loop-blocked - now we'll have the task types as well. But presumably would only be a handful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks!

@pmuellr
Copy link
Member Author

pmuellr commented Mar 22, 2022

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id before after diff
taskManager 19 20 +1

Total ESLint disabled count

id before after diff
taskManager 19 20 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@pmuellr pmuellr added the v8.1.2 label Mar 23, 2022
@pmuellr pmuellr merged commit b028cf9 into elastic:main Mar 23, 2022
pmuellr added a commit to pmuellr/kibana that referenced this pull request Mar 23, 2022
…configured limit (elastic#126300)

resolves elastic#124366

Adds new task manager configuration keys.

- `xpack.task_manager.event_loop_delay.monitor` - whether to monitor
  event loop delay or not; added in case this specific monitoring
  causes other issues and we'd want to disable it.  We don't know
  of any cases where we'd need this today

- `xpack.task_manager.event_loop_delay.warn_threshold` - the number
  of milliseconds of event loop delay before logging a warning

This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect
the event loop delay while a task is running.

[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions

When a significant event loop delay is encountered, it's very likely
that other tasks running at the same time will be affected, and so
will also end up having a long event loop delay value, and warnings
will be logged on those.  Over time, though, tasks which have consistently
long event loop delays will outnumber those unfortunate peer tasks, and
be obvious from the volume in the logs.

To make it a bit easier to find these when viewing Kibana logs in Discover,
tags are added to the logged messages to make it easier to find them.  One
tag is `event-loop-blocked`, second is the task type, and the third is a string
consisting of the task type and task id.

(cherry picked from commit b028cf9)
patrykkopycinski pushed a commit to patrykkopycinski/kibana that referenced this pull request Mar 23, 2022
…configured limit (elastic#126300)

resolves elastic#124366

Adds new task manager configuration keys.

- `xpack.task_manager.event_loop_delay.monitor` - whether to monitor
  event loop delay or not; added in case this specific monitoring
  causes other issues and we'd want to disable it.  We don't know
  of any cases where we'd need this today

- `xpack.task_manager.event_loop_delay.warn_threshold` - the number
  of milliseconds of event loop delay before logging a warning

This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect
the event loop delay while a task is running.

[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions

When a significant event loop delay is encountered, it's very likely
that other tasks running at the same time will be affected, and so
will also end up having a long event loop delay value, and warnings
will be logged on those.  Over time, though, tasks which have consistently
long event loop delays will outnumber those unfortunate peer tasks, and
be obvious from the volume in the logs.

To make it a bit easier to find these when viewing Kibana logs in Discover,
tags are added to the logged messages to make it easier to find them.  One
tag is `event-loop-blocked`, second is the task type, and the third is a string 
consisting of the task type and task id.
pmuellr added a commit to pmuellr/kibana that referenced this pull request Mar 23, 2022
…configured limit (elastic#126300)

resolves elastic#124366

Adds new task manager configuration keys.

- `xpack.task_manager.event_loop_delay.monitor` - whether to monitor
  event loop delay or not; added in case this specific monitoring
  causes other issues and we'd want to disable it.  We don't know
  of any cases where we'd need this today

- `xpack.task_manager.event_loop_delay.warn_threshold` - the number
  of milliseconds of event loop delay before logging a warning

This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect
the event loop delay while a task is running.

[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions

When a significant event loop delay is encountered, it's very likely
that other tasks running at the same time will be affected, and so
will also end up having a long event loop delay value, and warnings
will be logged on those.  Over time, though, tasks which have consistently
long event loop delays will outnumber those unfortunate peer tasks, and
be obvious from the volume in the logs.

To make it a bit easier to find these when viewing Kibana logs in Discover,
tags are added to the logged messages to make it easier to find them.  One
tag is `event-loop-blocked`, second is the task type, and the third is a string
consisting of the task type and task id.

(cherry picked from commit b028cf9)

# Conflicts:
#	src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker
#	x-pack/plugins/task_manager/server/config.test.ts
#	x-pack/plugins/task_manager/server/polling_lifecycle.ts
#	x-pack/plugins/task_manager/server/task_running/task_runner.test.ts
#	x-pack/plugins/task_manager/server/task_running/task_runner.ts
pmuellr added a commit that referenced this pull request Mar 23, 2022
…configured limit (#126300) (#128377)

resolves #124366

Adds new task manager configuration keys.

- `xpack.task_manager.event_loop_delay.monitor` - whether to monitor
  event loop delay or not; added in case this specific monitoring
  causes other issues and we'd want to disable it.  We don't know
  of any cases where we'd need this today

- `xpack.task_manager.event_loop_delay.warn_threshold` - the number
  of milliseconds of event loop delay before logging a warning

This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect
the event loop delay while a task is running.

[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions

When a significant event loop delay is encountered, it's very likely
that other tasks running at the same time will be affected, and so
will also end up having a long event loop delay value, and warnings
will be logged on those.  Over time, though, tasks which have consistently
long event loop delays will outnumber those unfortunate peer tasks, and
be obvious from the volume in the logs.

To make it a bit easier to find these when viewing Kibana logs in Discover,
tags are added to the logged messages to make it easier to find them.  One
tag is `event-loop-blocked`, second is the task type, and the third is a string
consisting of the task type and task id.

(cherry picked from commit b028cf9)
pmuellr added a commit that referenced this pull request Mar 23, 2022
…n over configured limit (#126300) (#128402)

resolves #124366

Adds new task manager configuration keys.

- `xpack.task_manager.event_loop_delay.monitor` - whether to monitor
  event loop delay or not; added in case this specific monitoring
  causes other issues and we'd want to disable it.  We don't know
  of any cases where we'd need this today

- `xpack.task_manager.event_loop_delay.warn_threshold` - the number
  of milliseconds of event loop delay before logging a warning

This code uses the `perf_hooks.monitorEventLoopDelay()` API[1] to collect
the event loop delay while a task is running.

[1] https://nodejs.org/api/perf_hooks.html#perf_hooksmonitoreventloopdelayoptions

When a significant event loop delay is encountered, it's very likely
that other tasks running at the same time will be affected, and so
will also end up having a long event loop delay value, and warnings
will be logged on those.  Over time, though, tasks which have consistently
long event loop delays will outnumber those unfortunate peer tasks, and
be obvious from the volume in the logs.

To make it a bit easier to find these when viewing Kibana logs in Discover,
tags are added to the logged messages to make it easier to find them.  One
tag is `event-loop-blocked`, second is the task type, and the third is a string
consisting of the task type and task id.
@tylersmalley tylersmalley added ci:cloud-deploy Create or update a Cloud deployment and removed ci:deploy-cloud labels Aug 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported ci:cloud-deploy Create or update a Cloud deployment Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.17.2 v8.1.2 v8.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ResponseOps] research collecting / exporting event loop delay info
9 participants