[Stack Monitoring] More detailed event loop diagnostics #134452

rudolf · 2022-06-15T12:21:38Z

There are two ways to get more details about Kibana's event loop:

Enable debug level logs from the metrics.ops logger
The metrics.ops logger will log the following fields:

 "process": {
     "uptime": 992,
     "memory": {
         "heap": {
             "usedInBytes": 1110837232
         }
     },
     "eventLoopDelay": null,
     "eventLoopDelayHistogram": {
         "50": 0,
         "95": 0,
         "99": 0
     },
     "pid": 21480
 },
 "host": {
     "os": {
         "load": {
             "1m": 4.28515625,
             "5m": 3.58837890625,
             "15m": 3.33642578125
         }
     }
 }

Using stack monitoring

    "process": {
       "uptime": {
         "ms": 60006745
       },
       "event_loop_delay": {
         "ms": 10.099492275303643
       },
       "memory": {
         "resident_set_size": {
           "bytes": 391475200
         },
         "heap": {
           "size_limit": {
             "bytes": 851443712
           },
           "total": {
             "bytes": 277549056
           },
           "used": {
             "bytes": 227331088
           }
         }
       }
     },

It would be useful if both diagnostics contained all the values in the IntervalHistogram type:

export interface IntervalHistogram {
  // The first timestamp the interval timer kicked in for collecting data points.
  fromTimestamp: string;
  // Last timestamp the interval timer kicked in for collecting data points.
  lastUpdatedAt: string;
  // The minimum recorded event loop delay.
  min: number;
  // The maximum recorded event loop delay.
  max: number;
  // The mean of the recorded event loop delays.
  mean: number;
  // The standard deviation of the recorded event loop delays.
  stddev: number;
  // An object detailing the accumulated percentile distribution.
  percentiles: {
    // 50th percentile of delays of the collected data points.
    50: number;
    // 75th percentile of delays of the collected data points.
    75: number;
    // 95th percentile of delays of the collected data points.
    95: number;
    // 99th percentile of delays of the collected data points.
    99: number;
  };
}
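As a rough illustration of where these values could come from: Node.js exposes `monitorEventLoopDelay()` in `perf_hooks`, and the histogram it returns reports `min`, `max`, `mean`, `stddev` and `percentile(p)` in nanoseconds. The sketch below maps such a histogram onto the `IntervalHistogram` shape. `HistogramSource`, `toIntervalHistogram` and `nsToMs` are hypothetical names introduced for this example, and converting to milliseconds is an assumption rather than a statement about what Kibana does.

```typescript
// Shape from the issue description above.
interface IntervalHistogram {
  fromTimestamp: string;
  lastUpdatedAt: string;
  min: number;
  max: number;
  mean: number;
  stddev: number;
  percentiles: { 50: number; 75: number; 95: number; 99: number };
}

// Hypothetical structural type: the subset of a perf_hooks histogram used here.
interface HistogramSource {
  min: number;
  max: number;
  mean: number;
  stddev: number;
  percentile(p: number): number;
}

// Assumption: source values are nanoseconds and we report milliseconds.
const nsToMs = (ns: number): number => ns / 1e6;

// Take a snapshot of the histogram in the IntervalHistogram shape.
function toIntervalHistogram(
  source: HistogramSource,
  from: Date,
  now: Date
): IntervalHistogram {
  return {
    fromTimestamp: from.toISOString(),
    lastUpdatedAt: now.toISOString(),
    min: nsToMs(source.min),
    max: nsToMs(source.max),
    mean: nsToMs(source.mean),
    stddev: nsToMs(source.stddev),
    percentiles: {
      50: nsToMs(source.percentile(50)),
      75: nsToMs(source.percentile(75)),
      95: nsToMs(source.percentile(95)),
      99: nsToMs(source.percentile(99)),
    },
  };
}
```

Because `HistogramSource` is structural, the object returned by `monitorEventLoopDelay()` (after calling `enable()`) could be passed in directly, and a plain object can stand in for it in tests.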


elasticmachine · 2022-06-15T12:21:40Z

Pinging @elastic/kibana-core (Team:Core)

smith · 2022-06-20T13:14:51Z

Putting the "Platform Observability" label on this. When we decide to add this, it may be a metric from platform observability or added to the existing Stack Monitoring app.

rudolf · 2023-03-02T13:03:01Z

Specifically for testing performance optimisations, the maximum recorded event loop delay is often more useful than the mean: a single request causing an event loop delay of more than 1s is a problem even if the mean over 5s is much lower.
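A small sketch of that effect, using made-up sample values rather than real measurements: one long stall among many short delays barely moves the mean but is obvious in the max. `summarizeDelays` is a hypothetical helper name.

```typescript
// Compute the mean and max of a non-empty list of event loop delays (ms).
function summarizeDelays(delaysMs: number[]): { mean: number; max: number } {
  const total = delaysMs.reduce((sum, d) => sum + d, 0);
  return {
    mean: total / delaysMs.length,
    max: Math.max(...delaysMs),
  };
}

// Illustrative samples only: 499 delays of 2 ms and a single 1200 ms stall.
const quickTicks: number[] = new Array<number>(499).fill(2);
const sampleDelays: number[] = [...quickTicks, 1200];

const summary = summarizeDelays(sampleDelays);
// mean = (499 * 2 + 1200) / 500 = 4.396 ms, while max = 1200 ms
console.log(summary);
```

With these numbers the mean stays under 5 ms, so a dashboard based on the mean would not show the stall at all.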

gsoldevila · 2024-02-15T14:25:56Z

There does not seem to be an easy way to pinpoint the line in the code that blocked the event loop.

However, we can use our existing monitoring infrastructure to try to correlate event loop delays with ongoing / recent requests.

We can leverage the information stored in our overview cluster:

On the one hand, we log event loop delays to the serverless.metrics-* indices.
On the other hand, we log all requests to Kibana in our serverless-logging-*:logs-proxy* indices.

Thus, whenever we detect a substantial event loop delay on a given project, we can search the proxy logs and list the requests that were taking place around that time. If Kibana has been blocked for, say, 10 seconds, there must be a request that took at least 10 seconds to respond. Whilst this does not directly pinpoint the line in the code that caused the delay, it is a good starting point for investigating and dispatching to the right team.
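The correlation described above could be sketched roughly as follows. The record shapes (`DelaySpike`, `ProxyRequest`) and the function `candidateRequests` are hypothetical and do not reflect the actual index mappings. The sketch also assumes that the spike timestamp marks the end of the stall, with an optional `slackMs` tolerance because metric timestamps are unlikely to line up exactly with request timestamps.

```typescript
// Hypothetical record shapes; the real index fields will differ.
interface DelaySpike {
  timestampMs: number; // assumed to mark the end of the stall
  delayMs: number;
}

interface ProxyRequest {
  path: string;
  startMs: number;
  durationMs: number;
}

// A request is a candidate if it took at least as long as the stall and was
// in flight for the whole blocked window (within slackMs on either side).
function candidateRequests(
  spike: DelaySpike,
  requests: ProxyRequest[],
  slackMs = 0
): ProxyRequest[] {
  const blockedFrom = spike.timestampMs - spike.delayMs;
  const blockedTo = spike.timestampMs;
  return requests
    .filter((r) => r.durationMs >= spike.delayMs)
    .filter(
      (r) =>
        r.startMs <= blockedFrom + slackMs &&
        r.startMs + r.durationMs >= blockedTo - slackMs
    )
    .sort((a, b) => b.durationMs - a.durationMs); // slowest first
}
```

The result is a list of suspects rather than a verdict: any request that was merely waiting behind the blocked event loop will also appear slow.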

This is the goal of the newly introduced [Serverless] Event Loop Delays dashboard (see PR).

Also, in line with recent discussions, and with Rudolf's last comment, I am updating the kibana.stats.process.event_loop_delay.ms reported by Kibana on /api/status, changing it from mean to max.

## Summary Part of #134452 By using `mean` we're missing out on relevant spikes in event loop delays.


dgieselaar · 2024-09-02T07:41:32Z

Want to flag that we can already correlate event loop delay with requests via server.eluMonitor.enabled: true and server.eluMonitor.logging.enabled: false. It's logged and/or added as a label to APM transactions. See https://github.com/elastic/kibana/blob/main/packages/core/http/core-http-server-internal/src/http_server.ts

rudolf added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc performance Feature:Stack Monitoring labels Jun 15, 2022

rudolf added the Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. label Jun 15, 2022

smith added the Platform Observability Platform Observability WG issues https://github.com/elastic/observability-dev/issues/2055 label Jun 20, 2022

pmeresanu85 changed the title ~~More detailed event loop diagnostics~~ [Stack Monitoring] More detailed event loop diagnostics Sep 8, 2022

rudolf mentioned this issue Nov 10, 2023

[Stack monitoring] log more detailed memory information #171060

Closed

gsoldevila self-assigned this Feb 6, 2024

gsoldevila mentioned this issue Feb 15, 2024

Update exposed event_loop_delay from mean to max. #177019

Merged

gsoldevila added a commit that referenced this issue Feb 19, 2024

Update exposed event_loop_delay from mean to max. (#177019)

912dd7c


fkanout pushed a commit to fkanout/kibana that referenced this issue Mar 4, 2024

Update exposed event_loop_delay from mean to max. (elastic#177019)

dabad48
