[Stack Monitoring] More detailed event loop diagnostics #134452

Open
rudolf opened this issue Jun 15, 2022 · 5 comments
Labels
Feature:Stack Monitoring performance Platform Observability Platform Observability WG issues https://github.com/elastic/observability-dev/issues/2055 Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@rudolf
Contributor

rudolf commented Jun 15, 2022

There are two ways to get more details about Kibana's event loop:

  1. Enable debug level logs from the metrics.ops logger
    The metrics.ops logger will log the following fields:
     "process": {
         "uptime": 992,
         "memory": {
             "heap": {
                 "usedInBytes": 1110837232
             }
         },
         "eventLoopDelay": null,
         "eventLoopDelayHistogram": {
             "50": 0,
             "95": 0,
             "99": 0
         },
         "pid": 21480
     },
     "host": {
         "os": {
             "load": {
                 "1m": 4.28515625,
                 "5m": 3.58837890625,
                 "15m": 3.33642578125
             }
         }
     }
    
  2. Using stack monitoring
        "process": {
           "uptime": {
             "ms": 60006745
           },
           "event_loop_delay": {
             "ms": 10.099492275303643
           },
           "memory": {
             "resident_set_size": {
               "bytes": 391475200
             },
             "heap": {
               "size_limit": {
                 "bytes": 851443712
               },
               "total": {
                 "bytes": 277549056
               },
               "used": {
                 "bytes": 227331088
               }
             }
           }
         },
    

It would be useful if both diagnostics contained all the values in the IntervalHistogram type:

export interface IntervalHistogram {
  // The first timestamp the interval timer kicked in for collecting data points.
  fromTimestamp: string;
  // Last timestamp the interval timer kicked in for collecting data points.
  lastUpdatedAt: string;
  // The minimum recorded event loop delay.
  min: number;
  // The maximum recorded event loop delay.
  max: number;
  // The mean of the recorded event loop delays.
  mean: number;
  // The standard deviation of the recorded event loop delays.
  stddev: number;
  // An object detailing the accumulated percentile distribution.
  percentiles: {
    // 50th percentile of delays of the collected data points.
    50: number;
    // 75th percentile of delays of the collected data points.
    75: number;
    // 95th percentile of delays of the collected data points.
    95: number;
    // 99th percentile of delays of the collected data points.
    99: number;
  };
}
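As a rough illustration of how such a summary could be filled in, here is a minimal sketch. The `RawHistogram` type and the `toIntervalHistogram` helper are hypothetical names introduced for this example, and the nanosecond-to-millisecond conversion is an assumption; this is not Kibana's actual collector.

```typescript
// Hypothetical input shape: a raw delay histogram reporting nanoseconds.
interface RawHistogram {
  min: number;
  max: number;
  mean: number;
  stddev: number;
  percentile(p: number): number;
}

// Same shape as the IntervalHistogram type quoted above.
interface IntervalHistogram {
  fromTimestamp: string;
  lastUpdatedAt: string;
  min: number;
  max: number;
  mean: number;
  stddev: number;
  percentiles: { 50: number; 75: number; 95: number; 99: number };
}

// Assumption: the reported values should be in milliseconds.
const nsToMs = (ns: number): number => ns / 1e6;

// Hypothetical helper: snapshot a raw histogram into the summary shape.
function toIntervalHistogram(
  raw: RawHistogram,
  fromTimestamp: Date,
  lastUpdatedAt: Date
): IntervalHistogram {
  return {
    fromTimestamp: fromTimestamp.toISOString(),
    lastUpdatedAt: lastUpdatedAt.toISOString(),
    min: nsToMs(raw.min),
    max: nsToMs(raw.max),
    mean: nsToMs(raw.mean),
    stddev: nsToMs(raw.stddev),
    percentiles: {
      50: nsToMs(raw.percentile(50)),
      75: nsToMs(raw.percentile(75)),
      95: nsToMs(raw.percentile(95)),
      99: nsToMs(raw.percentile(99)),
    },
  };
}
```

The histogram returned by Node's `monitorEventLoopDelay()` (from `perf_hooks`) exposes `min`, `max`, `mean`, `stddev` and `percentile()` in nanoseconds, so it should fit the `RawHistogram` shape assumed here.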
@rudolf rudolf added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc performance Feature:Stack Monitoring labels Jun 15, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@rudolf rudolf added the Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. label Jun 15, 2022
@smith smith added the Platform Observability Platform Observability WG issues https://github.com/elastic/observability-dev/issues/2055 label Jun 20, 2022
@smith
Contributor

smith commented Jun 20, 2022

Putting the "Platform Observability" label on this. When we decide to add this, it may be a metric from platform observability or added to the existing Stack Monitoring app.

@pmeresanu85 pmeresanu85 changed the title More detailed event loop diagnostics [Stack Monitoring] More detailed event loop diagnostics Sep 8, 2022
@rudolf
Contributor Author

rudolf commented Mar 2, 2023

Specifically for testing performance optimisations, looking at the maximum recorded event loop delay is often more useful than the mean: a single request causing a > 1s event loop delay is a problem even if the mean over 5s is much lower.
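To make the mean-versus-max point concrete, here is a small sketch with made-up numbers. The sample values and the `summarize` helper are hypothetical and not taken from any real Kibana measurement.

```typescript
// Hypothetical helper: summarise a window of event loop delay samples (ms).
function summarize(delaysMs: number[]): { mean: number; max: number } {
  if (delaysMs.length === 0) {
    return { mean: 0, max: 0 };
  }
  const sum = delaysMs.reduce((acc, d) => acc + d, 0);
  return { mean: sum / delaysMs.length, max: Math.max(...delaysMs) };
}

// Made-up window: 499 healthy samples of 2 ms plus a single 1200 ms stall.
const samples: number[] = [...new Array<number>(499).fill(2), 1200];
const { mean, max } = summarize(samples);

console.log(`mean=${mean} ms, max=${max} ms`);
```

With these numbers the mean is 4.396 ms while the max is 1200 ms, so the stall is essentially invisible in the mean but obvious in the max.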

@gsoldevila
Contributor

There does not seem to be an easy way to pinpoint the line in the code that blocked the event loop.

However, we can use our existing monitoring infrastructure to try to correlate event loop delays with ongoing / recent requests.

We can leverage the information stored in our overview cluster:

  • On the one hand, we log event loop delays to the serverless.metrics-* indices.
  • On the other hand, we log all requests to Kibana in our serverless-logging-*:logs-proxy* indices.

Thus, whenever we detect a substantial event loop delay on a given project, we can search the proxy logs and list the requests that were taking place around that time. If Kibana has been blocked for e.g. 10 seconds, there must be at least one request that took 10 seconds or more to respond. Whilst this does not directly pinpoint the line in the code that caused the delay, it is a good starting point for investigating and dispatching to the right team.
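The correlation idea can be sketched as a simple interval-overlap filter. The `ProxyRequest` fields and the `requestsDuringBlock` function are hypothetical names chosen for illustration; they do not reflect the actual proxy log schema or the dashboard's queries.

```typescript
// Hypothetical, simplified view of a proxy log entry.
interface ProxyRequest {
  path: string;
  startMs: number; // epoch ms when the request started
  durationMs: number; // how long the request took to respond
}

// Return the requests that were in flight at some point during the blocked
// window, slowest first, as candidates for further investigation.
function requestsDuringBlock(
  requests: ProxyRequest[],
  blockStartMs: number,
  blockDurationMs: number
): ProxyRequest[] {
  const blockEndMs = blockStartMs + blockDurationMs;
  return requests
    .filter((r) => r.startMs <= blockEndMs && r.startMs + r.durationMs >= blockStartMs)
    .sort((a, b) => b.durationMs - a.durationMs);
}

// Made-up data: the event loop was blocked for 10 s starting at t=5000.
const candidates = requestsDuringBlock(
  [
    { path: "/api/a", startMs: 1000, durationMs: 200 },
    { path: "/api/b", startMs: 4000, durationMs: 12000 },
    { path: "/api/c", startMs: 14000, durationMs: 500 },
  ],
  5000,
  10000
);

console.log(candidates.map((r) => r.path));
```

Sorting by duration puts the requests that were pending across the whole blocked window near the top, which is usually where the investigation should start.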

This is the goal of the newly introduced [Serverless] Event Loop Delays dashboard (see PR).

Also, in line with recent discussions, and with Rudolf's last comment, I am updating the kibana.stats.process.event_loop_delay.ms reported by Kibana on /api/status, changing it from mean to max.

gsoldevila added a commit that referenced this issue Feb 19, 2024
## Summary

Part of #134452

By using `mean` we're missing out on relevant spikes in event loop
delays.
@dgieselaar
Member

Want to flag that we can already correlate event loop delay with requests via server.eluMonitor.enabled: true and server.eluMonitor.logging.enabled: false. It's logged and/or added as a label to APM transactions. See https://github.com/elastic/kibana/blob/main/packages/core/http/core-http-server-internal/src/http_server.ts
