
[Infra Monitoring UI] Uptime for Pod/Container in Metrics table is wrong #136047

Closed
Tracked by #115235
graphaelli opened this issue Jul 8, 2022 · 24 comments
Labels
bug, Feature:Metrics UI, needs-refinement, Team:Infra Monitoring UI - DEPRECATED

Comments

@graphaelli
Member

graphaelli commented Jul 8, 2022

Stack version: 8.4.0-SNAPSHOT

The infrastructure view in the APM UI lists container metrics for a service like this:

[image: container metrics table in the APM infrastructure view]

The first issue with uptime here is that containers started within the timepicker window seem to be assumed to still be running. For example, container ebc4507ff1bd2ff800029c347f07d2b18d7f54179f2203d75b8e1cc7708c4078 was indeed started 20 hours before this page was loaded, but it terminated 10 seconds later, entered crash loop backoff for 5 minutes, and was never heard from again. We know this because the metrics captured it:

Query and results:

GET metrics-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-24h"
            }
          }
        },
        {
          "term": {
            "container.id": "ebc4507ff1bd2ff800029c347f07d2b18d7f54179f2203d75b8e1cc7708c4078"
          }
        },
        {
          "terms": {
            "data_stream.dataset": [
              "kubernetes.state_container"
            ]
          }
        }
      ]
    }
  },
  "size": 100,
  "_source": ["@timestamp", "kubernetes.container.start_time", "kubernetes.container.status"],
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}

returns

{
  "took": 1037,
  "timed_out": false,
  "_shards": {
    "total": 876,
    "successful": 876,
    "skipped": 864,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 31,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": ".ds-metrics-kubernetes.state_container-default-2022.07.05-000022",
        "_id": "T5Qg24EB_46CsX7rTtbZ",
        "_score": null,
        "_source": {
          "kubernetes": {
            "container": {
              "status": {
                "phase": "waiting",
                "reason": "CrashLoopBackOff",
                "ready": false,
                "restarts": 1271
              }
            }
          },
          "@timestamp": "2022-07-08T00:05:35.460Z"
        },
        "sort": [
          1657238735460
        ]
      },

... snip reports every 10 seconds ...

      {
        "_index": ".ds-metrics-kubernetes.state_container-default-2022.07.05-000022",
        "_id": "8pMc24EB_46CsX7rB3j_",
        "_score": null,
        "_source": {
          "kubernetes": {
            "container": {
              "status": {
                "phase": "waiting",
                "reason": "CrashLoopBackOff",
                "ready": false,
                "restarts": 1271
              }
            }
          },
          "@timestamp": "2022-07-08T00:00:55.460Z"
        },
        "sort": [
          1657238455460
        ]
      },
      {
        "_index": ".ds-metrics-kubernetes.state_container-default-2022.07.05-000022",
        "_id": "W5Mb24EB_46CsX7r4HTw",
        "_score": null,
        "_source": {
          "kubernetes": {
            "container": {
              "status": {
                "phase": "terminated",
                "reason": "Error",
                "ready": false,
                "restarts": 1271
              }
            }
          },
          "@timestamp": "2022-07-08T00:00:45.461Z"
        },
        "sort": [
          1657238445461
        ]
      },
      {
        "_index": ".ds-metrics-kubernetes.state_container-default-2022.07.05-000022",
        "_id": "xJMb24EB_46CsX7ruWzf",
        "_score": null,
        "_source": {
          "kubernetes": {
            "container": {
              "status": {
                "phase": "running",
                "ready": true,
                "restarts": 1271
              }
            }
          },
          "@timestamp": "2022-07-08T00:00:35.460Z"
        },
        "sort": [
          1657238435460
        ]
      }
    ]
  }
}

The second issue, also illustrated here, affects containers that are still running: the uptime displayed is relative to now, rather than to the times selected in the date picker. @alex-fedotyev inspired this nice visual to demonstrate the issue:

[image: diagram contrasting uptime measured to now vs. to the end of the selected time range]

@graphaelli added the bug and Team:APM labels Jul 8, 2022
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@dannycroft added the Team:Infra Monitoring UI label Jul 11, 2022
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@miltonhultgren
Contributor

I kind of expected this. We discussed it in the original ticket, but no one had a clear answer and we had so many other open questions that we just went with kubernetes.container/pod.start_time and never looked back.

The second issue seems easy enough to fix: rather than using Date.now(), we should use the end of the time range to calculate how long the container had been running from its start time until the end of that range.
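
For illustration, a minimal sketch of that adjustment, assuming the start time and range end are available as epoch milliseconds (the function name is hypothetical, not actual Kibana code):

// Hypothetical sketch: measure uptime against the end of the selected
// time range instead of Date.now().
function uptimeAtRangeEnd(startTimeMs: number, rangeEndMs: number): number {
  // A container started after the range end has no uptime within the range.
  return Math.max(0, rangeEndMs - startTimeMs);
}

// e.g. a container started at 14:22 in a window ending at 15:00 yields
// 38 minutes, instead of the larger value you get by measuring against now.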

The first issue is trickier, for several reasons, but let's see what options we might have.

Quick "fix":
Is it acceptable to hide the column for the GA release ? It reduces the overall value but at least it'll be only correct data shown.
We could look to replace it with some other relevant (and easier to compute) metric, like I requested here.
@alex-fedotyev @mlunadia

Real fix:
For all containers/pods returned by our metrics API, we would need to make another query to fetch the documents needed to determine the uptime and join that with the data from the original API call (there are some limitations to the currently used API).

What does the algorithm to actually determine the uptime look like? It's not clear to me which documents should be involved to make this as easy as possible.

Long fix:
This highlights again the shortcomings of the API we're using, as outlined partly in this issue and the comment I linked above (we cannot use keywords and can only compute metric aggregations).

For performance reasons, but also to gain flexibility in the data we show, we might want to build a dedicated API for this component.
That said, I imagine that calculating the uptime of Kubernetes resources is a common problem we'll face in the future, so I don't know where to place such logic.

Going further, thinking about a dashboard-driven future, how would a dashboard display such information?
Is it possible to alter the Kubernetes integration to report this data in a different way that would make it easier to compute, for example in a Lens visualization?
@jasonrhodes

@mlunadia

mlunadia commented Jul 12, 2022

Quick "fix":
Is it acceptable to hide the column for the GA release ? It reduces the overall value but at least it'll be only correct data shown.
We could look to replace it with some other relevant (and easier to compute) metric, like I requested #133123 (comment).

@alex-fedotyev let's sync on this. Do you have a meta issue to help me understand the reasoning behind this one? I'd like more context on who we are solving for and which problems, so that I can form a more informed opinion.

@miltonhultgren changed the title [APM] Infrastructure uptime for k8s container incorrect → [Infrastructure Monitoring UI] Uptime for Pod/Container in Metrics table is wrong Jul 19, 2022
@miltonhultgren removed the Team:APM label Jul 19, 2022
@miltonhultgren
Contributor

miltonhultgren commented Jul 19, 2022

As a stopgap we're going to hide this column for now and follow up on this issue with either a real fix or some other metrics.

#136596

For the follow-up, it would be good if we could use metrics that are easy to consume; if we want to display uptime, we should look to extend the Kubernetes module to collect it in a different way (similar to how the system module does).
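
For comparison, the Metricbeat system module reports uptime directly as system.uptime.duration.ms. A hypothetical analogous field for containers (not an existing field, purely illustrative) might look like:

{
  "@timestamp": "2022-07-08T00:05:35.460Z",
  "kubernetes": {
    "container": {
      "uptime": {
        "duration": { "ms": 1140000 }
      }
    }
  }
}

A field like that would let the UI (or a Lens visualization) show uptime with a plain latest/max aggregation instead of reconstructing it from status documents.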

@miltonhultgren changed the title [Infrastructure Monitoring UI] Uptime for Pod/Container in Metrics table is wrong → [Infra Monitoring UI] Uptime for Pod/Container in Metrics table is wrong Jul 19, 2022
miltonhultgren added a commit that referenced this issue Jul 19, 2022
* [Infra] Hide Uptime column in Pod/Container metrics table (#136047)

* Remove code instead of commenting it out

* Fully remove uptime related code
@pmeresanu85

@miltonhultgren @smith I moved it to "in progress" judging by the latest commit reference.

@matschaffer
Contributor

@pmeresanu85 do you know who has this in progress at the moment? I'm having trouble finding anything that looks like active work on this issue.

@pmeresanu85

@matschaffer I was referring to the previous pull request from the 19th of July, when I set this issue to "in progress". Is there a reason why this has been put on hold? Is it complete?

@miltonhultgren
Contributor

The previous PR just hid the column while we figure out what to actually show in the column and how to calculate the actual uptime.

@pmeresanu85

@miltonhultgren I assume this means we are actively working on this, based on your statement.

@miltonhultgren
Contributor

We = product people, yes. No engineer is working on this, but it ties into the larger discussion around what we're going to do with our multiple tables.

@matschaffer
Contributor

Moving back to “ready”

@miltonhultgren
Contributor

I'll put this back into refining since there isn't a clear choice of what to do.

@pmeresanu85

@miltonhultgren I assume that by "unclear what to do" you mean it still needs some reflection on the solution approach.

@miltonhultgren
Contributor

Yes, I see three options:

  1. Invest the time into actually calculating the real uptime (requires reading the right set of documents and tracing the state transitions to get the actual uptime, then joining that data into the other metrics rows; see the sketch below)
  2. Keep it hidden and forget about it (for now; maybe we can follow a track to move this computation to the agent/ingest side to make the future easier)
  3. Replace it with some other meaningful metric that is easier to compute with what we have today
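
A hedged sketch of option 1, assuming the document shape from the kubernetes.state_container results above (the helper itself is hypothetical, not existing Kibana code):

// Hypothetical sketch: walk one container's status documents in time order
// and sum the spans where the phase was "running". Assumes status reports
// keep arriving at the metricset period (every 10s in the results above),
// even after the container terminates.
interface StatusDoc {
  timestampMs: number; // parsed from @timestamp
  phase: 'running' | 'waiting' | 'terminated';
}

function realUptimeMs(docs: StatusDoc[], rangeEndMs: number): number {
  const sorted = [...docs].sort((a, b) => a.timestampMs - b.timestampMs);
  let uptimeMs = 0;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].phase !== 'running') continue;
    // Credit running time until the next status report; for the last
    // report, credit up to the end of the selected time range.
    const nextMs = i + 1 < sorted.length ? sorted[i + 1].timestampMs : rangeEndMs;
    uptimeMs += nextMs - sorted[i].timestampMs;
  }
  return uptimeMs;
}

The result would then have to be joined, per container/pod, with the rows from the metrics API (the "real fix" above).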

From my view, this is a decision to be made by PMs within APM.

@pmeresanu85

@miltonhultgren thank you for sharing your thoughts! Let me take this into the 1:1 with @alex-fedotyev and see how to proceed on this topic.

I will update this issue after the 1:1

@pmeresanu85

@miltonhultgren after my 1:1 with @alex-fedotyev about the Infra view in APM, we agreed that:

  • Infra drives Host Centric Observability View evolution
  • APM takes what infra builds and integrates this into APM service view

As for the uptime fix, I believe we should invest in actually calculating and showing the real uptime.

@sorenlouv
Member

sorenlouv commented Sep 19, 2022

Infra drives Host Centric Observability View evolution

Sounds good to me 👍
When we discussed this last, we decided that infra would fall back to APM-collected data if a certain metric was not available in the infra metric indices. To query APM data, you should use the APM Event client:

const apmEventClient = createApmEventClient({ request, context });

More info about apmEventClient: #82724

APM takes what infra builds and integrates this into APM service view

Will infra make React components available, or APIs? I prefer React components.

@pmeresanu85

@sqren I prefer React components too, but I'll leave it up to the team to decide what is best.

@smith added the needs-refinement label Sep 20, 2022
@smith
Contributor

smith commented Sep 20, 2022

At this time, putting the uptime back on the table in APM is not a huge priority. Removed from "Refining" on our board and added the needs-refinement label.

We probably want to see if this can be handled closer to ingest time, since calculating these on the client could be problematic.

The next best times to revisit this might be when we're adding container-oriented views to host observability, or when APM is making updates to the metrics tab and changes to how infrastructure data is displayed.

@jasonrhodes
Member

@miltonhultgren / @neptunian when you have time (no rush), can one or both of you walk me through what's missing in the current documents to calculate the uptime we want to see? Based on @alex-fedotyev's diagram in the issue description, it looks like we need "stop time" for both scenarios.

I see that @smith has de-prioritized this so don't take time away from other things to consider this, but I want to wrap my head around whether this is something we expect to be able to answer from signal documents or if it's something we should be considering from an inventory perspective, or both.

Anyway, the idea: based on the query parameters, is there a way to look at the bucket count for the documents returned for each node and compare it to what we'd expect, given how far the start time is from the beginning of the time window? For example:

Pod data metricset.period: 5 min
Current time: 15:35
Selected time range: 1 hour (14:00 - 15:00)
Start time: 14:22
Actual stop time: 14:41
Real uptime: 19 min

We'd currently calculate that uptime as 73 min, as I understand it (from start time to now, disregarding the time range end).
A small adjustment to take the time range end into account would calculate it as 38 minutes: better, but still wrong.

Could we do something like:

  • 38 minutes, calculated above, is max_uptime
  • Start time is :22, so we expect a max of 7 buckets (we'd normally expect 12 buckets for an hour, but we know we won't have :00, :05, :10, :15, or :20).
  • Query returns that we have 4 buckets for this node (:25, :30, :35, and :40)
  • missing_buckets = 7 - 4 = 3
  • Calculate uptime as a range of max_uptime - ((missing_buckets + 1) * interval) to max_uptime - (missing_buckets * interval) or 18-23 min?

The main thing I'm not sure of is whether the queries we currently run already return the number of buckets found within the given range, and whether we can easily compare that to the number of buckets we'd expect in that range based on the metricset.period. If those aren't possible/simple, then this is likely too complicated to attempt.
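
For concreteness, a sketch of that estimate using the numbers above (all names are illustrative, not existing Kibana code):

// Hypothetical sketch of the bucket-counting estimate. All durations are
// in minutes, to match the worked example.
interface UptimeEstimate {
  minMinutes: number;
  maxMinutes: number;
}

function estimateUptime(
  startOffsetMin: number, // container start relative to range start (22)
  rangeMin: number,       // length of the selected range (60)
  periodMin: number,      // metricset.period (5)
  bucketsFound: number    // buckets the query actually returned (4)
): UptimeEstimate {
  const maxUptime = rangeMin - startOffsetMin; // 38 min
  // Buckets possible after the start time: floor(38 / 5) = 7 (:25 .. :55).
  const expectedBuckets = Math.floor(maxUptime / periodMin);
  const missingBuckets = expectedBuckets - bucketsFound; // 7 - 4 = 3
  return {
    minMinutes: maxUptime - (missingBuckets + 1) * periodMin, // 18
    maxMinutes: maxUptime - missingBuckets * periodMin,       // 23
  };
}

// estimateUptime(22, 60, 5, 4) -> { minMinutes: 18, maxMinutes: 23 },
// bracketing the real 19-minute uptime from the example.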

@miltonhultgren added the Feature:Metrics UI label Jun 23, 2023
@roshan-elastic

Closing. Uptime is being deprecated, I believe @drewpost?

@smith
Contributor

smith commented Jul 11, 2023

@roshan-elastic this isn't talking about the Uptime app but about how long a container or pod has been "up".

We removed that column from the table and I don't think the underlying issue has been solved.

Since we're not showing these values in the UI at this time, I don't think we need to solve this, so it's okay for it to remain closed.

@smith closed this as not planned Jul 11, 2023
@roshan-elastic

Ah, thanks for the clarification @smith. I didn't realise.
