
[Infra Monitoring UI] Uptime for Pod/Container in Metrics table is wrong #136047

Closed
Tracked by #115235
graphaelli opened this issue Jul 8, 2022 · 24 comments
Labels
bug, Feature:Metrics UI, needs-refinement, Team:Infra Monitoring UI - DEPRECATED

Comments

@graphaelli
Member

graphaelli commented Jul 8, 2022

Stack version: 8.4.0-SNAPSHOT

The infrastructure view in the APM UI lists container metrics for a service like this:

[image: container metrics table in the APM infrastructure view]

The first issue with uptime here is that containers started within the timepicker window seem to be assumed to still be running. For example, container ebc4507ff1bd2ff800029c347f07d2b18d7f54179f2203d75b8e1cc7708c4078 was indeed started 20 hours before this page was loaded, but it terminated 10 seconds later, entered crash loop backoff for 5 minutes, and was never heard from again. We know this because the metrics captured it:

Query and results:

GET metrics-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-24h"
            }
          }
        },
        {
          "term": {
            "container.id": "ebc4507ff1bd2ff800029c347f07d2b18d7f54179f2203d75b8e1cc7708c4078"
          }
        },
        {
          "terms": {
            "data_stream.dataset": [
              "kubernetes.state_container"
            ]
          }
        }
      ]
    }
  },
  "size": 100,
  "_source": ["@timestamp", "kubernetes.container.start_time", "kubernetes.container.status"],
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}

returns

{
  "took": 1037,
  "timed_out": false,
  "_shards": {
    "total": 876,
    "successful": 876,
    "skipped": 864,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 31,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": ".ds-metrics-kubernetes.state_container-default-2022.07.05-000022",
        "_id": "T5Qg24EB_46CsX7rTtbZ",
        "_score": null,
        "_source": {
          "kubernetes": {
            "container": {
              "status": {
                "phase": "waiting",
                "reason": "CrashLoopBackOff",
                "ready": false,
                "restarts": 1271
              }
            }
          },
          "@timestamp": "2022-07-08T00:05:35.460Z"
        },
        "sort": [
          1657238735460
        ]
      },

... snip reports every 10 seconds ...

      {
        "_index": ".ds-metrics-kubernetes.state_container-default-2022.07.05-000022",
        "_id": "8pMc24EB_46CsX7rB3j_",
        "_score": null,
        "_source": {
          "kubernetes": {
            "container": {
              "status": {
                "phase": "waiting",
                "reason": "CrashLoopBackOff",
                "ready": false,
                "restarts": 1271
              }
            }
          },
          "@timestamp": "2022-07-08T00:00:55.460Z"
        },
        "sort": [
          1657238455460
        ]
      },
      {
        "_index": ".ds-metrics-kubernetes.state_container-default-2022.07.05-000022",
        "_id": "W5Mb24EB_46CsX7r4HTw",
        "_score": null,
        "_source": {
          "kubernetes": {
            "container": {
              "status": {
                "phase": "terminated",
                "reason": "Error",
                "ready": false,
                "restarts": 1271
              }
            }
          },
          "@timestamp": "2022-07-08T00:00:45.461Z"
        },
        "sort": [
          1657238445461
        ]
      },
      {
        "_index": ".ds-metrics-kubernetes.state_container-default-2022.07.05-000022",
        "_id": "xJMb24EB_46CsX7ruWzf",
        "_score": null,
        "_source": {
          "kubernetes": {
            "container": {
              "status": {
                "phase": "running",
                "ready": true,
                "restarts": 1271
              }
            }
          },
          "@timestamp": "2022-07-08T00:00:35.460Z"
        },
        "sort": [
          1657238435460
        ]
      }
    ]
  }
}

The second issue, also illustrated here, affects containers that are still running: the uptime displayed is relative to now, rather than to the times selected in the date picker. @alex-fedotyev inspired this nice visual to demonstrate the issue:

[image: diagram contrasting uptime measured to now vs. to the end of the selected time range]

@graphaelli added the bug and Team:APM labels Jul 8, 2022
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@dannycroft added the Team:Infra Monitoring UI label Jul 11, 2022
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@miltonhultgren
Contributor

I kind of expected this. We discussed it in the original ticket, but no one had a clear answer and we had so many other open questions that we just went with kubernetes.container/pod.start_time and never looked back.

The second issue seems easy enough to fix: rather than using Date.now(), we should use the end of the time range to calculate how long the container had been running from its start time until the end of that range.
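
For illustration, a minimal sketch of that adjustment, assuming the start time and range end are available as epoch milliseconds (the function name is hypothetical, not actual Kibana code):

// Hypothetical sketch: measure uptime against the end of the selected
// time range instead of Date.now().
function uptimeAtRangeEnd(startTimeMs: number, rangeEndMs: number): number {
  // A container started after the range end has no uptime within the range.
  return Math.max(0, rangeEndMs - startTimeMs);
}

// e.g. a container started at 14:22 in a window ending at 15:00 yields
// 38 minutes, instead of the larger value you get by measuring against now.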

The first issue is trickier, for several reasons, but let's see what options we might have.

Quick "fix":
Is it acceptable to hide the column for the GA release ? It reduces the overall value but at least it'll be only correct data shown.
We could look to replace it with some other relevant (and easier to compute) metric, like I requested here.
@alex-fedotyev @mlunadia

Real fix:
For all containers/pods returned by our metrics API, we would need to make another query to fetch the documents needed to determine the uptime and join that with the data from the original API call (there are some limitations to the currently used API).

What does the algorithm to actually determine the uptime look like? It's not clear to me which documents should be involved to make this as easy as possible.

Long fix:
This highlights again the shortcomings of the API we're using, as outlined partly in this issue and the comment I linked above (we cannot use keywords and can only compute metric aggregations).

For performance reasons, but also to gain flexibility in the data we show, we might want to build a dedicated API for this component.
That said, I imagine that calculating the uptime of Kubernetes resources is a common problem we'll face in the future, so I don't know where to place such logic.

Going further, thinking about a dashboard-driven future, how would a dashboard display such information?
Is it possible to alter the Kubernetes integration to report this data in a different way that would make it easier to compute, for example in a Lens visualization?
@jasonrhodes

@mlunadia

mlunadia commented Jul 12, 2022

Quick "fix":
Is it acceptable to hide the column for the GA release ? It reduces the overall value but at least it'll be only correct data shown.
We could look to replace it with some other relevant (and easier to compute) metric, like I requested #133123 (comment).

@alex-fedotyev let's sync on this. Do you have a meta issue to help me understand the reasoning behind this one? I'd like more context on who we are solving for and which problems, so that I can form a more informed opinion.

@miltonhultgren changed the title [APM] Infrastructure uptime for k8s container incorrect → [Infrastructure Monitoring UI] Uptime for Pod/Container in Metrics table is wrong Jul 19, 2022
@miltonhultgren removed the Team:APM label Jul 19, 2022
@miltonhultgren
Contributor

miltonhultgren commented Jul 19, 2022

As a stopgap we're going to hide this column for now and follow up on this issue with either a real fix or some other metrics.

#136596

For the follow-up, it would be good if we could use metrics that are easy to consume; if we want to display uptime, we should look to extend the Kubernetes module to collect it in a different way (similar to how the system module does).
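
For comparison, the Metricbeat system module reports uptime directly as system.uptime.duration.ms. A hypothetical analogous field for containers (not an existing field, purely illustrative) might look like:

{
  "@timestamp": "2022-07-08T00:05:35.460Z",
  "kubernetes": {
    "container": {
      "uptime": {
        "duration": { "ms": 1140000 }
      }
    }
  }
}

A field like that would let the UI (or a Lens visualization) show uptime with a plain latest/max aggregation instead of reconstructing it from status documents.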

@miltonhultgren changed the title [Infrastructure Monitoring UI] Uptime for Pod/Container in Metrics table is wrong → [Infra Monitoring UI] Uptime for Pod/Container in Metrics table is wrong Jul 19, 2022
miltonhultgren added a commit that referenced this issue Jul 19, 2022
* [Infra] Hide Uptime column in Pod/Container metrics table (#136047)

* Remove code instead of commenting it out

* Fully remove uptime related code
@pmeresanu85

@miltonhultgren @smith I moved it to "in progress" judging by the latest commit reference.

@matschaffer
Contributor

@pmeresanu85 do you know who has this in progress at the moment? I'm having trouble finding anything that looks like active work on this issue.

@pmeresanu85

@matschaffer I was referring to the previous pull request from the 19th of July, when I set this issue to "in progress". Is there a reason why this has been put on hold? Is it complete?

@miltonhultgren
Contributor

The previous PR just hid the column while we figure out what to actually show in the column and how to calculate the actual uptime.

@pmeresanu85

@miltonhultgren I assume this means we are actively working on this, based on your statement.

@miltonhultgren
Contributor

We = product people, yes. No engineer is working on this, but it ties into the larger discussion around what we're going to do with our multiple tables.

@matschaffer
Contributor

Moving back to “ready”

@miltonhultgren
Contributor

I'll put this back into refining since there isn't a clear choice of what to do.

@pmeresanu85

@miltonhultgren I assume that by "unclear what to do" you mean it still needs some reflection on the solution approach.

@miltonhultgren
Contributor

Yes, I see three options:

  1. Invest the time into actually calculating the real uptime (requires reading the right set of documents and tracing the state transitions to get the actual uptime, then joining that data into the other metrics rows; see the sketch below)
  2. Keep it hidden and forget about it (for now; maybe we can follow a track to move this computation to the agent/ingest side to make the future easier)
  3. Replace it with some other meaningful metric that is easier to compute with what we have today
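
A hedged sketch of option 1, assuming the document shape from the kubernetes.state_container results above (the helper itself is hypothetical, not existing Kibana code):

// Hypothetical sketch: walk one container's status documents in time order
// and sum the spans where the phase was "running". Assumes status reports
// keep arriving at the metricset period (every 10s in the results above),
// even after the container terminates.
interface StatusDoc {
  timestampMs: number; // parsed from @timestamp
  phase: 'running' | 'waiting' | 'terminated';
}

function realUptimeMs(docs: StatusDoc[], rangeEndMs: number): number {
  const sorted = [...docs].sort((a, b) => a.timestampMs - b.timestampMs);
  let uptimeMs = 0;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].phase !== 'running') continue;
    // Credit running time until the next status report; for the last
    // report, credit up to the end of the selected time range.
    const nextMs = i + 1 < sorted.length ? sorted[i + 1].timestampMs : rangeEndMs;
    uptimeMs += nextMs - sorted[i].timestampMs;
  }
  return uptimeMs;
}

The result would then have to be joined, per container/pod, with the rows from the metrics API (the "real fix" above).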

From my view, this is a decision to be made by PMs within APM.

@pmeresanu85

@miltonhultgren thank you for sharing your thoughts! Let me take this into the 1:1 with @alex-fedotyev and see how to proceed on this topic.

I will update this issue after the 1:1

@pmeresanu85

@miltonhultgren after my 1:1 with @alex-fedotyev about the Infra view in APM, we agreed that:

  • Infra drives Host Centric Observability View evolution
  • APM takes what infra builds and integrates this into APM service view

As for the uptime fix, I believe we should invest in actually calculating and showing the real uptime.

@sorenlouv
Member

sorenlouv commented Sep 19, 2022

Infra drives Host Centric Observability View evolution

Sounds good to me 👍
When we discussed this last, we decided that infra would fall back to APM-collected data if a certain metric was not available in the infra metric indices. To query APM data, you should use the APM Event client:

const apmEventClient = createApmEventClient({ request, context });

More info about apmEventClient: #82724

APM takes what infra builds and integrates this into APM service view

Will infra make React components available, or APIs? I prefer React components.

@pmeresanu85

@sqren I prefer React components too, but I'll leave it up to the team to decide what is best.

@smith added the needs-refinement label Sep 20, 2022
@smith
Contributor

smith commented Sep 20, 2022

At this time, putting the uptime back on the table in APM is not a huge priority. Removed from "Refining" on our board and added the needs-refinement label.

We probably want to see if this can be handled closer to ingest time, since calculating these on the client could be problematic.

The next best times to revisit this might be when we're adding container-oriented views to host observability, or when APM is making updates to the metrics tab and changes to how infrastructure data is displayed.

@jasonrhodes
Member

@miltonhultgren / @neptunian when you have time (no rush), can one or both of you walk me through what's missing in the current documents to calculate the uptime we want to see? Based on @alex-fedotyev's diagram in the issue description, it looks like we need "stop time" for both scenarios.

I see that @smith has de-prioritized this so don't take time away from other things to consider this, but I want to wrap my head around whether this is something we expect to be able to answer from signal documents or if it's something we should be considering from an inventory perspective, or both.

Anyway, the idea: based on the query parameters, is there a way to look at the bucket count for the documents returned for each node and compare it to what we'd expect, given how far the start time is from the beginning of the time window? For example:

Pod data metricset.period: 5 min
Current time: 15:35
Selected time range: 1 hour (14:00 - 15:00)
Start time: 14:22
Actual stop time: 14:41
Real uptime: 19 min

We'd currently calculate that uptime as 73 min, as I understand it (from start time to now, disregarding the time range end).
A small adjustment to take the time range end into account would calculate it as 38 minutes: better, but still wrong.

Could we do something like:

  • 38 minutes, calculated above, is max_uptime
  • Start time is :22, so we expect a max of 7 buckets (we'd normally expect 12 buckets for an hour, but we know we won't have :00, :05, :10, :15, or :20).
  • Query returns that we have 4 buckets for this node (:25, :30, :35, and :40)
  • missing_buckets = 7 - 4 = 3
  • Calculate uptime as a range of max_uptime - ((missing_buckets + 1) * interval) to max_uptime - (missing_buckets * interval) or 18-23 min?

The main thing I'm not sure of is whether the queries we currently run already return the number of buckets found within the given range, and whether we can easily compare that to the number of buckets we'd expect in that range based on the metricset.period. If those aren't possible/simple, then this is likely too complicated to attempt.
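
For concreteness, a sketch of that estimate using the numbers above (all names are illustrative, not existing Kibana code):

// Hypothetical sketch of the bucket-counting estimate. All durations are
// in minutes, to match the worked example.
interface UptimeEstimate {
  minMinutes: number;
  maxMinutes: number;
}

function estimateUptime(
  startOffsetMin: number, // container start relative to range start (22)
  rangeMin: number,       // length of the selected range (60)
  periodMin: number,      // metricset.period (5)
  bucketsFound: number    // buckets the query actually returned (4)
): UptimeEstimate {
  const maxUptime = rangeMin - startOffsetMin; // 38 min
  // Buckets possible after the start time: floor(38 / 5) = 7 (:25 .. :55).
  const expectedBuckets = Math.floor(maxUptime / periodMin);
  const missingBuckets = expectedBuckets - bucketsFound; // 7 - 4 = 3
  return {
    minMinutes: maxUptime - (missingBuckets + 1) * periodMin, // 18
    maxMinutes: maxUptime - missingBuckets * periodMin,       // 23
  };
}

// estimateUptime(22, 60, 5, 4) -> { minMinutes: 18, maxMinutes: 23 },
// bracketing the real 19-minute uptime from the example.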

@miltonhultgren added the Feature:Metrics UI label Jun 23, 2023
@roshan-elastic

Closing. Uptime is being deprecated, I believe @drewpost?

@smith
Contributor

smith commented Jul 11, 2023

@roshan-elastic this isn't talking about the Uptime app but about how long a container or pod has been "up".

We removed that column from the table and I don't think the underlying issue has been solved.

Since we're not showing these values in the UI at this time, I don't think we need to solve this, so it's okay for it to remain closed.

@smith closed this as not planned Jul 11, 2023
@roshan-elastic

Ah, thanks for the clarification @smith. I didn't realise.
