
Document prometheus metrics #2924

Closed
aledbf opened this issue Aug 10, 2018 · 70 comments · Fixed by #8728
Labels
area/docs help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/documentation Categorizes issue or PR as related to documentation. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@aledbf
Member

aledbf commented Aug 10, 2018

HELP nginx_ingress_controller_bytes_sent The number of bytes sent to a client
TYPE nginx_ingress_controller_bytes_sent histogram

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • host
  • ingress
  • method
  • namespace
  • path
  • service
  • status

HELP nginx_ingress_controller_config_hash Hash of the currently running configuration
TYPE nginx_ingress_controller_config_hash gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_config_last_reload_successful Whether the last configuration reload attempt was successful
TYPE nginx_ingress_controller_config_last_reload_successful gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_config_last_reload_successful_timestamp_seconds Timestamp of the last successful configuration reload.
TYPE nginx_ingress_controller_config_last_reload_successful_timestamp_seconds gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_ingress_upstream_latency_seconds Upstream service latency per Ingress
TYPE nginx_ingress_controller_ingress_upstream_latency_seconds summary

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • ingress
  • namespace
  • service

HELP nginx_ingress_controller_nginx_process_connections current number of client connections with state {reading, writing, waiting}
TYPE nginx_ingress_controller_nginx_process_connections gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • state (reading, waiting, writing)

HELP nginx_ingress_controller_nginx_process_connections_total total number of connections with state {active, accepted, handled}
TYPE nginx_ingress_controller_nginx_process_connections_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • state (accepted, active, handled)

HELP nginx_ingress_controller_nginx_process_cpu_seconds_total Cpu usage in seconds
TYPE nginx_ingress_controller_nginx_process_cpu_seconds_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_num_procs number of processes
TYPE nginx_ingress_controller_nginx_process_num_procs gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_oldest_start_time_seconds start time in seconds since 1970/01/01
TYPE nginx_ingress_controller_nginx_process_oldest_start_time_seconds gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_read_bytes_total number of bytes read
TYPE nginx_ingress_controller_nginx_process_read_bytes_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_requests_total total number of client requests
TYPE nginx_ingress_controller_nginx_process_requests_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_resident_memory_bytes number of bytes of memory in use
TYPE nginx_ingress_controller_nginx_process_resident_memory_bytes gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_virtual_memory_bytes number of bytes of memory in use
TYPE nginx_ingress_controller_nginx_process_virtual_memory_bytes gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_write_bytes_total number of bytes written
TYPE nginx_ingress_controller_nginx_process_write_bytes_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_request_duration_seconds The request processing time in seconds
TYPE nginx_ingress_controller_request_duration_seconds histogram

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • host
  • ingress
  • method
  • namespace
  • path
  • service
  • status

HELP nginx_ingress_controller_request_size The request length (including request line, header, and request body)
TYPE nginx_ingress_controller_request_size histogram

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • host
  • ingress
  • method
  • namespace
  • path
  • service
  • status

HELP nginx_ingress_controller_requests The total number of client requests.
TYPE nginx_ingress_controller_requests counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • ingress
  • namespace
  • status
@aledbf
Member Author

aledbf commented Aug 10, 2018

Information about the Ingress controller pod:

  • controller_class
  • controller_namespace
  • controller_pod

Information about the Ingress rule:

  • ingress (name)
  • namespace
  • path (ingress path, not the complete URI in NGINX)
  • service (service name)
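
A minimal sketch of how these two label groups combine in a query: filter on the controller-pod labels and aggregate over the Ingress-rule labels (the controller_class value and the 5m window are assumptions, not defaults):

sum by (namespace, ingress, path, service) (
  rate(
    nginx_ingress_controller_request_duration_seconds_count{
      controller_class = "nginx"
    }[5m]
  )
)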

@aledbf aledbf added area/docs help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Aug 10, 2018
@aledbf
Member Author

aledbf commented Aug 15, 2018

Review missing nginx_upstream_requests_total metric

@andor44

andor44 commented Aug 15, 2018

Looking through the list above, nginx_ingress_controller_requests is actually pretty much what I want, and even better than the old nginx_upstream_requests_total: with this one I truly have the namespace and ingress information. With the old metrics, I had the name of the upstream, which was a concatenation of <namespace>-<service>-<port> and was tricky to handle if you had namespaces with dashes in their names.

@markfermor

markfermor commented Aug 15, 2018

Just looking through this - perhaps it's not available with the move away from VTS? But we appear to have latency, duration, size of requests, etc. on a per-service/upstream basis. However, I'm not seeing anything about the number of requests to a service/upstream being available. I presume that's what you are planning to look at as part of #2924 (comment)

Edit: ignore me - this looks to be available by ingress name rather than by service:
nginx_ingress_controller_requests{app="ingress-nginx-ext",controller_class="nginx-ext",controller_namespace="ingress-nginx",controller_pod="XX",exported_namespace="XX",ingress="upstream-ingress-name",instance="XX:XX",job="kubernetes-pods",kubernetes_pod_name="nginx-ingress-controller-ext-XX",namespace="ingress-nginx",pod_template_hash="XX",status="200"}
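
A minimal sketch of turning this into a per-Ingress request rate (the 5m window is an assumption; depending on scrape relabeling, the Ingress namespace may appear as exported_namespace, as in the sample above):

sum by (ingress, namespace, status) (
  rate(nginx_ingress_controller_requests[5m])
)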

@towolf
Contributor

towolf commented Aug 17, 2018

What is the difference between these two?
nginx_ingress_controller_response_size_sum
nginx_ingress_controller_bytes_sent_sum

I think they are identical? So the metrics are duplicated for no clear benefit?

And these metrics are histograms with very high cardinality and buckets that really do not make any sense:

{le="+Inf"}	47207
{le="0.005"}	0
{le="0.05"}	0
{le="0.25"}	0
{le="2.5"}	0
{le="0.01"}	0
{le="0.025"}	0
{le="0.1"}	0
{le="0.5"}	0
{le="1"}	0
{le="10"}	0
{le="5"}

There are no fractional bytes, so all data is in the +Inf bucket. I think counting bytes in a (non-configurable) histogram makes no sense.

These particular bytes-based metrics will lead to a combinatoric explosion in Prometheus, creating too many time series, since they combine le (12 series), method (2-x series), path (possibly infinite?), and status (also possibly dozens).
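
For a rough sense of scale (illustrative numbers, not measurements): 12 le buckets × 5 methods × 50 paths × 10 status codes already gives 12 × 5 × 50 × 10 = 30,000 series for a single host of a single metric.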

So I think these should be collected as simple counters, not histograms:

nginx_ingress_controller_bytes_sent_bucket
nginx_ingress_controller_request_size_bucket
nginx_ingress_controller_response_size_bucket

I have to say the structure of the VTS metrics (after latest updates) was much better.
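
For comparison, the totals are already usable without the buckets, because the _sum series of each histogram is itself a counter; a minimal sketch (the 5m window is an assumption):

sum by (ingress) (
  rate(nginx_ingress_controller_bytes_sent_sum[5m])
)

This yields bytes sent per second per Ingress, so dropping the _bucket series would lose nothing for this use case.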

@estahn

estahn commented Aug 23, 2018

I'm trying to understand how nginx_ingress_controller_ingress_upstream_latency_seconds_sum can be negative. I would assume the request doesn't time travel ⌛️

An explanation would be appreciated.

Also, is there an average available? I only saw quantiles – which is great, btw.

@aledbf
Member Author

aledbf commented Aug 23, 2018

@estahn this was fixed in 0.18.0 #2844

@Globegitter
Contributor

Globegitter commented Sep 5, 2018

@aledbf Question about the metrics: we make use of the server-snippet ingress annotation to have a custom proxy_pass to a non-k8s service in certain circumstances, as we are currently in a migration phase (and to a normal k8s service in the default case). Is there currently any way to see metrics for this? I.e. how many requests got proxy_passed to the default k8s service and how many through our custom snippet?

Edit: From what I have found, no. It is not a big deal, as we have now added a Prometheus exporter to our k8s app itself, so we can monitor overall traffic to the ingress as well as the traffic that actually reaches the pods.

@estahn

estahn commented Sep 6, 2018

@aledbf

  1. I'm trying to figure out how to calculate the average for e.g. response_duration. Would this be correct (see also the sketch after this list)?
sum(nginx_ingress_controller_response_duration_seconds_sum{ingress="$ingress"}) /
 sum(nginx_ingress_controller_response_duration_seconds_count{ingress="$ingress"})
  2. In regards to nginx_ingress_controller_request_duration_seconds_bucket, I understand that each bucket has the value of the previous bucket plus its own. How is this being used?
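
On the first question, a common pattern is to rate() both series before dividing, so that counter resets and scrape intervals are handled; a minimal sketch using the request_duration_seconds metric documented above (the 5m window and the $ingress variable are assumptions; the same pattern applies to the response_duration series in the question):

sum(rate(nginx_ingress_controller_request_duration_seconds_sum{ingress="$ingress"}[5m]))
/
sum(rate(nginx_ingress_controller_request_duration_seconds_count{ingress="$ingress"}[5m]))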

@towolf
Contributor

towolf commented Sep 10, 2018

@estahn You can only use histograms usefully by working with the le label. This can be done, for instance, in a heatmap in histogram mode in Grafana, or by transforming the histogram into percentiles using the histogram_quantile function.

Here's an example for the first case:

sum by (le)(
  increase(
    nginx_ingress_controller_request_duration_seconds_bucket{
      controller_class =~ "$controller_class",
      namespace =~ "$namespace",
      ingress =~ "$ingress"
    }[$interval]
  )
)


Here's an example for the second case:

histogram_quantile(
  0.99,
  sum by (le)(
    rate(
      nginx_ingress_controller_request_duration_seconds_bucket{
        controller_class =~ "$controller_class",
        namespace =~ "$namespace",
        ingress =~ "$ingress"
      }[$interval]
    )
  )
)


@luispollo

luispollo commented Sep 24, 2018

In addition to nginx_ingress_controller_requests, which captures aggregate metrics at the ingress level, are there any plans to expose metrics on a per-upstream-endpoint basis? That would be useful to support Horizontal Pod Autoscaling with custom metrics, since the ingress controller is ideally positioned to collect those metrics (as opposed to having every service pod expose HTTP metrics).

It looks as though the Lua monitoring may already collect those metrics but they're just not being exposed to Prometheus?

@aledbf
Member Author

aledbf commented Sep 24, 2018

It looks as though the Lua monitoring may already collect those metrics but they're just not being exposed to Prometheus?

Yes

In addition to nginx_ingress_controller_requests, which captures aggregate metrics at the ingress level, are there any plans to expose metrics on a per-upstream-endpoint basis?

The problem with this (0.16.0 contains this feature) is the explosion of metrics because of the label cardinality.

We are exploring how to enable this in a controlled way to avoid this issue.

@luispollo

@luispollo

Sounds good, @aledbf. Is there a separate issue tracking this item? Thanks for the update.

@towolf
Contributor

towolf commented Sep 25, 2018

IMHO the metrics should work just like the most recent native Prometheus export of the VTS module, with configurable buckets, upstream metrics, etc.

It's just that special care has to be taken that not all metrics carry all label combinations; otherwise this will lead to a DoS of the Prometheus server.

For instance, the upstreams/endpoints should probably not have all dimensions in terms of request method, request path, etc.
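
A quick way to check how many series each of these metrics actually produces in a given setup; a minimal sketch (the topk limit is arbitrary):

topk(10,
  count by (__name__)(
    {__name__=~"nginx_ingress_controller_.+"}
  )
)

High counts on the _bucket series are usually the first sign of the label explosion described above.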

@luispollo

P.S. @aledbf Looking at the changes in #2701 and later, it looks like the focus was on removing labels related to client information (remoteAddr, remoteUser, etc.), whereas my question was about labels identifying the target upstream pods.

In particular, there's an endpoint field from the Lua monitor that looks like it may have the info I'm after, and that is currently commented out in the labels:

requestTags = []string{
    "host",
    "status",
    "method",
    "path",
    // "endpoint",
    "namespace",
    "ingress",
    "service",
}
)

It seems the cardinality of that label would only increase with the scale of your service pods, which I would hope is several orders of magnitude lower than the number of clients. Would you consider adding that label perhaps?

@aledbf
Member Author

aledbf commented Sep 25, 2018

Would you consider adding that label perhaps?

This is one of the labels that cause the high cardinality of metrics.

@luispollo

Understood. Thanks for the quick reply.

@rafaeljesus

rafaeljesus commented Sep 27, 2018

I am seeing a max of 10s in my request latency. I've noticed the max bucket is 10s; can we have more bucket values for the latency metric?

@aledbf any thoughts?

@StianOvrevage

I'll echo some of the previous comments urging per-upstream metrics.

When you have a service with N endpoints and experience latency or request errors in the aggregate, it's really helpful to be able to drill down to a specific upstream pod when troubleshooting.

For a lot of us this dimension might grow by a few hundred per day, whilst User, RequestIP, etc. are in the millions.

The best option might be to make this configurable. There is a big difference in the cardinality of Users, RequestIPs, etc. between a corporate environment and a public API, for example, and the former might be very willing to pay the price for having those metrics.

@k0nstantinv

Question about the nginx_ingress_controller_success and nginx_ingress_controller_errors metrics. For example, I have these Prometheus outputs:

nginx_ingress_controller_success{class="nginx",controller_revision_hash="2071021497",instance="10.9.22.25:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-4v7v6",name="nginx-ingress-lb",namespace="kube-system"}

value=15000

and

nginx_ingress_controller_errors{class="service",controller_revision_hash="3065885245",instance="10.2.2.17:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-service-f5z4g",name="nginx-ingress-lb-service",namespace="kube-system"} 

value=1

So, what does that mean in practice?
Previous versions contained the metric ingress_controller_success with a label count=reloads, like:

ingress_controller_success{count="reloads",instance="10.3.2.101:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-4mzlf",name="nginx-ingress-lb"}

and it was clear. Now I have no idea what that means.
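
Not an answer to what these two counters mean, but for reload health specifically the gauge documented at the top of this issue can be queried directly; a minimal alert-style sketch (assuming the gauge is 1 on success and 0 on failure):

nginx_ingress_controller_config_last_reload_successful == 0

This returns a series for every controller pod whose last configuration reload failed.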

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 19, 2019
@towolf
Contributor

towolf commented Apr 16, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 16, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@iamNoah1
Contributor

Thanks @kd7lxl and @AndrewFarley
/close

@k8s-ci-robot
Contributor

@iamNoah1: Closing this issue.

In response to this:

Thanks @kd7lxl and @AndrewFarley
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@towolf
Contributor

towolf commented Jan 18, 2022

@iamNoah1 this issue is not really solved by virtue of implementing custom bucket sizes, is it? There's still the question of documenting the metrics in terms of what they all mean.

@iamNoah1
Contributor

@towolf yes, you are right; I was too fast in closing this one. Do you think it would make sense for you to open a new issue, so that we don't have all the clutter in the comments distracting from the actual issue?

@towolf
Contributor

towolf commented Jan 18, 2022

Actually I think this issue contained some worthwhile discussion points, but I guess not much of it is actually "actionable" as an issue, so I dunno.

@iamNoah1
Contributor

hmm ok, let's
/reopen

@k8s-ci-robot
Contributor

@iamNoah1: Reopened this issue.

In response to this:

hmm ok, let's
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jan 18, 2022
@k8s-ci-robot k8s-ci-robot added the needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Jan 18, 2022
@iamNoah1
Contributor

/kind documentation
/help

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jan 18, 2022
@naseemkullah
Contributor

naseemkullah commented Jan 27, 2022

Since path is the ingress path, not the complete URI in NGINX, the cardinality is low. Can we have this label on the nginx_ingress_controller_requests counter as well?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2022
@mindw
Contributor

mindw commented Apr 28, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2022
@strongjz
Member

@nailgun looks like the project has had this open for a while. Going to close #8727.

@nailgun
Contributor

nailgun commented Jun 22, 2022

@strongjz I will do it. But we need another patch with consistent metric naming first. I will prepare a decision-making doc tomorrow.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2022
@radioactive11

radioactive11 commented Sep 25, 2023

Is there any way of turning the endpoint label on?

@AakarshitAgarwal

AakarshitAgarwal commented Feb 22, 2024

I am seeing a max of 10s in my request latency. I've noticed the max bucket is 10s; can we have more bucket values for the latency metric?

@aledbf any thoughts?

Anyway, can we override the value or increase the bucket values?

@frittentheke

frittentheke commented Feb 23, 2024

I am seeing a max of 10s in my request latency. I've noticed the max bucket is 10s; can we have more bucket values for the latency metric?
@aledbf any thoughts?

Anyway, can we override the value or increase the bucket values?

Or have this correlate to the set timeout ... because that is the real upper limit ;-)
But since this is an old issue, you might want to raise a new one.

@jybp

jybp commented Apr 17, 2024

So the 10s bucket is hardcoded and there's no way to add more?!
