
Document prometheus metrics #2924

Closed
aledbf opened this issue Aug 10, 2018 · 70 comments · Fixed by #8728
Labels
area/docs help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/documentation Categorizes issue or PR as related to documentation. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@aledbf
Member

aledbf commented Aug 10, 2018

HELP nginx_ingress_controller_bytes_sent The number of bytes sent to a client
TYPE nginx_ingress_controller_bytes_sent histogram

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • host
  • ingress
  • method
  • namespace
  • path
  • service
  • status

HELP nginx_ingress_controller_config_hash Hash of the currently running configuration
TYPE nginx_ingress_controller_config_hash gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_config_last_reload_successful Whether the last configuration reload attempt was successful
TYPE nginx_ingress_controller_config_last_reload_successful gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_config_last_reload_successful_timestamp_seconds Timestamp of the last successful configuration reload.
TYPE nginx_ingress_controller_config_last_reload_successful_timestamp_seconds gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_ingress_upstream_latency_seconds Upstream service latency per Ingress
TYPE nginx_ingress_controller_ingress_upstream_latency_seconds summary

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • ingress
  • namespace
  • service

HELP nginx_ingress_controller_nginx_process_connections current number of client connections with state {reading, writing, waiting}
TYPE nginx_ingress_controller_nginx_process_connections gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • state (reading, waiting, writing)

HELP nginx_ingress_controller_nginx_process_connections_total total number of connections with state {active, accepted, handled}
TYPE nginx_ingress_controller_nginx_process_connections_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • state (accepted, active, handled)

HELP nginx_ingress_controller_nginx_process_cpu_seconds_total Cpu usage in seconds
TYPE nginx_ingress_controller_nginx_process_cpu_seconds_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_num_procs number of processes
TYPE nginx_ingress_controller_nginx_process_num_procs gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_oldest_start_time_seconds start time in seconds since 1970/01/01
TYPE nginx_ingress_controller_nginx_process_oldest_start_time_seconds gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_read_bytes_total number of bytes read
TYPE nginx_ingress_controller_nginx_process_read_bytes_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_requests_total total number of client requests
TYPE nginx_ingress_controller_nginx_process_requests_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_resident_memory_bytes number of bytes of memory in use
TYPE nginx_ingress_controller_nginx_process_resident_memory_bytes gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_virtual_memory_bytes number of bytes of memory in use
TYPE nginx_ingress_controller_nginx_process_virtual_memory_bytes gauge

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_nginx_process_write_bytes_total number of bytes written
TYPE nginx_ingress_controller_nginx_process_write_bytes_total counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod

HELP nginx_ingress_controller_request_duration_seconds The request processing time in seconds
TYPE nginx_ingress_controller_request_duration_seconds histogram

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • host
  • ingress
  • method
  • namespace
  • path
  • service
  • status

HELP nginx_ingress_controller_request_size The request length (including request line, header, and request body)
TYPE nginx_ingress_controller_request_size histogram

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • host
  • ingress
  • method
  • namespace
  • path
  • service
  • status

HELP nginx_ingress_controller_requests The total number of client requests.
TYPE nginx_ingress_controller_requests counter

Labels:

  • controller_class
  • controller_namespace
  • controller_pod
  • ingress
  • namespace
  • status
@aledbf
Member Author

aledbf commented Aug 10, 2018

Information about the Ingress controller pod:

  • controller_class
  • controller_namespace
  • controller_pod

Information about the Ingress rule:

  • ingress (name)
  • namespace
  • path (ingress path, not the complete URI in NGINX)
  • service (service name)
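
A minimal sketch of how these two label groups combine in a query: filter on the controller-pod labels and aggregate over the Ingress-rule labels (the controller_class value and the 5m window are assumptions, not defaults):

sum by (namespace, ingress, path, service) (
  rate(
    nginx_ingress_controller_request_duration_seconds_count{
      controller_class = "nginx"
    }[5m]
  )
)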

@aledbf aledbf added area/docs help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Aug 10, 2018
@aledbf
Member Author

aledbf commented Aug 15, 2018

Review missing nginx_upstream_requests_total metric

@andor44

andor44 commented Aug 15, 2018

Looking through the list above, nginx_ingress_controller_requests is actually pretty much what I want, and even better than the old nginx_upstream_requests_total: with this one I truly have the namespace and ingress information. With the old metrics, I had the name of the upstream, which was a concatenation of <namespace>-<service>-<port> and was tricky to handle if you had namespaces with dashes in their names.

@markfermor

markfermor commented Aug 15, 2018

Just looking through this - perhaps it's not available with the move away from VTS? But we appear to have latency, duration, size of requests, etc. on a per-service/upstream basis. However, I'm not seeing anything about the number of requests to a service/upstream being available. I presume that's what you are planning to look at as part of #2924 (comment)

Edit: ignore me - this looks to be available by ingress name rather than by service:
nginx_ingress_controller_requests{app="ingress-nginx-ext",controller_class="nginx-ext",controller_namespace="ingress-nginx",controller_pod="XX",exported_namespace="XX",ingress="upstream-ingress-name",instance="XX:XX",job="kubernetes-pods",kubernetes_pod_name="nginx-ingress-controller-ext-XX",namespace="ingress-nginx",pod_template_hash="XX",status="200"}
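
A minimal sketch of turning this into a per-Ingress request rate (the 5m window is an assumption; depending on scrape relabeling, the Ingress namespace may appear as exported_namespace, as in the sample above):

sum by (ingress, namespace, status) (
  rate(nginx_ingress_controller_requests[5m])
)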

@towolf
Contributor

towolf commented Aug 17, 2018

What is the difference between these two?
nginx_ingress_controller_response_size_sum
nginx_ingress_controller_bytes_sent_sum

I think they are identical? So the metrics are duplicated for no clear benefit?

And these metrics are histograms with very high cardinality and buckets that really do not make any sense:

{le="+Inf"}	47207
{le="0.005"}	0
{le="0.05"}	0
{le="0.25"}	0
{le="2.5"}	0
{le="0.01"}	0
{le="0.025"}	0
{le="0.1"}	0
{le="0.5"}	0
{le="1"}	0
{le="10"}	0
{le="5"}

There are no fractional bytes, so all data is in the +Inf bucket. I think counting bytes in a (non-configurable) histogram makes no sense.

These particular bytes-based metrics will lead to a combinatoric explosion in Prometheus, creating too many time series, since they combine le (12 series), method (2-x series), path (possibly infinite?), and status (also possibly dozens).
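
For a rough sense of scale (illustrative numbers, not measurements): 12 le buckets × 5 methods × 50 paths × 10 status codes already gives 12 × 5 × 50 × 10 = 30,000 series for a single host of a single metric.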

So I think these should be collected as simple counters, not histograms:

nginx_ingress_controller_bytes_sent_bucket
nginx_ingress_controller_request_size_bucket
nginx_ingress_controller_response_size_bucket

I have to say the structure of the VTS metrics (after latest updates) was much better.
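
For comparison, the totals are already usable without the buckets, because the _sum series of each histogram is itself a counter; a minimal sketch (the 5m window is an assumption):

sum by (ingress) (
  rate(nginx_ingress_controller_bytes_sent_sum[5m])
)

This yields bytes sent per second per Ingress, so dropping the _bucket series would lose nothing for this use case.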

@estahn

estahn commented Aug 23, 2018

I'm trying to understand how nginx_ingress_controller_ingress_upstream_latency_seconds_sum can be negative. I would assume the request doesn't time travel ⌛️

An explanation would be appreciated.

Also, is there an average available? I only saw quantiles – which is great, btw.

@aledbf
Member Author

aledbf commented Aug 23, 2018

@estahn this was fixed in 0.18.0 #2844

@Globegitter
Contributor

Globegitter commented Sep 5, 2018

@aledbf Question about the metrics: we make use of the server-snippet ingress annotation to have a custom proxy_pass to a non-k8s service in certain circumstances, as we are currently in a migration phase (and to a normal k8s service in the default case). Is there currently any way to see metrics for this? I.e. how many requests got proxy_passed to the default k8s service and how many through our custom snippet?

Edit: From what I have found, no. It is not a big deal, as we have now added a Prometheus exporter to our k8s app itself, so we can monitor overall traffic to the ingress as well as the traffic that actually reaches the pods.

@estahn

estahn commented Sep 6, 2018

@aledbf

  1. I'm trying to figure out how to calculate the average for e.g. response_duration. Would this be correct (see also the sketch after this list)?
sum(nginx_ingress_controller_response_duration_seconds_sum{ingress="$ingress"}) /
 sum(nginx_ingress_controller_response_duration_seconds_count{ingress="$ingress"})
  2. In regards to nginx_ingress_controller_request_duration_seconds_bucket, I understand that each bucket has the value of the previous bucket plus its own. How is this being used?
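
On the first question, a common pattern is to rate() both series before dividing, so that counter resets and scrape intervals are handled; a minimal sketch using the request_duration_seconds metric documented above (the 5m window and the $ingress variable are assumptions; the same pattern applies to the response_duration series in the question):

sum(rate(nginx_ingress_controller_request_duration_seconds_sum{ingress="$ingress"}[5m]))
/
sum(rate(nginx_ingress_controller_request_duration_seconds_count{ingress="$ingress"}[5m]))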

@towolf
Contributor

towolf commented Sep 10, 2018

@estahn You can only use histograms usefully by working with the le label. This can be done, for instance, in a heatmap in histogram mode in Grafana, or by transforming the histogram into percentiles using the histogram_quantile function.

Here's an example for the first case:

sum by (le)(
  increase(
    nginx_ingress_controller_request_duration_seconds_bucket{
      controller_class =~ "$controller_class",
      namespace =~ "$namespace",
      ingress =~ "$ingress"
    }[$interval]
  )
)


Here's an example for the second case:

histogram_quantile(
  0.99,
  sum by (le)(
    rate(
      nginx_ingress_controller_request_duration_seconds_bucket{
        controller_class =~ "$controller_class",
        namespace =~ "$namespace",
        ingress =~ "$ingress"
      }[$interval]
    )
  )
)


@luispollo

luispollo commented Sep 24, 2018

In addition to nginx_ingress_controller_requests, which captures aggregate metrics at the ingress level, are there any plans to expose metrics on a per-upstream-endpoint basis? That would be useful to support Horizontal Pod Autoscaling with custom metrics, since the ingress controller is ideally positioned to collect those metrics (as opposed to having every service pod expose HTTP metrics).

It looks as though the Lua monitoring may already collect those metrics but they're just not being exposed to Prometheus?

@aledbf
Member Author

aledbf commented Sep 24, 2018

It looks as though the Lua monitoring may already collect those metrics but they're just not being exposed to Prometheus?

Yes

In addition to nginx_ingress_controller_requests, which captures aggregate metrics at the ingress level, are there any plans to expose metrics on a per-upstream-endpoint basis?

The problem with this (0.16.0 contains this feature) is the explosion of metrics because of the label cardinality.

We are exploring how to enable this in a controlled way to avoid this issue.

@luispollo

@luispollo

Sounds good, @aledbf. Is there a separate issue tracking this item? Thanks for the update.

@towolf
Contributor

towolf commented Sep 25, 2018

IMHO the metrics should work just like the most recent native Prometheus export of the VTS module, with configurable buckets, upstream metrics, etc.

It's just that special care has to be taken that not all metrics carry all label combinations; otherwise this will lead to a DoS of the Prometheus server.

For instance, the upstreams/endpoints should probably not have all dimensions in terms of request method, request path, etc.
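
A quick way to check how many series each of these metrics actually produces in a given setup; a minimal sketch (the topk limit is arbitrary):

topk(10,
  count by (__name__)(
    {__name__=~"nginx_ingress_controller_.+"}
  )
)

High counts on the _bucket series are usually the first sign of the label explosion described above.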

@luispollo

P.S. @aledbf Looking at the changes in #2701 and later, it looks like the focus was on removing labels related to client information (remoteAddr, remoteUser, etc.), whereas my question was about labels identifying the target upstream pods.

In particular, there's an endpoint field from the Lua monitor that looks like it may have the info I'm after, and that is currently commented out in the labels:

requestTags = []string{
    "host",
    "status",
    "method",
    "path",
    // "endpoint",
    "namespace",
    "ingress",
    "service",
}
)

It seems the cardinality of that label would only increase with the scale of your service pods, which I would hope is several orders of magnitude lower than the number of clients. Would you consider adding that label perhaps?

@aledbf
Member Author

aledbf commented Sep 25, 2018

Would you consider adding that label perhaps?

This is one of the labels that cause the high cardinality of metrics.

@luispollo

Understood. Thanks for the quick reply.

@rafaeljesus

rafaeljesus commented Sep 27, 2018

I am seeing a max of 10s in my request latency. I've noticed the max bucket is 10s; can we have more bucket values for the latency metric?

@aledbf any thoughts?

@StianOvrevage

I'll echo some of the previous comments urging per-upstream metrics.

When you have a service with N endpoints and experience latency or request errors in the aggregate, it's really helpful to be able to drill down to a specific upstream pod when troubleshooting.

For a lot of us this dimension might grow by a few hundred per day, whilst User, RequestIP, etc. are in the millions.

The best option might be to make this configurable. There is a big difference in the cardinality of Users, RequestIPs, etc. between a corporate environment and a public API, for example, and the former might be very willing to pay the price for having those metrics.

@k0nstantinv

Question about the nginx_ingress_controller_success and nginx_ingress_controller_errors metrics. For example, I have these Prometheus outputs:

nginx_ingress_controller_success{class="nginx",controller_revision_hash="2071021497",instance="10.9.22.25:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-4v7v6",name="nginx-ingress-lb",namespace="kube-system"}

value=15000

and

nginx_ingress_controller_errors{class="service",controller_revision_hash="3065885245",instance="10.2.2.17:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-service-f5z4g",name="nginx-ingress-lb-service",namespace="kube-system"} 

value=1

So, what does that mean in practice?
Previous versions contained the metric ingress_controller_success with a label count=reloads, like:

ingress_controller_success{count="reloads",instance="10.3.2.101:10254",job="pods",kubernetes_namespace="kube-system",kubernetes_pod_name="nginx-ingress-lb-4mzlf",name="nginx-ingress-lb"}

and it was clear. Now I have no idea what that means.
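
Not an answer to what these two counters mean, but for reload health specifically the gauge documented at the top of this issue can be queried directly; a minimal alert-style sketch (assuming the gauge is 1 on success and 0 on failure):

nginx_ingress_controller_config_last_reload_successful == 0

This returns a series for every controller pod whose last configuration reload failed.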

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 19, 2019
@towolf
Contributor

towolf commented Apr 16, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 16, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@iamNoah1
Contributor

Thanks @kd7lxl and @AndrewFarley
/close

@k8s-ci-robot
Contributor

@iamNoah1: Closing this issue.

In response to this:

Thanks @kd7lxl and @AndrewFarley
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@towolf
Contributor

towolf commented Jan 18, 2022

@iamNoah1 this issue is not really solved by virtue of implementing custom bucket sizes, is it? There's still the question of documenting the metrics in terms of what they all mean.

@iamNoah1
Contributor

@towolf yes, you are right; I was too fast in closing this one. Do you think it would make sense for you to open a new issue, so that we don't have all the clutter in the comments distracting from the actual issue?

@towolf
Contributor

towolf commented Jan 18, 2022

Actually I think this issue contained some worthwhile discussion points, but I guess not much of it is actually "actionable" as an issue, so I dunno.

@iamNoah1
Contributor

hmm ok, let's
/reopen

@k8s-ci-robot
Contributor

@iamNoah1: Reopened this issue.

In response to this:

hmm ok, let's
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jan 18, 2022
@k8s-ci-robot k8s-ci-robot added the needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Jan 18, 2022
@iamNoah1
Contributor

/kind documentation
/help

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jan 18, 2022
@naseemkullah
Contributor

naseemkullah commented Jan 27, 2022

Since path is the ingress path, not the complete URI in NGINX, the cardinality is low. Can we have this label on the nginx_ingress_controller_requests counter as well?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2022
@mindw
Contributor

mindw commented Apr 28, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2022
@strongjz
Member

@nailgun looks like the project has had this open for a while. Going to close #8727.

@nailgun
Contributor

nailgun commented Jun 22, 2022

@strongjz I will do it. But we need another patch with consistent metric naming first. I will prepare a decision-making doc tomorrow.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2022
@radioactive11

radioactive11 commented Sep 25, 2023

Is there any way of turning the endpoint label on?

@AakarshitAgarwal

AakarshitAgarwal commented Feb 22, 2024

I am seeing a max of 10s in my request latency. I've noticed the max bucket is 10s; can we have more bucket values for the latency metric?

@aledbf any thoughts?

Anyway, can we override the value or increase the bucket values?

@frittentheke

frittentheke commented Feb 23, 2024

I am seeing a max of 10s in my request latency. I've noticed the max bucket is 10s; can we have more bucket values for the latency metric?
@aledbf any thoughts?

Anyway, can we override the value or increase the bucket values?

Or have this correlate to the set timeout ... because that is the real upper limit ;-)
But since this is an old issue, you might want to raise a new one.

@jybp

jybp commented Apr 17, 2024

So the 10s bucket is hardcoded and there's no way to add more?!
