Document prometheus metrics #2924
Information about the Ingress controller POD:
Information about the Ingress rule:
Review missing nginx_upstream_requests_total metric
Looking through the list above,
Just looking through this - perhaps it's not available with the move away from VTS? But we appear to have: Edit: ignore me, it looks to be available by ingress name rather than by service.
What is the difference between these two? I think they are identical, so the metrics are duplicated for no clear benefit. And these metrics are histograms with very high cardinality, with buckets that really do not make any sense:
There are no fractional bytes, so all the data ends up in the top bucket. These particular bytes-based metrics will lead to a combinatoric explosion in Prometheus, creating far too many time series, since they combine several labels with a full set of buckets. So I think these should be collected as simple counters, not histograms:
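To make the counter-vs-histogram point concrete, here is a minimal sketch with client_golang (label names and bucket boundaries are illustrative, not the controller's actual ones): a histogram produces one series per label combination per bucket, while a plain counter produces a single series per label combination.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical label set; the real collector uses its own labels.
var requestLabels = []string{"ingress", "namespace", "status"}

// As a histogram, every label combination produces roughly (buckets + 3)
// series (one per bucket, plus +Inf, _sum and _count).
var bytesSentHistogram = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "bytes_sent",
		Help:    "Bytes sent to clients (histogram form).",
		Buckets: prometheus.ExponentialBuckets(100, 10, 6), // 100 B .. 10 MB
	},
	requestLabels,
)

// As a plain counter, the same label combination produces a single series,
// which is enough to graph throughput with rate().
var bytesSentCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "bytes_sent_total",
		Help: "Bytes sent to clients (counter form).",
	},
	requestLabels,
)

func main() {
	prometheus.MustRegister(bytesSentHistogram, bytesSentCounter)

	// Record a response of 2048 bytes in both forms for comparison.
	bytesSentHistogram.WithLabelValues("my-ingress", "default", "200").Observe(2048)
	bytesSentCounter.WithLabelValues("my-ingress", "default", "200").Add(2048)
}
```

Throughput can still be graphed from the counter with `rate(bytes_sent_total[5m])`; the extra bucket series buy nothing for byte counts.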
I have to say the structure of the VTS metrics (after the latest updates) was much better.
@aledbf Question about the metrics: we make use of the ingress annotation server-snippet to have a custom configuration. Edit: From what I have found, no, and it is not a big deal, as we just added a Prometheus exporter to our k8s app itself, so we can monitor overall traffic to the ingress as well as the traffic that actually reached the pods.
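For anyone taking the same route, a minimal sketch of such an in-app exporter with client_golang and promhttp (the metric name, label names and port are made up for illustration):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// appRequests counts requests that actually reached the pod, as opposed to
// what the ingress controller saw in front of it.
var appRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "Requests handled by the application itself.",
	},
	[]string{"path", "code"},
)

func main() {
	prometheus.MustRegister(appRequests)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		appRequests.WithLabelValues(r.URL.Path, "200").Inc()
		w.Write([]byte("ok"))
	})

	// Scrape target for Prometheus, e.g. discovered via pod annotations.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```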
@estahn You can only use histograms usefully by staggering the buckets. Here's an example for the first case:
Here's an example for the second case:
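As an illustration of bucket staggering, a minimal Go sketch with client_golang; the boundaries below are assumptions chosen so that latencies and sizes each spread across informative buckets, not the values used by the controller:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Case 1: duration buckets staggered from a few milliseconds up to roughly
// the proxy timeout, so both fast and slow requests land in useful buckets.
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "request_duration_seconds",
	Help:    "Request processing time in seconds.",
	Buckets: []float64{0.005, 0.025, 0.1, 0.5, 1, 2.5, 5, 10, 30, 60},
})

// Case 2: size buckets staggered in powers of ten, since response sizes span
// several orders of magnitude and fractional-byte buckets are useless.
var responseSize = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "response_size_bytes",
	Help:    "Response size in bytes.",
	Buckets: prometheus.ExponentialBuckets(100, 10, 7), // 100 B .. 100 MB
})

func main() {
	prometheus.MustRegister(requestDuration, responseSize)
	requestDuration.Observe(0.042)
	responseSize.Observe(15360)
}
```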
In addition to the above, it looks as though the Lua monitoring may already collect those metrics, but they're just not being exposed to Prometheus?
Yes
The problem with this (0.16.0 contains this feature) is the explosion of metrics because of the label cardinality. We are exploring how to enable this in a controlled way to avoid this issue.
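One possible shape for such controlled enablement (purely a sketch; the flag name and label set are hypothetical, not an existing controller option) is to make high-cardinality labels opt-in at startup:

```go
package main

import (
	"flag"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical switch: high-cardinality labels are off unless requested.
var enableEndpointLabel = flag.Bool("metrics-per-endpoint", false,
	"add an endpoint label to request metrics (higher cardinality)")

func newRequestCounter() *prometheus.CounterVec {
	labels := []string{"ingress", "namespace", "status"}
	if *enableEndpointLabel {
		labels = append(labels, "endpoint")
	}
	return prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "requests_total",
		Help: "Client requests.",
	}, labels)
}

func main() {
	flag.Parse()
	prometheus.MustRegister(newRequestCounter())
}
```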
Sounds good, @aledbf. Is there a separate issue tracking this item? Thanks for the update.
IMHO metrics should work just like the most recent native Prometheus export of the VTS module works: with configurable buckets, with upstream metrics, etc. It's just that special care has to be taken that not all metrics have all label combinations, since that will lead to a DoS of the Prometheus server. For instance, the upstreams/endpoints should probably not have all dimensions in terms of request method, request path, etc.
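In other words, each metric would carry only the label subset that makes sense for it. A sketch of what that could look like (metric names, labels and buckets are illustrative):

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Request-level counter: fine to carry method and status, because those
// dimensions are bounded.
var requests = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "requests_total",
	Help: "Client requests.",
}, []string{"ingress", "namespace", "method", "status"})

// Upstream latency: deliberately keeps a smaller label set, so that adding an
// endpoint dimension later does not multiply with method/path/status.
var upstreamLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "upstream_latency_seconds",
	Help:    "Upstream latency.",
	Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 5, 10},
}, []string{"ingress", "namespace"})

func main() {
	prometheus.MustRegister(requests, upstreamLatency)
	requests.WithLabelValues("my-ingress", "default", "GET", "200").Inc()
	upstreamLatency.WithLabelValues("my-ingress", "default").Observe(0.12)
}
```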
P.S. @aledbf Looking at the changes in #2701 and later, it looks like the focus was on removing labels related to client information. In particular, there's ingress-nginx/internal/ingress/metric/collectors/socket.go, lines 83 to 98 in 68357f8.
It seems the cardinality of that label would only increase with the scale of your service pods, which I would hope is several orders of magnitude lower than the number of clients. Would you consider adding that label perhaps?
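Rough back-of-the-envelope arithmetic for why an endpoint label is far cheaper than client-derived labels (all numbers below are made-up examples, not measurements from a real cluster):

```go
package main

import "fmt"

// Series counts for a single histogram metric under two labelling choices.
func main() {
	var (
		ingresses   int64 = 50
		statusCodes int64 = 10
		seriesPer   int64 = 13 // 10 bucket boundaries + +Inf, plus _sum and _count

		endpoints int64 = 20        // pods behind those ingresses
		clientIPs int64 = 1_000_000 // unique client addresses
	)

	withEndpointLabel := ingresses * statusCodes * endpoints * seriesPer
	withClientIPLabel := ingresses * statusCodes * clientIPs * seriesPer

	fmt.Println("series with an endpoint label:", withEndpointLabel) // 130000
	fmt.Println("series with a client IP label:", withClientIPLabel) // 6500000000
}
```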
This is one of the labels that cause the high cardinality of metrics.
Understood. Thanks for the quick reply.
I am seeing a max of 10s in my request latency. I've noticed the max bucket is 10s; can we have more bucket values for the latency metric? @aledbf any thoughts?
I'll echo some of the previous comments urging per-upstream metrics. When you have a service with N endpoints and you experience latency or request errors in the aggregate, it's really helpful to be able to drill down to a specific upstream pod when troubleshooting. For a lot of us this dimension might grow by a few hundred per day, whilst the User, RequestIP, etc. are in the millions. The best option might be to make this configurable. There is a big difference in the cardinality of Users, RequestIPs, etc. between a corporate environment and a public API, for example, and the former might be very willing to pay the price for having those metrics.
Question about metrics
and
So, what does that mean in practice?
and it was clear. Now I have no idea what that means.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Thanks @kd7lxl and @AndrewFarley |
@iamNoah1: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@iamNoah1 this issue is not really solved by virtue of implementing custom bucket sizes, is it? There's still the question of documenting the metrics in terms of what they all mean.
@towolf yes, you are right, I was too fast closing this one. Do you think it would make sense for you to open a new issue, so that we don't have all the clutter in the comments that distracts from the actual issue?
Actually I think this issue contained some worthwhile discussion points, but I guess not much of it is actually
hmm ok, let's /reopen
@iamNoah1: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind documentation
Since
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@strongjz I will do it. But we need another patch with consistent metric naming first. I will prepare a decision-making doc tomorrow.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Is there any way of turning the endpoint label on?
Anyway, can we override the value or increase the bucket values?
Or have this correlate to the configured timeout ... because that is the real upper limit ;-)
So the 10s bucket is hardcoded and there's no way to add more?!
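If the buckets were derived from the configured timeout, they could be generated rather than hardcoded. A minimal sketch, assuming a 60s proxy-read-timeout (the generator and its parameters are illustrative, not a controller option):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// bucketsUpTo returns doubling bucket boundaries from start until the
// configured timeout, so the top finite bucket matches the real upper limit
// of a request instead of an arbitrary 10s.
func bucketsUpTo(start, timeout float64) []float64 {
	var b []float64
	for v := start; v < timeout; v *= 2 {
		b = append(b, v)
	}
	return append(b, timeout)
}

func main() {
	// Assume proxy-read-timeout is 60s; adjust to your configuration.
	buckets := bucketsUpTo(0.05, 60)
	fmt.Println(buckets) // [0.05 0.1 0.2 0.4 0.8 1.6 3.2 6.4 12.8 25.6 51.2 60]

	hist := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "request_duration_seconds",
		Help:    "Request processing time in seconds.",
		Buckets: buckets,
	})
	prometheus.MustRegister(hist)
	hist.Observe(14.2)
}
```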
| Metric | Type | Description |
| --- | --- | --- |
| `nginx_ingress_controller_bytes_sent` | histogram | The number of bytes sent to a client |
| `nginx_ingress_controller_config_hash` | gauge | Hash of the currently running configuration |
| `nginx_ingress_controller_config_last_reload_successful` | gauge | Whether the last configuration reload attempt was successful |
| `nginx_ingress_controller_config_last_reload_successful_timestamp_seconds` | gauge | Timestamp of the last successful configuration reload |
| `nginx_ingress_controller_ingress_upstream_latency_seconds` | summary | Upstream service latency per Ingress |
| `nginx_ingress_controller_nginx_process_connections` | gauge | Current number of client connections with state {reading, writing, waiting} |
| `nginx_ingress_controller_nginx_process_connections_total` | counter | Total number of connections with state {active, accepted, handled} |
| `nginx_ingress_controller_nginx_process_cpu_seconds_total` | counter | CPU usage in seconds |
| `nginx_ingress_controller_nginx_process_num_procs` | gauge | Number of processes |
| `nginx_ingress_controller_nginx_process_oldest_start_time_seconds` | gauge | Start time in seconds since 1970/01/01 |
| `nginx_ingress_controller_nginx_process_read_bytes_total` | counter | Number of bytes read |
| `nginx_ingress_controller_nginx_process_requests_total` | counter | Total number of client requests |
| `nginx_ingress_controller_nginx_process_resident_memory_bytes` | gauge | Number of bytes of resident memory in use |
| `nginx_ingress_controller_nginx_process_virtual_memory_bytes` | gauge | Number of bytes of virtual memory in use |
| `nginx_ingress_controller_nginx_process_write_bytes_total` | counter | Number of bytes written |
| `nginx_ingress_controller_request_duration_seconds` | histogram | The request processing time in seconds |
| `nginx_ingress_controller_request_size` | histogram | The request length (including request line, header, and request body) |
| `nginx_ingress_controller_requests` | counter | The total number of client requests |
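For reference, the p95 request latency per ingress can be derived from the `nginx_ingress_controller_request_duration_seconds` histogram above with `histogram_quantile`. A minimal Go sketch using the Prometheus HTTP API client; the server address and the `ingress` label used for grouping are assumptions to adjust for your deployment:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// The Prometheus server address is an assumption; adjust as needed.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// p95 request latency per ingress over the last 5 minutes, built from the
	// histogram's per-bucket series.
	query := `histogram_quantile(0.95,
	  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le, ingress))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```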