Prometheus [Input] plugin - Optimizing for bigger Kubernetes clusters (500+ pods) when scraping through 'monitor_kubernetes_pods' #8705
Labels
area/prometheus, feat, plugin/input
Feature Request
Current behavior:
Currently, when 'monitor_kubernetes_pods = true', Telegraf watches the Kubernetes API server for pods carrying the scrape annotations and scrapes them as they come and go (in all namespaces or in the specified namespaces). This works for smaller clusters, but it does not scale for larger clusters (more than 500 pods), especially when a single Telegraf pod performs the scraping for every pod in the cluster. That single pod is also a single point of failure for annotation-based scraping with the Telegraf Prometheus input plugin.
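For illustration, the annotation-based selection described above can be sketched as a small filter over a pod list. This is a simplified sketch, not Telegraf's actual code; the dictionary shapes mirror Kubernetes pod objects, and the `prometheus.io/*` annotation keys are the conventional ones the plugin watches for. The node filter shown here is what the proposal below would add:

```python
def select_scrape_targets(pods, node_name):
    """Return scrape URLs for annotated pods scheduled on node_name.

    `pods` is a list of Kubernetes pod objects (as dicts). Pods are
    selected when they carry the prometheus.io/scrape=true annotation;
    port and path fall back to the plugin's conventional defaults.
    """
    targets = []
    for pod in pods:
        # Node filter: the behavior the proposed local mode would add.
        if pod["spec"].get("nodeName") != node_name:
            continue
        annotations = pod["metadata"].get("annotations", {})
        if annotations.get("prometheus.io/scrape") != "true":
            continue
        port = annotations.get("prometheus.io/port", "9102")
        path = annotations.get("prometheus.io/path", "/metrics")
        pod_ip = pod["status"].get("podIP")
        if pod_ip:
            targets.append(f"http://{pod_ip}:{port}{path}")
    return targets
```

With thousands of pods cluster-wide, running this selection once per node over the local kubelet's pod list is far cheaper than one central watcher handling every pod.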
Proposal:
Introduce an additional option (perhaps 'local_mode', or something more intuitive) which, when true, discovers only the pods running on the local node. Telegraf would fetch the pod list for its node from the node's kubelet (instead of watching through the API server as it does today) and scrape the pods carrying the same annotations as today. This requires running Telegraf as a DaemonSet (one instance per node), with each instance scraping only its own node's pods. The option would default to off/false, so current behavior stays backward compatible, and users can turn it on as they see the need.
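A rough sketch of what the configuration could look like. Note that 'local_mode' and 'node_name' are the hypothetical options proposed here, not existing plugin settings; 'monitor_kubernetes_pods' is the existing one. The node name could be injected into each DaemonSet pod via the Kubernetes Downward API:

```toml
[[inputs.prometheus]]
  ## Existing option: discover annotated pods via the API server.
  monitor_kubernetes_pods = true

  ## Proposed (hypothetical) option: restrict discovery to pods on this
  ## node, fetched from the local kubelet instead of the API server.
  # local_mode = true

  ## Proposed (hypothetical) option: the node this Telegraf instance runs
  ## on, e.g. populated from an env var set via the Downward API
  ## (fieldRef: spec.nodeName) in the DaemonSet spec.
  # node_name = "$NODE_NAME"
```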
Desired behavior:
Pod-annotation-based scraping through Telegraf scales as the Kubernetes cluster scales.
Use case:
As Kubernetes becomes the de facto platform for running workloads, most production clusters are growing, and Prometheus metric sources and metrics are widely available. To monitor them through Telegraf, we need a reliable way for Telegraf to collect metrics that scales with the cluster.