Prometheus [Input] plugin - Optimizing for bigger kubernetes clusters (500+ pods) when scraping thru 'monitor_kubernetes_pods' #8705

Closed
vishiy opened this issue Jan 15, 2021 · 4 comments
Labels
area/prometheus feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins

Comments

@vishiy
Contributor

vishiy commented Jan 15, 2021

Feature Request

Prometheus [Input] plugin - Optimizing for bigger kubernetes clusters (500+ pods) when scraping thru 'monitor_kubernetes_pods'

Current behavior:

Currently, when 'monitor_kubernetes_pods=true', Telegraf watches for pods with specific annotations and scrapes their metrics as pods come and go (in all namespaces or in specified namespaces). This approach works for smaller clusters, but it does not scale in bigger clusters (more than 500 pods), especially when a single Telegraf pod does the scraping for every pod in the cluster. That one pod is also a single point of failure for scale and reliability when scraping through pod annotations with the Telegraf Prometheus input plugin.

Proposal:

Introduce an additional option (maybe 'local_mode' or something more intuitive) which, when true, gets ONLY the pods that are running on the local node. Telegraf would fetch the pod list for that node from the node's kubelet (instead of watching pods through the API server as it does today) and scrape the ones that carry the same annotations as today. This requires running Telegraf as a DaemonSet (on every node) in the cluster, so that each instance scrapes the pods on its own node when the option is enabled. The change is backward compatible: the new option is off (false) by default, and users can turn it on as they see the need.
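To make the proposal concrete, here is a minimal sketch of the node-local selection step, not the actual plugin code. It assumes the pod list has already been fetched from the node's kubelet (for example from its `/pods` endpoint; port and authentication depend on the cluster and are left out) and parsed from JSON. The function name `scrape_targets`, the `prometheus.io/*` annotation keys, and the defaults (`http`, port `9102`, path `/metrics`) are assumptions for illustration.

```python
from typing import Any, Dict, List


def scrape_targets(pod_list: Dict[str, Any]) -> List[str]:
    """Build scrape URLs for annotated, running pods in a kubelet pod list.

    `pod_list` is the parsed JSON of a Kubernetes PodList. Because it comes
    from the local kubelet, it only contains pods scheduled on this node.
    Annotation keys and defaults below are assumptions for this sketch.
    """
    targets = []
    for pod in pod_list.get("items", []):
        annotations = pod.get("metadata", {}).get("annotations") or {}
        # Only pods that opt in through the scrape annotation are considered.
        if annotations.get("prometheus.io/scrape") != "true":
            continue
        status = pod.get("status", {})
        # Skip pods that are not running or have no IP assigned yet.
        if status.get("phase") != "Running" or not status.get("podIP"):
            continue
        scheme = annotations.get("prometheus.io/scheme", "http")
        port = annotations.get("prometheus.io/port", "9102")
        path = annotations.get("prometheus.io/path", "/metrics")
        targets.append(f"{scheme}://{status['podIP']}:{port}{path}")
    return targets


# Example input shaped like a (heavily trimmed) kubelet pod list.
example = {
    "items": [
        {
            "metadata": {
                "name": "app",
                "annotations": {
                    "prometheus.io/scrape": "true",
                    "prometheus.io/port": "8080",
                },
            },
            "status": {"phase": "Running", "podIP": "10.0.0.5"},
        },
        # No annotations: ignored.
        {"metadata": {"name": "plain"}, "status": {"phase": "Running", "podIP": "10.0.0.6"}},
    ]
}
print(scrape_targets(example))  # ['http://10.0.0.5:8080/metrics']
```

With one Telegraf instance per node running this kind of selection, each instance handles only its own node's pods, so the scraping load is spread across nodes instead of landing on a single pod.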

Desired behavior:

Pod-annotation-based scraping through Telegraf scales as the k8s cluster scales.

Use case:

As Kubernetes becomes the de facto platform for running workloads, most production clusters are growing, and Prometheus metric sources and metrics are widely available. To monitor them through Telegraf, we need Telegraf to have a reliable way to scale and collect metrics as the cluster grows.

@vishiy vishiy added the feature request Requests for new plugin and for new features to existing plugins label Jan 15, 2021
@ssoroka
Contributor

ssoroka commented Jan 21, 2021

This issue looks good. Do you think instead of a boolean flag it should ask for a list of nodes to query?

Either way I think we'd be in support of this, please do feel free to write up the PR if that's what you are intending.

@vishiy
Contributor Author

vishiy commented Jan 22, 2021

A list of nodes would not be appropriate, as pods keep moving between nodes. We will submit a PR for this. Thanks.

@Hipska Hipska added area/prometheus plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin ready and removed feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin feature request Requests for new plugin and for new features to existing plugins labels Jan 22, 2021
@gracewehner
Contributor

PR is out: #8762. Thanks.

@sjwang90 sjwang90 removed the ready label Jan 29, 2021
@sjwang90
Contributor

Closed in #8762

5 participants