
daskhub: prometheus scraping of worker pods? #2366

Closed
consideRatio opened this issue Mar 16, 2023 · 8 comments · Fixed by #2686

@consideRatio (Member)

I see from our dask-gateway client configuration that we annotate our worker pods to be scraped by Prometheus. Do we actually use that? If not, we should not do this, because I think Prometheus can easily be overloaded with scraping work, especially if someone starts 100-1000 workers on ~100 nodes that also have node metrics to be scraped.

"prometheus.io/scrape": "true",
"prometheus.io/port": "8787",

@consideRatio (Member Author) commented Jun 21, 2023

pangeo-hubs has a prometheus-server with a 20Gi memory request/limit, but it recently ran out of memory again. I couldn't increase it enough, so I wiped it clean by removing the PV and PVC.

pangeo-hubs has n1-highmem-4 machines with 26 GB of memory, compared to 32 GB on n2-highmem-4 machines. If we had n2-highmem-4 we could have increased the pod's request quite a bit more, but we would still run into issues in a matter of time, just like we will after resetting this.

I think pangeo-hubs' heavy use of dask-gateway, where I saw ~200 nodes start up with probably significantly more k8s pods, overloads our prometheus-servers in various ways. Prometheus-server requires a large, unbounded amount of memory when that many metrics have been scraped historically, and it may become unresponsive if it's scraping node-exporters from ~200 nodes and even more dask-worker pods.

If we stopped scraping the worker pods for metrics, we might avoid issues like this to some degree, or maybe not; I'm not sure what metrics they emit. Maybe the overload was from scraping node-exporter on ~200 nodes rather than the dask-worker pods?

@consideRatio (Member Author) commented Jun 21, 2023

Action points

  • Make a decision on whether or not to stop scraping metrics from them
  • (optional) Conclude what metrics dask-worker and dask-scheduler pods expose (see the sketch after this list)
    They are listed in https://distributed.dask.org/en/latest/prometheus.html
  • (optional) Conclude whether we present such metrics via grafana or make use of them in any way
    No, we aren't presenting these metrics in grafana
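
For the second action point, a quick sketch (an assumption about how one might inspect this, not something from the issue) that lists what a dask worker's Prometheus endpoint exposes, assuming the port from the annotation above (8787) has been port-forwarded locally, e.g. with kubectl port-forward:

    # Fetch the worker's /metrics endpoint and print only the metric names,
    # to get an overview of what Prometheus would ingest per worker pod.
    import requests

    resp = requests.get("http://localhost:8787/metrics", timeout=10)
    resp.raise_for_status()

    for line in resp.text.splitlines():
        if line.startswith("# HELP"):
            print(line)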

@consideRatio (Member Author)

I've concluded that prometheus-server is crashing during normal operation by being OOMKilled when it's trying, and failing, to scrape huge amounts of targets, which it does when loads of dask-worker pods are running.

@yuvipanda (Member)

I couldn't increase it enough, so I wiped it clean by removing the PV and PVC.

Does this mean we have lost all usage data for these hubs? In the future, if the PV is full, it can be resized (https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/)
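
For reference, a minimal sketch (not from this thread) of patching a PVC's requested size in place with the official kubernetes Python client; it assumes the StorageClass has allowVolumeExpansion enabled, and the PVC name, namespace, and size below are placeholders:

    # Patch the PVC's storage request; Kubernetes then expands the underlying
    # volume if the StorageClass allows expansion.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    patch = {"spec": {"resources": {"requests": {"storage": "200Gi"}}}}
    core.patch_namespaced_persistent_volume_claim(
        name="prometheus-server",  # placeholder PVC name
        namespace="support",       # placeholder namespace
        body=patch,
    )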

@consideRatio (Member Author)

Yes, we lost all historic metrics collected for these hubs. I believed startup failed because it was running out of memory during startup, due to the amount of metrics stored in the PV.

Now I'm not so sure that was the failure, since I now know prometheus-server gets OOMKilled during normal operation when ~200 nodes are started, exposing 200 node-exporter pods and even more dask-worker pods.

@consideRatio (Member Author)

@yuvipanda hmmm, wait, maybe it's not correct that we lost the data. The PV had a Retain reclaim policy, I think, so the actual disk was retained somewhere when I deleted the PV, and we got a new one?

@yuvipanda (Member)

@consideRatio if possible, we should try to recover that, as without it, reporting on how much the hub is used (and possibly justifying further usage) is impossible :(

@consideRatio (Member Author)

@yuvipanda hmmm, I was about to look into the cloud console and inspect what ID the disk had, but it's pangeo-hubs, and I don't have access =/ I opened #2688
