daskhub: prometheus scraping of worker pods? #2366
pangeo-hubs have a prometheus-server with a 20Gi memory request/limit, but it recently ran out of memory again. I couldn't increase it much further, so I wiped it clean by removing the PV and PVC. pangeo-hubs run on n1-highmem-4 machines with 26 GB of memory, compared to n2-highmem-4 machines with 32 GB. With n2-highmem-4 we could have increased the pod request quite a bit more, but we would still run into issues in a matter of time, just as we will after this reset.

I think pangeo-hubs' heavy use of dask-gateway, where I saw ~200 nodes start up with probably significantly more k8s pods, overloads our prometheus-servers in various ways. Prometheus-server requires a large, unbounded amount of memory when that many metrics have been scraped historically, and it may become unresponsive while scraping node-exporters on 200 nodes plus even more dask-worker pods. If we stopped scraping the worker pods for metrics, we might avoid issues like this to some degree, but maybe not; I'm not sure what metrics they emit. Maybe the overload was from scraping node-exporter on ~200 nodes rather than from the dask worker pods?
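If we did want to stop scraping worker pods, one way would be a `drop` relabel rule in prometheus's pod scrape job. This is only a sketch: the job name follows the prometheus helm chart's convention, and the `app.kubernetes.io/component: dask-worker` label is an assumption about how dask-gateway labels its worker pods, not verified against the repo.

```yaml
# Sketch only: drop dask worker pods from pod-level scraping via relabeling.
# The worker-pod label used here is an assumption, not confirmed config.
- job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
      regex: dask-worker
      action: drop   # exclude matching pods before scraping
```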
Action points
I've concluded that prometheus-server is crashing during normal operation by being OOMKilled.
Does this mean we have lost all usage data for these hubs? In the future, if the PV is full, it can be resized (https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/).
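For reference, such a resize can be done by patching the PVC's storage request, assuming the StorageClass has `allowVolumeExpansion: true`. The PVC name and namespace below are hypothetical, not taken from the deployment:

```shell
# Sketch only: "prometheus-server" and "support" are assumed PVC/namespace
# names. Requires a StorageClass with allowVolumeExpansion: true.
kubectl patch pvc prometheus-server \
  --namespace support \
  --type merge \
  --patch '{"spec": {"resources": {"requests": {"storage": "40Gi"}}}}'

# Watch the expansion progress via the PVC's events and conditions
kubectl describe pvc prometheus-server --namespace support
```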
Yes, we lost all historic metrics collected for these hubs. I believed startup failed because prometheus-server ran out of memory while loading the large amount of metrics stored in the PV. Now I'm not so sure that was the failure; I now know prometheus-server is OOMKilled during normal operation when ~200 nodes start up, exposing 200 node-exporter pods and even more dask-worker pods.
@yuvipanda hmm, wait, maybe it's not correct that we lost the data. The PV had
@consideRatio if possible, we should try to recover that; without it, reporting on how much the hub is used (and possibly justifying further usage) is impossible :(
@yuvipanda hmm, so I was about to look in the cloud console and inspect the disk's ID etc., but it's pangeo-hubs and I don't have access =/ I opened #2688
I see from our dask-gateway client configuration that we annotate our worker pods to be scraped. Is that used by us? If not, we should not do this, because prometheus can easily be overloaded with scraping work, especially if someone starts 100-1000 workers across ~100 nodes that also have node metrics to be scraped.
infrastructure/helm-charts/daskhub/values.yaml
Lines 189 to 190 in cf7bfc3
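For context, prometheus's conventional pod scrape job keys off `prometheus.io/*` annotations. The referenced lines likely set annotations along these lines; this is an illustrative sketch, and both the values.yaml path and the port value are assumptions, not the exact repo content:

```yaml
# Sketch only: pod annotations commonly used by prometheus-server's
# pod scrape job. Path and port are assumptions, not copied from the repo.
dask-gateway:
  gateway:
    backend:
      worker:
        extraPodConfig:
          metadata:
            annotations:
              prometheus.io/scrape: "true"   # opt the worker pod in to scraping
              prometheus.io/port: "8787"     # assumed metrics port
```

Removing the `prometheus.io/scrape: "true"` annotation (or setting it to `"false"`) would be the least invasive way to stop worker-pod scraping if these metrics are unused.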