
daskhub: prometheus scraping of worker pods? #2366

Closed
consideRatio opened this issue Mar 16, 2023 · 8 comments · Fixed by #2686

@consideRatio (Member)

I see from our dask-gateway client configuration that we annotate our worker pods to be scraped by Prometheus. Do we actually use that? If not, we should not do this, because I think Prometheus can easily be overloaded with scraping work, especially if someone starts 100-1000 workers on ~100 nodes that also have node metrics to be scraped.

"prometheus.io/scrape": "true",
"prometheus.io/port": "8787",

@consideRatio (Member Author) commented Jun 21, 2023

pangeo-hubs has a prometheus-server with a 20Gi memory request/limit, but it recently ran out of memory again. I couldn't increase it enough, so I wiped it clean by removing the PV and PVC.

pangeo-hubs has n1-highmem-4 machines with 26 GB of memory, compared to 32 GB on n2-highmem-4 machines. If we had n2-highmem-4 we could have increased the pod's request quite a bit more, but we would still run into issues in a matter of time, just like we will after resetting this.

I think pangeo-hubs' heavy use of dask-gateway, where I saw ~200 nodes start up with probably significantly more k8s pods, overloads our prometheus-servers in various ways. Prometheus-server requires a large, unbounded amount of memory when that many metrics have been scraped historically, and it may become unresponsive if it's scraping node-exporters from ~200 nodes and even more dask-worker pods.

If we stopped scraping the worker pods for metrics, we might avoid issues like this to some degree, or maybe not; I'm not sure what metrics they emit. Maybe the overload was from scraping node-exporter on ~200 nodes rather than the dask-worker pods?

@consideRatio (Member Author) commented Jun 21, 2023

Action points

  • Make a decision on whether or not to stop scraping metrics from them
  • (optional) Conclude what metrics dask-worker and dask-scheduler pods expose (see the sketch after this list)
    They are listed in https://distributed.dask.org/en/latest/prometheus.html
  • (optional) Conclude whether we present such metrics via grafana or make use of them in any way
    No, we aren't presenting these metrics in grafana
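
For the second action point, a quick sketch (an assumption about how one might inspect this, not something from the issue) that lists what a dask worker's Prometheus endpoint exposes, assuming the port from the annotation above (8787) has been port-forwarded locally, e.g. with kubectl port-forward:

    # Fetch the worker's /metrics endpoint and print only the metric names,
    # to get an overview of what Prometheus would ingest per worker pod.
    import requests

    resp = requests.get("http://localhost:8787/metrics", timeout=10)
    resp.raise_for_status()

    for line in resp.text.splitlines():
        if line.startswith("# HELP"):
            print(line)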

@consideRatio (Member Author)

I've concluded that prometheus-server is crashing during normal operation by being OOMKilled when it's trying, and failing, to scrape huge amounts of targets, which it does when loads of dask-worker pods are running.

@yuvipanda (Member)

I couldn't increase it enough, so I wiped it clean by removing the PV and PVC.

Does this mean we have lost all usage data for these hubs? In the future, if the PV is full, it can be resized (https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/)
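
For reference, a minimal sketch (not from this thread) of patching a PVC's requested size in place with the official kubernetes Python client; it assumes the StorageClass has allowVolumeExpansion enabled, and the PVC name, namespace, and size below are placeholders:

    # Patch the PVC's storage request; Kubernetes then expands the underlying
    # volume if the StorageClass allows expansion.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    patch = {"spec": {"resources": {"requests": {"storage": "200Gi"}}}}
    core.patch_namespaced_persistent_volume_claim(
        name="prometheus-server",  # placeholder PVC name
        namespace="support",       # placeholder namespace
        body=patch,
    )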

@consideRatio (Member Author)

Yes, we lost all historic metrics collected for these hubs. I believed startup failed because it was running out of memory during startup, due to the amount of metrics stored in the PV.

Now I'm not so sure that was the failure, since I now know prometheus-server gets OOMKilled during normal operation when ~200 nodes are started, exposing 200 node-exporter pods and even more dask-worker pods.

@consideRatio (Member Author)

@yuvipanda hmmm, wait, maybe it's not correct that we lost the data. The PV had a Retain reclaim policy, I think, so the actual disk was retained somewhere when I deleted the PV, and we got a new one?

@yuvipanda (Member)

@consideRatio if possible, we should try to recover that, as without it, reporting on how much the hub is used (and possibly justifying further usage) is impossible :(

@consideRatio (Member Author)

@yuvipanda hmmm, I was about to look into the cloud console and inspect what ID the disk had, but it's pangeo-hubs, and I don't have access =/ I opened #2688
