
Increase memory & disk size limit for pangeo prometheus #1906

Merged
merged 1 commit into 2i2c-org:master on Nov 14, 2022

Conversation

yuvipanda
Member

While investigating why the pangeo hubs' Prometheus was dead, I discovered prometheus/prometheus#6934: there can be a big temporary *spike* in memory usage while Prometheus recovers from a restart. It reads back the WAL (the write-ahead log) to make sure it hasn't lost any data during the restart itself. The details of the WAL are unimportant here; what matters is that Prometheus spikes its memory usage on restarts!

I hand-edited the Prometheus deployment with `k -n support edit deployment support-prometheus-server` and gave it a higher memory limit (8G). Then I watched actual memory usage with `watch kubectl -n support top pod`, and noticed that it momentarily spiked to almost 5G before settling back to about 1.5G. The old memory limit was 4G, so during the spike the server got OOM-killed, and then entered CrashLoopBackOff, as it could never survive the replay.
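
For context, the live edit just bumps the memory limit on the Prometheus container in the deployment spec. A minimal sketch of the relevant stanza (the container name, nesting, and everything other than the limit value are illustrative, not copied from the actual manifest):

```yaml
# Sketch of the stanza edited via `k -n support edit deployment support-prometheus-server`.
# Only the memory limit change is the point; surrounding fields are illustrative.
spec:
  template:
    spec:
      containers:
        - name: prometheus-server   # assumed container name
          resources:
            limits:
              memory: 8Gi   # raised from 4Gi so the WAL replay spike (~5Gi) doesn't OOM-kill the pod
```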

This commit raises the memory limit, so it won't keep crashing :)

I also manually increased the size of the disk (with `kubectl -n support edit pvc`), but that wasn't the problem. We still need to persist that change regardless, so it's included here too.
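
Both changes need to land in the support chart's values so they survive the next deploy. A rough sketch of what that looks like, assuming the upstream prometheus chart's `server.resources` and `server.persistentVolume` keys are exposed under a `prometheus:` block (the nesting and exact numbers here are assumptions, not the literal diff):

```yaml
# Illustrative only: key names follow the upstream prometheus Helm chart;
# the real values file in this repo may nest or size things differently.
prometheus:
  server:
    resources:
      limits:
        memory: 8Gi      # old limit was 4Gi; WAL replay briefly needs ~5Gi
    persistentVolume:
      size: 100Gi        # placeholder; should match the manually-resized PVC
```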

Hopefully this will fix #1843

yuvipanda requested a review from a team on November 11, 2022 at 23:07
@sgibson91
Member

@yuvipanda can we document the manual debug steps here please? https://infrastructure.2i2c.org/en/latest/sre-guide/common-problems-solutions.html In case this happens on another cluster in the future.

@yuvipanda
Member Author

will do, @sgibson91!

yuvipanda merged commit 267e949 into 2i2c-org:master on Nov 14, 2022
@github-actions

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/3465810690

Successfully merging this pull request may close these issues.

[Incident] Prometheus is down in pangeo-hubs