
Increase memory & disk size limit for pangeo prometheus #1906

Merged
merged 1 commit into 2i2c-org:master on Nov 14, 2022

Conversation

yuvipanda
Member

While investigating why the pangeo hubs' Prometheus was dead, I discovered prometheus/prometheus#6934: there can be a big temporary *spike* in memory usage while Prometheus recovers from a restart. It reads back the WAL (the write-ahead log) to make sure it hasn't lost any data during the restart itself. The details of the WAL are unimportant here; what matters is that Prometheus spikes its memory usage on restarts!

I hand-edited the Prometheus deployment with `k -n support edit deployment support-prometheus-server` and gave it a higher memory limit (8G). Then I watched actual memory usage with `watch kubectl -n support top pod`, and noticed that it momentarily spiked to almost 5G before settling back to about 1.5G. The old memory limit was 4G, so during the spike the server got OOM-killed, and then entered CrashLoopBackOff, as it could never survive the replay.
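
For context, the live edit just bumps the memory limit on the Prometheus container in the deployment spec. A minimal sketch of the relevant stanza (the container name, nesting, and everything other than the limit value are illustrative, not copied from the actual manifest):

```yaml
# Sketch of the stanza edited via `k -n support edit deployment support-prometheus-server`.
# Only the memory limit change is the point; surrounding fields are illustrative.
spec:
  template:
    spec:
      containers:
        - name: prometheus-server   # assumed container name
          resources:
            limits:
              memory: 8Gi   # raised from 4Gi so the WAL replay spike (~5Gi) doesn't OOM-kill the pod
```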

This commit raises the memory limit, so it won't keep crashing :)

I also manually increased the size of the disk (with `kubectl -n support edit pvc`), but that wasn't the problem. We still need to persist that change regardless, so it's included here too.
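
Both changes need to land in the support chart's values so they survive the next deploy. A rough sketch of what that looks like, assuming the upstream prometheus chart's `server.resources` and `server.persistentVolume` keys are exposed under a `prometheus:` block (the nesting and exact numbers here are assumptions, not the literal diff):

```yaml
# Illustrative only: key names follow the upstream prometheus Helm chart;
# the real values file in this repo may nest or size things differently.
prometheus:
  server:
    resources:
      limits:
        memory: 8Gi      # old limit was 4Gi; WAL replay briefly needs ~5Gi
    persistentVolume:
      size: 100Gi        # placeholder; should match the manually-resized PVC
```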

Hopefully this will fix #1843

yuvipanda requested a review from a team on November 11, 2022 at 23:07
@sgibson91
Member

@yuvipanda can we document the manual debug steps here please? https://infrastructure.2i2c.org/en/latest/sre-guide/common-problems-solutions.html In case this happens on another cluster in the future.

@yuvipanda
Member Author

will do, @sgibson91!

yuvipanda merged commit 267e949 into 2i2c-org:master on Nov 14, 2022
@github-actions

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/3465810690

Successfully merging this pull request may close these issues.

[Incident] Prometheus is down in pangeo-hubs