[Incident] Prometheus is down in pangeo-hubs #1843
Some notes from @yuvipanda's investigation in Slack:

Stage 1:
Stage 2:
@yuvipanda can you confirm whether you followed the steps outlined in the last item? It is not clear to me from the description in Slack. Also, feel free to edit anything I have written here if I am not capturing things properly. What would be the next steps here? In terms of prioritization, I think this is high priority, but not critical... at least not for now. Thoughts?
Comment from @yuvipanda in Slack:

> While investigating why the pangeo hubs prometheus was dead, I discovered prometheus/prometheus#6934, where there can be a big temporary *spike* in memory usage when prometheus is recovering from a restart. It reads the WAL (the write-ahead log) to make sure it hasn't lost any data during the restart itself. The details of the WAL are unimportant in this specific case; what matters is that prometheus spikes its memory usage on restarts!
>
> I manually edited the prometheus deployment with `kubectl -n support edit deployment support-prometheus-server` and gave it a higher limit (8G). Then I watched actual memory usage with `watch kubectl -n support top pod` and noticed that it momentarily spiked to almost 5G before settling back to about 1.5G. The old memory limit was 4G, so during the spike the server gets killed, and it then enters CrashLoopBackOff, as it can never survive the replay.
>
> This commit raises the memory limit so it won't keep crashing :) I also manually increased the size of the disk (with `kubectl -n support edit pvc`), but that wasn't the problem. However, we need to persist that change regardless, so it is included here. Hopefully this will fix 2i2c-org#1843
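For reference, a minimal sketch of what persisting that limit could look like in the support chart's Helm values. The key paths below assume the upstream prometheus chart layout (`server.resources`) nested under a `prometheus` block; the exact nesting in this repo is an assumption, not copied from its config:

```yaml
# Sketch only: raise the prometheus server memory limit so the WAL-replay
# spike on restart stays under it, instead of triggering an OOM kill.
prometheus:
  server:
    resources:
      requests:
        memory: 2Gi   # steady-state usage settled around 1.5G
      limits:
        memory: 8Gi   # the replay spike was observed near 5G; the old 4G limit was too low
```

After a change like this, `watch kubectl -n support top pod` (as in the quote above) should show the pod riding out the replay spike rather than being killed and entering CrashLoopBackOff.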
@damianavila more detailed fix in #1906
Summary
Impact on users
This is not impacting users, but we have zero visibility into usage on the pangeo-hubs.
Important information
Tasks and updates
After-action report template