[Incident] Prometheus is down in pangeo-hubs #1843

Closed
damianavila opened this issue Nov 1, 2022 · 3 comments · Fixed by #1906

damianavila commented Nov 1, 2022

Summary

Impact on users

This is not impacting the users, but we have zero visibility into usage on the pangeo-hubs.

Important information

  • Hub URL: us-central1-b.gcp.pangeo.io
  • Support ticket ref: No support ticket, AFAIK this one was caught by the uptime check.

Tasks and updates

  • Discuss and address incident, leaving comments below with updates
  • Incident has been dealt with or is over
  • Copy/paste the after-action report below and fill in relevant sections
  • Incident title is discoverable and accurate
  • All actionable items in report have linked GitHub Issues
After-action report template
# After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should take.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in {{ most convenient timezone }}.

- {{ yyyy-mm-dd }} - [Summary of first update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of another update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of final update](link to comment)


## What went wrong

_Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items._

- Thing one
- Thing two

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

- Thing one
- Thing two

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`, they can be in other repositories._

### Process improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Documentation improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Technical improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]
damianavila changed the title from "[Incident] Prometheus is down in pangeo cluster" to "[Incident] Prometheus is down in pangeo-hubs" on Nov 1, 2022

damianavila commented Nov 1, 2022

Some notes from @yuvipanda's investigation in Slack:

Stage 1:

  • It's been dead for months at least: the readiness probe fails and the server pod gets restarted.
  • Deleted /data/wal and restarted the pod to see if that helps (see the command sketch after this list).
    • WAL is the Write-Ahead Log, and is used to recover from crashes.
    • Clearly it isn't helping us here.
  • That worked!
    • We're still missing data for a long stretch, but the Prometheus server should be back up now.
  • So I think we ran into some kind of bug in Prometheus that is triggered on WAL recovery in some cases and caused it to keep restarting.
    • I ran `kubectl top pod -n support` and found memory usage well under the 2G limit, so the problem isn't that it's running out of RAM.
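
A rough sketch of what that "delete the WAL and restart" step could look like with kubectl. The deployment and container names here are assumptions, borrowed from the `support-prometheus-server` deployment mentioned later in this thread; the `/data` mount path comes from the notes above.

```bash
# Hypothetical names: deployment and container are assumptions.
# Remove the write-ahead log from the Prometheus data volume:
kubectl -n support exec deploy/support-prometheus-server -c prometheus-server \
  -- rm -rf /data/wal

# Restart the server pod so it starts cleanly without replaying the WAL:
kubectl -n support rollout restart deployment support-prometheus-server
```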

Stage 2:

  • This recurred.
  • I deleted the deployment and redeployed, to see if it was a mix-up in the readiness rules?
    • Well, that didn't help.
  • So I'm going to just kill the disk (see the sketch after this list) by:
      1. editing the PV to set the reclaim policy to 'Retain' (so raw data is preserved),
      2. deleting the PVC,
      3. deleting the pod so the PVC is released,
      4. redeploying to get a fresh disk
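
A hedged sketch of those four steps. The PVC name and pod labels are placeholders; the real ones would come from `kubectl -n support get pvc` and `kubectl -n support get pod --show-labels`.

```bash
# Hypothetical names; look up the real PVC first with `kubectl -n support get pvc`.
PVC=support-prometheus-server
PV=$(kubectl -n support get pvc "$PVC" -o jsonpath='{.spec.volumeName}')

# 1. Keep the underlying disk around after the PVC is deleted:
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# 2. Delete the PVC (it will hang in "Terminating" while a pod still mounts it):
kubectl -n support delete pvc "$PVC"

# 3. Delete the pod so the PVC is actually released (label selector is an assumption):
kubectl -n support delete pod -l app=prometheus,component=server

# 4. Redeploy the support chart so a fresh PVC (and disk) gets created.
```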

@yuvipanda can you confirm if you followed the steps outlined in the last item? It is not clear to me from the description in Slack. Also, feel free to edit anything I have written here if I am not capturing things properly.

What would be the next steps here? In terms of prioritization, I think this is high priority, but not critical... at least not for now. Thoughts?

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Nov 11, 2022
While investigating why the pangeo hubs prometheus was dead,
I discovered prometheus/prometheus#6934 -
where there can be a big temporary *spike* in memory usage when
prometheus is recovering from a restart. It tries to read the WAL
(the write-ahead log), to make sure it hasn't lost any data during
the restart process itself. The details of the WAL don't matter much
in this specific case; what matters is that prometheus spikes its
memory usage on restarts!

I manually edited the prometheus deployment with
`k -n support edit deployment support-prometheus-server`
and gave it a higher limit (8G). Then I watched actual memory usage
with `watch kubectl -n support top pod` and noticed that it momentarily
spiked to almost 5G before settling back to about 1.5G. The old memory
limit was 4G, so during the spike the server gets killed! It then
enters CrashLoopBackOff, since it can never survive the restart.

This commit raises the memory limit, so it won't keep crashing :)

I also manually increased the size of the disk (with
`kubectl -n support edit pvc`), but that wasn't the problem. However,
we need to persist that change regardless, so here it is.

Hopefully this will fix 2i2c-org#1843
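
For reference, a non-interactive sketch of the manual steps described in the commit message above. The container name is an assumption, and the persisted values change itself lives in #1906.

```bash
# Hypothetical equivalent of `kubectl -n support edit deployment ...`;
# the container name is an assumption.
kubectl -n support set resources deployment support-prometheus-server \
  --containers=prometheus-server --limits=memory=8Gi

# Watch memory through a restart to confirm the WAL-replay spike stays under the new limit:
watch kubectl -n support top pod
```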
damianavila commented:

Comment from @yuvipanda in Slack:

I was wondering if this was just the disk getting full, so I resized it to 200G. It wasn't that 😞
`/dev/sdb 196.4G 2.9G 193.4G 1% /data`
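
A hedged sketch of that resize. The PVC and container names are assumptions, and expansion only works if the StorageClass allows it.

```bash
# Hypothetical non-interactive equivalent of `kubectl -n support edit pvc`;
# the PVC and container names are assumptions.
kubectl -n support patch pvc support-prometheus-server \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Confirm the filesystem grew inside the pod:
kubectl -n support exec deploy/support-prometheus-server -c prometheus-server -- df -h /data
```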

yuvipanda commented:

@damianavila more detailed fix in #1906

GeorgianaElena pushed a commit to GeorgianaElena/pilot-hubs that referenced this issue Nov 15, 2022