[Incident] Prometheus is down in pangeo-hubs #1843

Closed
damianavila opened this issue Nov 1, 2022 · 3 comments · Fixed by #1906

damianavila commented Nov 1, 2022

Summary

Impact on users

This is not impacting the users, but we have zero visibility into usage on the pangeo-hubs.

Important information

  • Hub URL: us-central1-b.gcp.pangeo.io
  • Support ticket ref: No support ticket, AFAIK this one was caught by the uptime check.

Tasks and updates

  • Discuss and address incident, leaving comments below with updates
  • Incident has been dealt with or is over
  • Copy/paste the after-action report below and fill in relevant sections
  • Incident title is discoverable and accurate
  • All actionable items in report have linked GitHub Issues
After-action report template
# After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should take.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in {{ most convenient timezone }}.

- {{ yyyy-mm-dd }} - [Summary of first update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of another update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of final update](link to comment)


## What went wrong

_Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items._

- Thing one
- Thing two

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

- Thing one
- Thing two

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`, they can be in other repositories._

### Process improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Documentation improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Technical improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]
damianavila changed the title from "[Incident] Prometheus is down in pangeo cluster" to "[Incident] Prometheus is down in pangeo-hubs" on Nov 1, 2022

damianavila commented Nov 1, 2022

Some notes from @yuvipanda's investigation in Slack:

Stage 1:

  • It's been dead for months at least: the readiness probe fails and the server pod gets restarted.
  • Deleted /data/wal and restarted the pod to see if that helps (see the command sketch after this list).
    • WAL is the Write-Ahead Log, and is used to recover from crashes.
    • Clearly it isn't helping us here.
  • That worked!
    • We're still missing data for a long stretch, but the Prometheus server should be back up now.
  • So I think we ran into some kind of bug in Prometheus that is triggered on WAL recovery in some cases and caused it to keep restarting.
    • I ran `kubectl top pod -n support` and found memory usage well under the 2G limit, so the problem isn't that it's running out of RAM.
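
A rough sketch of what that "delete the WAL and restart" step could look like with kubectl. The deployment and container names here are assumptions, borrowed from the `support-prometheus-server` deployment mentioned later in this thread; the `/data` mount path comes from the notes above.

```bash
# Hypothetical names: deployment and container are assumptions.
# Remove the write-ahead log from the Prometheus data volume:
kubectl -n support exec deploy/support-prometheus-server -c prometheus-server \
  -- rm -rf /data/wal

# Restart the server pod so it starts cleanly without replaying the WAL:
kubectl -n support rollout restart deployment support-prometheus-server
```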

Stage 2:

  • This recurred.
  • I deleted the deployment and redeployed, to see if it was a mix-up in the readiness rules?
    • Well, that didn't help.
  • So I'm going to just kill the disk (see the sketch after this list) by:
      1. editing the PV to set the reclaim policy to 'Retain' (so raw data is preserved),
      2. deleting the PVC,
      3. deleting the pod so the PVC is released,
      4. redeploying to get a fresh disk
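
A hedged sketch of those four steps. The PVC name and pod labels are placeholders; the real ones would come from `kubectl -n support get pvc` and `kubectl -n support get pod --show-labels`.

```bash
# Hypothetical names; look up the real PVC first with `kubectl -n support get pvc`.
PVC=support-prometheus-server
PV=$(kubectl -n support get pvc "$PVC" -o jsonpath='{.spec.volumeName}')

# 1. Keep the underlying disk around after the PVC is deleted:
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# 2. Delete the PVC (it will hang in "Terminating" while a pod still mounts it):
kubectl -n support delete pvc "$PVC"

# 3. Delete the pod so the PVC is actually released (label selector is an assumption):
kubectl -n support delete pod -l app=prometheus,component=server

# 4. Redeploy the support chart so a fresh PVC (and disk) gets created.
```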

@yuvipanda can you confirm if you followed the steps outlined in the last item? It is not clear to me from the description in Slack. Also, feel free to edit anything I have written here if I am not capturing things properly.

What would be the next steps here? In terms of prioritization, I think this is high priority, but not critical... at least not for now. Thoughts?

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Nov 11, 2022
While investigating why the pangeo hubs prometheus was dead,
I discovered prometheus/prometheus#6934 -
where there can be a big temporary *spike* in memory usage when
prometheus is recovering from a restart. It tries to read the WAL
(the write-ahead log), to make sure it hasn't lost any data during
the restart process itself. The details of the WAL don't matter much
in this specific case; what matters is that prometheus spikes its
memory usage on restarts!

I manually edited the prometheus deployment with
`k -n support edit deployment support-prometheus-server`
and gave it a higher limit (8G). Then I watched actual memory usage
with `watch kubectl -n support top pod` and noticed that it momentarily
spiked to almost 5G before settling back to about 1.5G. The old memory
limit was 4G, so during the spike the server gets killed! It then
enters CrashLoopBackOff, since it can never survive the restart.

This commit raises the memory limit, so it won't keep crashing :)

I also manually increased the size of the disk (with
`kubectl -n support edit pvc`), but that wasn't the problem. However,
we need to persist that change regardless, so here it is.

Hopefully this will fix 2i2c-org#1843
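
For reference, a non-interactive sketch of the manual steps described in the commit message above. The container name is an assumption, and the persisted values change itself lives in #1906.

```bash
# Hypothetical equivalent of `kubectl -n support edit deployment ...`;
# the container name is an assumption.
kubectl -n support set resources deployment support-prometheus-server \
  --containers=prometheus-server --limits=memory=8Gi

# Watch memory through a restart to confirm the WAL-replay spike stays under the new limit:
watch kubectl -n support top pod
```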
damianavila commented:

Comment from @yuvipanda in Slack:

I was wondering if this was just the disk getting full, so I resized it to 200G. It wasn't that 😞
`/dev/sdb 196.4G 2.9G 193.4G 1% /data`
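
A hedged sketch of that resize. The PVC and container names are assumptions, and expansion only works if the StorageClass allows it.

```bash
# Hypothetical non-interactive equivalent of `kubectl -n support edit pvc`;
# the PVC and container names are assumptions.
kubectl -n support patch pvc support-prometheus-server \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Confirm the filesystem grew inside the pod:
kubectl -n support exec deploy/support-prometheus-server -c prometheus-server -- df -h /data
```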

yuvipanda commented:

@damianavila more detailed fix in #1906

GeorgianaElena pushed a commit to GeorgianaElena/pilot-hubs that referenced this issue Nov 15, 2022