[Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason #2126

Closed
consideRatio opened this issue Feb 1, 2023 · 6 comments

consideRatio commented Feb 1, 2023

Summary

Impact on users

Important information

Tasks and updates

  • Discuss and address incident, leaving comments below with updates
  • Incident has been dealt with or is over
  • Copy/paste the after-action report below and fill in relevant sections
  • Incident title is discoverable and accurate
  • All actionable items in report have linked GitHub Issues
After-action report template
# After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should take.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in {{ most convenient timezone }}.

- {{ yyyy-mm-dd }} - [Summary of first update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of another update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of final update](link to comment)


## What went wrong

_Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items._

- Thing one
- Thing two

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

- Thing one
- Thing two

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`, they can be in other repositories._

### Process improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Documentation improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Technical improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

consideRatio commented Feb 1, 2023

Grafana reports

(Grafana dashboard screenshots)

Observations

Both the ingress-nginx pod and the proxy pod had run reliably for a long time. The hub pod, however, had been restarted. I didn't get to verify whether it was related to OOMKilling via `kubectl get event -A | grep OOM` within the 60 minutes that the k8s api-server retains event resources.

My theory is that the hub pod was running on a node where prometheus-server hogged almost all of the memory, which made the hub pod get evicted when the node ran low on memory. So why did that happen just then?

It appears that ~100+ dask-worker nodes were added as a consequence of someone using dask_gateway to start at least ~100+ dask-worker pods. At that point, the KubeSpawner software run by JupyterHub in the hub pod probably got very busy observing what was going on, and ended up consuming a significant amount of additional memory. The node then ran short on memory, and the hub pod was evicted and had to restart.

Hmmm, but it seems that the hub pod actually stayed at a consistent memory level and instead became unresponsive. Maybe it was restarted because the livenessProbe failed 30 times in a row, as would happen if the hub was unresponsive for five minutes...
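
For reference on that arithmetic, a restart would follow from the probe's period multiplied by its failure threshold. Below is a minimal sketch of what the hub container's liveness probe could look like; the specific values and the `/hub/health` path are assumptions to double check against our rendered chart, not a confirmed excerpt.

```yaml
# Sketch only - values assumed from the "30 failures ≈ 5 minutes" estimate above.
livenessProbe:
  httpGet:
    path: /hub/health     # JupyterHub's health endpoint (assumed)
    port: http
  periodSeconds: 10       # probe every 10 seconds
  timeoutSeconds: 3       # each probe attempt times out after 3 seconds
  failureThreshold: 30    # 30 consecutive failures => 30 * 10s = 300s = 5 minutes before restart
```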

I wonder if, with CPU requests of 10 milli-cores (0.01 CPU), we also ended up CPU starved? I'm not sure. We should probably request at least 100m for these pods, as they could otherwise be outcompeted down to less than 1 CPU by pods with higher requests. Several pods have 100m requests and are therefore allocated 10x more CPU than the hub/proxy pods. I opened #2127.
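
As an illustration only (this is not the actual change in #2127, and the exact value keys are assumptions based on z2jh-style helm values), bumping the requests would look roughly like:

```yaml
# Sketch of raising hub/proxy CPU requests from 10m to 100m - key paths assumed,
# see #2127 for the real change.
jupyterhub:
  hub:
    resources:
      requests:
        cpu: 100m        # was 10m (0.01 CPU)
        memory: 128Mi
  proxy:
    chp:
      resources:
        requests:
          cpu: 100m      # was 10m
          memory: 64Mi
```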

Core node 1 / 4

  prod                        proxy-7f5dbcbd68-zqd22                                 10m (0%)      0 (0%)      64Mi (0%)        1Gi (4%)       94d

Core node 2 / 4

  prod                        hub-5db4d78fdc-bwbvv                                   10m (0%)      0 (0%)       128Mi (0%)       2Gi (9%)       60m
  support                     support-prometheus-server-7c6b85d57-jqlfj              3 (76%)       3900m (99%)  20Gi (90%)       22Gi (99%)     73m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests       Limits
  --------                   --------       ------
  cpu                        3473m (88%)    4 (102%)
  memory                     21196Mi (93%)  28408Mi (124%)

100+ nodes

In a matter of a minute, 100 nodes were added, probably along with 100+ dask worker pods.

prod          52m         Warning   FailedScheduling          pod/dask-worker-43a393f19842403fba58e236bddfeeb3-zg648    0/23 nodes are available: 13 node(s) didn't match Pod's node affinity/selector, 22 Insufficient cpu, 23 Insufficient memory, 9 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}. preemption: 0/23 nodes are available: 10 No preemption victims found for incoming pod, 13 Preemption is not helpful for scheduling.
prod          52m         Warning   FailedScheduling          pod/dask-worker-43a393f19842403fba58e236bddfeeb3-zg648    0/126 nodes are available: 14 node(s) didn't match Pod's node affinity/selector, 23 Insufficient cpu, 24 Insufficient memory, 6 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }, 9 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}, 96 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/126 nodes are available: 11 No preemption victims found for incoming pod, 115 Preemption is not helpful for scheduling.
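
As context for the untolerated taint messages above: the user nodes carry a dedicated-purpose taint along the lines of the sketch below (the `NoSchedule` effect is an assumption), and the dask-worker pods don't tolerate it, so the scheduler can only place them on freshly added dask-worker nodes.

```yaml
# Sketch of the taint named in the events above, as it would appear on a user node.
# The NoSchedule effect is assumed, not verified.
apiVersion: v1
kind: Node
spec:
  taints:
    - key: hub.jupyter.org_dedicated
      value: user
      effect: NoSchedule
```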

@consideRatio consideRatio changed the title [Incident] leap clusters prod hub - massive node scalupe, proxy pod evicted [Incident] leap clusters prod hub - massive node scalupe, hub pod restarted for unknown reason Feb 1, 2023
@damianavila

Reference actions already taken. Then we can write the incident report.

@consideRatio consideRatio changed the title [Incident] leap clusters prod hub - massive node scalupe, hub pod restarted for unknown reason [Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason Feb 15, 2023
@damianavila damianavila changed the title [Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason [Non-active][Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason Mar 1, 2023
@damianavila damianavila changed the title [Non-active][Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason [Non active][Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason Mar 8, 2023

pnasrat commented Mar 16, 2023

Triage: Is there any update for the community member who reported the support ticket, or should that be closed? https://2i2c.freshdesk.com/a/tickets/414

@damianavila

I think the ticket should be closed (cc @consideRatio who was involved in the incident).
And this issue needs an incident report before closure, IMHO.


consideRatio commented Mar 16, 2023

2023-02-01 Heavy use of dask-gateway induced critical pod evictions

Timeline

All times in UTC+1

What went wrong

  • I believe various critical pods on core nodes got evicted when prometheus started scraping metrics exporters on ~200 nodes
  • I think it's likely, but I can't say for sure, that the dask scheduler pod would also run into resource limitations with this amount of workers (see the sketch below this list)
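
A minimal sketch of where such per-cluster scheduler resources are typically pinned in dask-gateway-style helm values; the key paths are assumptions and haven't been checked against our deployment:

```yaml
# Sketch only - key paths assumed from dask-gateway-style helm values.
# The scheduler gets a fixed CPU/memory budget per cluster, regardless of how
# many workers (here ~100+) it has to coordinate.
dask-gateway:
  gateway:
    backend:
      scheduler:
        cores:
          request: 1
          limit: 1
        memory:
          request: 2G
          limit: 2G
```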

Follow-up improvements

@consideRatio consideRatio changed the title [Non active][Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason [Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason Mar 16, 2023
@consideRatio

Incident report PR in 2i2c-org/incident-reports#4
