[Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason #2126

Closed
consideRatio opened this issue Feb 1, 2023 · 6 comments

consideRatio commented Feb 1, 2023

Summary

Impact on users

Important information

Tasks and updates

  • Discuss and address incident, leaving comments below with updates
  • Incident has been dealt with or is over
  • Copy/paste the after-action report below and fill in relevant sections
  • Incident title is discoverable and accurate
  • All actionable items in report have linked GitHub Issues
After-action report template
# After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should take.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in {{ most convenient timezone }}.

- {{ yyyy-mm-dd }} - [Summary of first update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of another update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of final update](link to comment)


## What went wrong

_Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items._

- Thing one
- Thing two

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

- Thing one
- Thing two

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`, they can be in other repositories._

### Process improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Documentation improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Technical improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

consideRatio commented Feb 1, 2023

Grafana reports

(Grafana dashboard screenshots)

Observations

Both the ingress-nginx pod and the proxy pod had run reliably for a long time. The hub pod, however, had been restarted. I didn't get to verify whether it was related to OOMKilling via `kubectl get event -A | grep OOM` within the 60 minutes that the k8s api-server retains event resources.

My theory is that the hub pod was running on a node where prometheus-server hogged almost all of the memory, which made the hub pod get evicted when the node ran low on memory. So why did that happen just then?

It appears that ~100+ dask-worker nodes were added as a consequence of someone using dask_gateway to start at least ~100+ dask-worker pods. At that point, the KubeSpawner software run by JupyterHub in the hub pod probably got very busy observing what was going on, and ended up consuming a significant amount of additional memory. The node then ran short on memory, and the hub pod was evicted and had to restart.

Hmmm, but it seems that the hub pod actually stayed at a consistent memory level and instead became unresponsive. Maybe it was restarted because the livenessProbe failed 30 times in a row, as would happen if the hub was unresponsive for five minutes...
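
For reference on that arithmetic, a restart would follow from the probe's period multiplied by its failure threshold. Below is a minimal sketch of what the hub container's liveness probe could look like; the specific values and the `/hub/health` path are assumptions to double check against our rendered chart, not a confirmed excerpt.

```yaml
# Sketch only - values assumed from the "30 failures ≈ 5 minutes" estimate above.
livenessProbe:
  httpGet:
    path: /hub/health     # JupyterHub's health endpoint (assumed)
    port: http
  periodSeconds: 10       # probe every 10 seconds
  timeoutSeconds: 3       # each probe attempt times out after 3 seconds
  failureThreshold: 30    # 30 consecutive failures => 30 * 10s = 300s = 5 minutes before restart
```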

I wonder if, with CPU requests of 10 milli-cores (0.01 CPU), we also ended up CPU starved? I'm not sure. We should probably request at least 100m for these pods, as they could otherwise be outcompeted down to less than 1 CPU by pods with higher requests. Several pods have 100m requests and are therefore allocated 10x more CPU than the hub/proxy pods. I opened #2127.
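
As an illustration only (this is not the actual change in #2127, and the exact value keys are assumptions based on z2jh-style helm values), bumping the requests would look roughly like:

```yaml
# Sketch of raising hub/proxy CPU requests from 10m to 100m - key paths assumed,
# see #2127 for the real change.
jupyterhub:
  hub:
    resources:
      requests:
        cpu: 100m        # was 10m (0.01 CPU)
        memory: 128Mi
  proxy:
    chp:
      resources:
        requests:
          cpu: 100m      # was 10m
          memory: 64Mi
```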

Core node 1 / 4

  prod                        proxy-7f5dbcbd68-zqd22                                 10m (0%)      0 (0%)      64Mi (0%)        1Gi (4%)       94d

Core node 2 / 4

  prod                        hub-5db4d78fdc-bwbvv                                   10m (0%)      0 (0%)       128Mi (0%)       2Gi (9%)       60m
  support                     support-prometheus-server-7c6b85d57-jqlfj              3 (76%)       3900m (99%)  20Gi (90%)       22Gi (99%)     73m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests       Limits
  --------                   --------       ------
  cpu                        3473m (88%)    4 (102%)
  memory                     21196Mi (93%)  28408Mi (124%)

100+ nodes

In a matter of a minute, 100 nodes were added, probably along with 100+ dask worker pods.

prod          52m         Warning   FailedScheduling          pod/dask-worker-43a393f19842403fba58e236bddfeeb3-zg648    0/23 nodes are available: 13 node(s) didn't match Pod's node affinity/selector, 22 Insufficient cpu, 23 Insufficient memory, 9 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}. preemption: 0/23 nodes are available: 10 No preemption victims found for incoming pod, 13 Preemption is not helpful for scheduling.
prod          52m         Warning   FailedScheduling          pod/dask-worker-43a393f19842403fba58e236bddfeeb3-zg648    0/126 nodes are available: 14 node(s) didn't match Pod's node affinity/selector, 23 Insufficient cpu, 24 Insufficient memory, 6 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }, 9 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}, 96 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/126 nodes are available: 11 No preemption victims found for incoming pod, 115 Preemption is not helpful for scheduling.
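
As context for the untolerated taint messages above: the user nodes carry a dedicated-purpose taint along the lines of the sketch below (the `NoSchedule` effect is an assumption), and the dask-worker pods don't tolerate it, so the scheduler can only place them on freshly added dask-worker nodes.

```yaml
# Sketch of the taint named in the events above, as it would appear on a user node.
# The NoSchedule effect is assumed, not verified.
apiVersion: v1
kind: Node
spec:
  taints:
    - key: hub.jupyter.org_dedicated
      value: user
      effect: NoSchedule
```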

@consideRatio consideRatio changed the title [Incident] leap clusters prod hub - massive node scalupe, proxy pod evicted [Incident] leap clusters prod hub - massive node scalupe, hub pod restarted for unknown reason Feb 1, 2023
@damianavila

Reference actions already taken. Then we can write the incident report.

@consideRatio consideRatio changed the title [Incident] leap clusters prod hub - massive node scalupe, hub pod restarted for unknown reason [Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason Feb 15, 2023
@damianavila damianavila changed the title [Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason [Non-active][Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason Mar 1, 2023
@damianavila damianavila changed the title [Non-active][Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason [Non active][Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason Mar 8, 2023

pnasrat commented Mar 16, 2023

Triage: Is there any update for the community member who reported the support ticket, or should that be closed? https://2i2c.freshdesk.com/a/tickets/414

@damianavila

I think the ticket should be closed (cc @consideRatio who was involved in the incident).
And this issue needs an incident report before closure, IMHO.


consideRatio commented Mar 16, 2023

2023-02-01 Heavy use of dask-gateway induced critical pod evictions

Timeline

All times in UTC+1

What went wrong

  • I believe various critical pods on core nodes got evicted when prometheus started scraping metrics exporters on ~200 nodes
  • I think it's likely, but I can't say for sure, that the dask scheduler pod would also run into resource limitations with this amount of workers (see the sketch below this list)
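
A minimal sketch of where such per-cluster scheduler resources are typically pinned in dask-gateway-style helm values; the key paths are assumptions and haven't been checked against our deployment:

```yaml
# Sketch only - key paths assumed from dask-gateway-style helm values.
# The scheduler gets a fixed CPU/memory budget per cluster, regardless of how
# many workers (here ~100+) it has to coordinate.
dask-gateway:
  gateway:
    backend:
      scheduler:
        cores:
          request: 1
          limit: 1
        memory:
          request: 2G
          limit: 2G
```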

Follow-up improvements

@consideRatio consideRatio changed the title [Non active][Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason [Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason Mar 16, 2023
@consideRatio

Incident report PR in 2i2c-org/incident-reports#4
