Monitoring EKS/GKE spot instance pre-emption events #2369

Open

consideRatio opened this issue Mar 19, 2023 · 0 comments

consideRatio commented Mar 19, 2023

In this Freshdesk ticket, Julius with LEAP asks for help debugging why SIGKILL (signal 9) is sent to a dask-worker. This GitHub issue is scoped to help us rule out one specific cause of such future failures - that a cheaper spot instance has been pre-empted - by providing a) monitoring for pod evictions and b) a documented way to see if such events have occurred on AWS/EKS and GCP/GKE.

Background

Spot instances, also known as pre-emptible instances, differ from "on demand" instances in that you aren't guaranteed to be able to request them or keep them running. For this reason, they are significantly cheaper.

Two features

Monitoring

I've opened jupyterhub/grafana-dashboards#65 to help us work towards monitoring pod evictions, and I think pods on a pre-empted spot instance node will be terminated via pod evictions.
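As a rough illustration of what such monitoring could build on, here is a minimal sketch that queries a Prometheus server for the kube-state-metrics metric `kube_pod_status_reason` filtered to evicted pods. The Prometheus address is a hypothetical placeholder, and the sketch assumes kube-state-metrics is deployed and scraped - the actual dashboard work is tracked in jupyterhub/grafana-dashboards#65.

```python
# Sketch: count pods currently reporting an "Evicted" status reason,
# assuming kube-state-metrics is scraped by a Prometheus server reachable
# at PROMETHEUS_URL (both are assumptions, not verified here).
import requests

PROMETHEUS_URL = "http://prometheus.example.org"  # hypothetical address

def evicted_pod_count() -> float:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": 'sum(kube_pod_status_reason{reason="Evicted"})'},
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # An empty result means no pod currently has this status reason.
    return float(results[0]["value"][1]) if results else 0.0

if __name__ == "__main__":
    print(f"Evicted pods right now: {evicted_pod_count()}")
```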

Documentation to get more details

I suspect it's relevant to see more details about such an event than just a blip in Grafana indicating a pre-emption - details such as a message explaining why it happened. So even if Grafana provides a counter for how many evictions take place, we may still want to learn more about them when they are observed to happen. I suspect there will be information in the k8s Event resources, which only stay around for about 60 minutes in a k8s cluster, but there may also be logs or notices available outside k8s. Capturing either the k8s Events or the cloud provider details would be fine.

If we can learn how to retroactively inspect k8s Events related to pod evictions, that would be a more general benefit, as pods can be evicted due to manual drains, memory pressure, running out of ephemeral storage, etc.
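As a sketch of what inspecting such Events could look like while they are still retained, here is a small example using the official `kubernetes` Python client. It only lists whatever Event resources are still present, so for truly retroactive inspection the Events would need to be captured and stored somewhere with longer retention.

```python
# Sketch: list recent Kubernetes Event resources about pod evictions,
# using the official "kubernetes" Python client. These Events expire
# (typically after ~1 hour), so this only works shortly after the
# eviction unless the Events are exported for longer retention.
from kubernetes import client, config

def print_eviction_events() -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    events = v1.list_event_for_all_namespaces(field_selector="reason=Evicted")
    for ev in events.items:
        obj = ev.involved_object
        print(f"{ev.last_timestamp} {obj.namespace}/{obj.name}: {ev.message}")

if __name__ == "__main__":
    print_eviction_events()
```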

Two cloud providers to focus on

GCP's GKE

Compute Engine gives you 30 seconds to shut down when you're preempted, letting you save your work in progress for later.

In practice, a non-system Pod on GKE, such as a dask-gateway cluster's worker pod, will have 15 seconds rather than the 30 seconds a standalone VM gets.

A SIGTERM (signal 15) is sent to the pod's containers, and after 15 seconds a SIGKILL (signal 9) is sent, forcefully stopping them. Ideally, the dask-worker being terminated would let the dask-scheduler know about the situation before it terminates, but I'm not sure how that works.
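To make the SIGTERM-then-SIGKILL sequence concrete, here is a minimal sketch of a process that traps SIGTERM and tries to shut down cleanly within the grace period. This only illustrates the mechanism - dask workers have their own shutdown handling, and this sketch does not represent it.

```python
# Sketch: handle SIGTERM and try to finish cleanup before the grace
# period (15 s for a non-system pod on a pre-empted GKE node) runs out,
# after which SIGKILL stops the process without any chance to clean up.
import signal
import sys
import time

def handle_sigterm(signum, frame):
    print("SIGTERM received - saving state and shutting down", flush=True)
    # ... flush buffers, notify a coordinator/scheduler, close connections ...
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

if __name__ == "__main__":
    while True:
        # Placeholder for real work; must be interruptible well within 15 s.
        time.sleep(1)
```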

AWS's EKS

TODO: Provide initial research and background here (anyone is welcome to update this issue!)

Action points

I'm not sure yet - there is a lot of initial investigative work to do. Here are some ideas for action points.

For monitoring:

For documentation:

  • Read up on docs, search the internet, etc., for ways to learn whether a spot instance VM has been removed

Related
