Monitoring EKS/GKE spot instance pre-emption events #2369

Open

consideRatio opened this issue Mar 19, 2023 · 0 comments

consideRatio commented Mar 19, 2023

In this Freshdesk ticket, Julius with LEAP asks for help debugging why SIGKILL (signal 9) is sent to a dask-worker. This GitHub issue is scoped to help us rule out one specific cause of such future failures - that a cheaper spot instance has been pre-empted - by providing a) monitoring for pod evictions and b) a documented way to see if such events have occurred on AWS/EKS and GCP/GKE.

Background

Spot instances, also known as pre-emptible instances, differ from "on demand" instances in that you aren't guaranteed to be able to request them or keep them running. For this reason, they are significantly cheaper.

Two features

Monitoring

I've opened jupyterhub/grafana-dashboards#65 to help us work towards monitoring pod evictions, and I think pods on a pre-empted spot instance node will be terminated via pod evictions.
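As a rough illustration of what such monitoring could build on, here is a minimal sketch that queries a Prometheus server for the kube-state-metrics metric `kube_pod_status_reason` filtered to evicted pods. The Prometheus address is a hypothetical placeholder, and the sketch assumes kube-state-metrics is deployed and scraped - the actual dashboard work is tracked in jupyterhub/grafana-dashboards#65.

```python
# Sketch: count pods currently reporting an "Evicted" status reason,
# assuming kube-state-metrics is scraped by a Prometheus server reachable
# at PROMETHEUS_URL (both are assumptions, not verified here).
import requests

PROMETHEUS_URL = "http://prometheus.example.org"  # hypothetical address

def evicted_pod_count() -> float:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": 'sum(kube_pod_status_reason{reason="Evicted"})'},
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # An empty result means no pod currently has this status reason.
    return float(results[0]["value"][1]) if results else 0.0

if __name__ == "__main__":
    print(f"Evicted pods right now: {evicted_pod_count()}")
```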

Documentation to get more details

I suspect it's relevant to see more details about such an event than just a blip in Grafana indicating a pre-emption - details such as a message explaining why it happened. So even if Grafana provides a counter for how many evictions take place, we may still want to learn more about them when they are observed to happen. I suspect there will be information in the k8s Event resources, which only stay around for about 60 minutes in a k8s cluster, but there may also be logs or notices available outside k8s. Capturing either the k8s Events or the cloud provider details would be fine.

If we can learn how to retroactively inspect k8s Events related to pod evictions, that would be a more general benefit, as pods can be evicted due to manual drains, memory pressure, running out of ephemeral storage, etc.
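As a sketch of what inspecting such Events could look like while they are still retained, here is a small example using the official `kubernetes` Python client. It only lists whatever Event resources are still present, so for truly retroactive inspection the Events would need to be captured and stored somewhere with longer retention.

```python
# Sketch: list recent Kubernetes Event resources about pod evictions,
# using the official "kubernetes" Python client. These Events expire
# (typically after ~1 hour), so this only works shortly after the
# eviction unless the Events are exported for longer retention.
from kubernetes import client, config

def print_eviction_events() -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    events = v1.list_event_for_all_namespaces(field_selector="reason=Evicted")
    for ev in events.items:
        obj = ev.involved_object
        print(f"{ev.last_timestamp} {obj.namespace}/{obj.name}: {ev.message}")

if __name__ == "__main__":
    print_eviction_events()
```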

Two cloud providers to focus on

GCP's GKE

Compute Engine gives you 30 seconds to shut down when you're preempted, letting you save your work in progress for later.

In practice, a non-system Pod on GKE, such as a dask-gateway cluster's worker pod, will have 15 seconds rather than the 30 seconds a standalone VM gets.

A SIGTERM (signal 15) is sent to the pod's containers, and after 15 seconds a SIGKILL (signal 9) is sent, forcefully stopping them. Ideally, the dask-worker being terminated would let the dask-scheduler know about the situation before it terminates, but I'm not sure how that works.
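To make the SIGTERM-then-SIGKILL sequence concrete, here is a minimal sketch of a process that traps SIGTERM and tries to shut down cleanly within the grace period. This only illustrates the mechanism - dask workers have their own shutdown handling, and this sketch does not represent it.

```python
# Sketch: handle SIGTERM and try to finish cleanup before the grace
# period (15 s for a non-system pod on a pre-empted GKE node) runs out,
# after which SIGKILL stops the process without any chance to clean up.
import signal
import sys
import time

def handle_sigterm(signum, frame):
    print("SIGTERM received - saving state and shutting down", flush=True)
    # ... flush buffers, notify a coordinator/scheduler, close connections ...
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

if __name__ == "__main__":
    while True:
        # Placeholder for real work; must be interruptible well within 15 s.
        time.sleep(1)
```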

AWS's EKS

TODO: Provide initial research and background here (anyone is welcome to update this issue!)

Action points

I'm not sure yet - there is a lot of initial investigative work to do. Here are some ideas for action points.

For monitoring:

For documentation:

  • Read up on docs, search the internet, etc., for ways to learn whether a spot instance VM has been removed

Related
