Add the option to terminate pending kubernetes kernels if they have events preventing them from starting #1357

OrenZ1 · 2023-12-27T20:36:53Z

Problem

I am facing a problem when using JEG on kubernetes.
I have set kernel launch timeout to 5 mins (because I am using large images), and set MAX_KERNELS_PER_USER to 2 to prevent spamming of kernels.
When a user submits a request to launch a kernel, it gets started on a remote pod. Sometimes the pod remains stuck in Pending, e.g. due to a current lack of resources. In this case, the user can’t submit a new kernel (with a lower resource demand) and has to wait 5 minutes for the timeout to take effect before using another kernel. I even thought about setting up a service that watches pending kernel pods and, if they have events that prevent them from starting, sends a DELETE request to the gateway to kill the kernel. The problem is that while kernels are pending, the gateway can’t receive DELETE requests for them.
In addition, the kernel is not aware of actions performed on the Kubernetes cluster, so I can’t delete the pods using the Kubernetes API, because JEG would still wait for the timeout for this kernel.

Proposed Solution

For starters, I would expect JEG to have awareness of the Kubernetes cluster it is running on, so that when kernel pods are deleted, it would stop sampling them.
For the other issue I’ve stated, I can see two possible solutions:
The first one (and in my opinion, the easier one) is to allow DELETE requests for kernels that are still pending.
The second one is to allow configuring JEG to kill pending kernels on its own when they have events (or certain events). But this seems a bit trickier to think through properly.

OrenZ1 · 2024-01-25T13:08:43Z

If I can get an update about this request, that would be great.
I will be happy to contribute and add this option, so if you can state the relevant files, I can try to implement this and contribute :)

kevin-bates · 2024-01-25T22:23:36Z

Hi @OrenZ1 - I apologize for the delay. Unfortunately, I'm unable to spend much time on EG (and Jupyter in general) these days.

I think this would be a great addition. Ideally, if we can determine that a Pending state is going to remain pending until the prescribed (and long) timeout, it would be better to abort. The location where we can detect this during the startup sequence is in the KubernetesProcessProxy and the status loop where we could add more intelligence would be here.
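
To make the idea concrete, here is a minimal sketch of a startup polling loop that aborts early instead of waiting out the full launch timeout. This is not EG's actual code: `wait_for_pod_running`, `KernelStartupAborted`, and the `get_phase` callback are hypothetical names used only for illustration. The only Kubernetes facts assumed are the standard pod phases (`Pending`, `Running`, `Succeeded`, `Failed`).

```python
import time
from typing import Callable, Optional


class KernelStartupAborted(RuntimeError):
    """Raised when startup is abandoned before the launch timeout expires."""


def wait_for_pod_running(
    get_phase: Callable[[], Optional[str]],
    launch_timeout: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until the pod is Running and return the seconds spent waiting.

    get_phase is a hypothetical callback that returns the pod phase, or None
    if the pod no longer exists (for example, it was deleted externally).
    """
    start = clock()
    while True:
        phase = get_phase()
        elapsed = clock() - start
        if phase == "Running":
            return elapsed
        if phase is None:
            # The pod is gone, so waiting for the timeout cannot help.
            raise KernelStartupAborted("kernel pod was deleted during startup")
        if phase in ("Failed", "Succeeded"):
            # Terminal phases never transition to Running.
            raise KernelStartupAborted(f"kernel pod reached terminal phase {phase!r}")
        if elapsed >= launch_timeout:
            raise TimeoutError(f"kernel pod still {phase!r} after {launch_timeout}s")
        sleep(poll_interval)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting. In a real deployment, `get_phase` would wrap whatever pod lookup the process proxy already performs.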

I hope you find that helpful but imagine you've probably poked around a bit already so let me know if this isn't what you were looking for.

Thank you for your interest and helping out!

lresende · 2024-01-25T22:49:26Z

There are multiple ways you can go about this:

Configure kernel image pullers to avoid delays in downloading images and reduce the startup timeouts
Configure culling kernels to avoid kernels wasting resources
If this is related to Spark, enable dynamic allocation to help reduce idle resource usage

Also, having what @kevin-bates proposes above would not only help your use case but also fix a file-handlers leak that I have seen in the past.

OrenZ1 · 2024-02-29T14:02:30Z

Hi! Sorry for the delay, but I managed to make a PR for the first thing we've discussed here!
For now, the PR covers the case where the kernel pod dies while still in startup: EG will raise a matching exception to the user, so there is no need to wait for the timeout.

I am still trying to think of a way to handle kernels that are stuck in the Pending state. I hope to make a separate PR for that soon :)
#1370

OrenZ1 · 2024-06-02T14:03:09Z

Just created a new PR, which adds the option to configure different timeouts for different events that occur during startup, including a "0 seconds" timeout, which means the startup will terminate immediately after such an event occurs.
#1383
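
As a rough illustration of the per-event timeout idea (not the PR's actual implementation or option names), the decision could look like the sketch below. `EVENT_TIMEOUTS` and `reason_to_terminate` are hypothetical names, and the listed event reasons and values are only examples.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical configuration: seconds to tolerate each event reason during
# startup. 0 means "terminate as soon as the event is seen". Reasons that are
# not listed fall back to the normal kernel launch timeout.
EVENT_TIMEOUTS: Dict[str, float] = {
    "FailedScheduling": 30.0,
    "FailedMount": 0.0,
}


def reason_to_terminate(
    events: Iterable[Tuple[str, float]],
    now: float,
    event_timeouts: Dict[str, float],
) -> Optional[str]:
    """Return the event reason that justifies terminating startup, or None.

    events holds (reason, first_seen) pairs, with first_seen and now measured
    in seconds on the same clock.
    """
    for reason, first_seen in events:
        timeout = event_timeouts.get(reason)
        if timeout is not None and now - first_seen >= timeout:
            return reason
    return None
```

With this shape, a reason configured with `0.0` triggers termination the first time it is observed, while unlisted reasons never shorten the regular launch timeout.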

OrenZ1 added the enhancement label Dec 27, 2023

OrenZ1 mentioned this issue Feb 29, 2024

Better handling of an empty container status in confirm_remote_startup #1370

Open

OrenZ1 mentioned this issue Jun 2, 2024

Feature/configure kernel launch terminate on events #1383

Open
