Reduce the number of users that GESIS server can have #3080
At 2024-09-02 15:33:00, @arnim and I started a stress test on the GESIS server in which we requested the build of 20 new container images. The build requests are visible in the second chart as a big spike. Around the same time, we see an increase in the number of pending pods. The number of pending pods has two plateaus. Around 2024-09-02 15:43:00, Kubernetes could not pull the image for any of the 20 new container images. Kubernetes terminated the pods and JupyterHub tried a new launch, which translated into 20 Terminating pods + 20 Pending pods. Later, Kubernetes terminated those pods as well because it could not pull the images, leaving the cluster with 40 Terminating pods. Because the pods hold references to the image pull requests, the Kubernetes garbage collector is unable to delete them until the image pull references are also deleted. Around 2024-09-02 16:06:15, Kubernetes started to download the container images, as shown in the Kubernetes events logs.
@arnim the waiting time is the problem, maybe because of some API limit. But we have the Kubernetes events logs.
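For context, the settings that throttle how fast a node can pull images live in the kubelet configuration rather than in JupyterHub or BinderHub. A minimal sketch, assuming the GESIS nodes run with the upstream kubelet defaults (the values shown are those defaults, not what is actually deployed):

```yaml
# KubeletConfiguration fields that limit image pulls on a node
# (values are the upstream kubelet defaults, shown for reference only).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: true   # pull images one at a time per node
registryPullQPS: 5          # max registry queries per second (0 = unlimited)
registryBurst: 10           # burst allowed on top of registryPullQPS
```

With serialized pulls, 20 images requested at once are downloaded one after another on each node, which by itself can produce long Pending phases like the ones described above.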
Thank you @rgaiacs for that very detailed description. We are closing in on the problem :)
Related to #3056
@arnim and I will reduce the maximum number of users that the GESIS server can have.
At the moment, the GESIS launch quota is 250, as configured in https://github.com/gesiscss/orc2/blob/e2fdacc0f3e0f8ab9aaff943ac78af2a9702e153/helm/gesis-config/production.yaml#L11.
@arnim and I need to find a new value for the "launch quota equilibrium", i.e. the launch quota that allows us to continue operating while downloading container images, without pending / terminating pods getting out of control as in the following screenshot.
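For reference, a sketch of where that value lives; the key names below follow the standard BinderHub Helm chart and are an assumption about how production.yaml is actually structured:

```yaml
# Assumed shape of helm/gesis-config/production.yaml (only the quota shown);
# BinderHub.pod_quota caps the number of simultaneous user pods.
binderhub:
  config:
    BinderHub:
      pod_quota: 250   # current "launch quota", to be lowered during the search
```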
@arnim and I will also reduce GESIS's weight contribution to the federation, defined in
mybinder.org-deploy/config/prod.yaml
Line 233 in 1b125dd
during the search for the new "launch quota equilibrium".
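A sketch of the setting referred to above; the surrounding structure and the values are assumptions, only the weight key itself is the one that will be lowered:

```yaml
# Assumed shape of the federation redirector entry in config/prod.yaml;
# a lower weight routes a smaller share of launches to the GESIS cluster.
federationRedirect:
  hosts:
    gesis:
      url: https://notebooks.gesis.org/binder
      weight: 100   # illustrative value, not the one currently in prod.yaml
```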
"Launch Quota Equilibrium" Search Strategy
The following table will be updated as the search progresses.
From 2024-08-29 21:00 UTC+2 until 2024-08-30 09:00 UTC+2
The high number of pending pods was clearly visible.
From 2024-08-30 09:00 UTC+2 until 2024-08-30 21:00 UTC+2
Because it was not possible for me to be online at 2024-08-30 21:00 UTC+2, the change was made a bit earlier.
The increase in pending pods at 2024-08-30 14:56:00 was caused by a stress test conducted by @arnim. The server still has problems when it needs to download many container images from Docker Hub.
From 2024-08-30 21:00 UTC+2 until 2024-08-31 09:00 UTC+2
I don't know what happened at 2024-08-30 22:42:00 that made the number of running pods drop.
I also don't know what happened at 2024-08-30 23:46:00 that made the number of pending pods increase.
I'm running the same configuration for another 12 hours.
From 2024-08-31 09:00 UTC+2 until 2024-08-31 21:00 UTC+2
From 2024-08-31 21:00 UTC+2 until 2024-09-01 09:00 UTC+2
Around 2024-08-31 23:22:00, the number of pending pods started to increase. This is correlated with an increase in the number of container image pulls. There was no increase in the number of image builds.
Does this have some correlation with containerd cleaning the local cache?
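For what it's worth, image garbage collection on a node is normally driven by the kubelet rather than by containerd directly; a sketch of the relevant KubeletConfiguration fields, with the upstream default values:

```yaml
# Kubelet image garbage collection thresholds (upstream defaults shown).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 85   # start evicting cached images above 85% disk usage
imageGCLowThresholdPercent: 80    # keep evicting until usage drops below 80%
imageMinimumGCAge: 2m0s           # never evict images younger than this
```

If disk usage on a node crosses the high threshold, unused cached images are evicted and the next launch of those repositories has to pull them again, which would show up exactly as a burst of image pulls without new builds.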
From 2024-09-01 09:00 UTC+2 until 2024-09-01 21:00 UTC+2
Around 2024-09-01 19:18:00, the number of pending pods started to increase. This is correlated with an increase in the number of container image pulls. There was no increase in the number of image builds.
Does this have some correlation with containerd cleaning the local cache?
From 2024-09-01 21:00 UTC+2 until 2024-09-02 09:00 UTC+2
Between 2024-09-01 22:00:00 and 2024-09-02 01:00:00, the number of running pods was very low. Probably something was going wrong with the server that prevented it from downloading container images.