Reduce the number of users that GESIS server can have #3080

rgaiacs · 2024-08-29T18:25:19Z

Related to #3056

@arnim and I will reduce the maximum number of users that the GESIS server can have.

At the moment, the GESIS launch quota is 250, as configured in https://github.com/gesiscss/orc2/blob/e2fdacc0f3e0f8ab9aaff943ac78af2a9702e153/helm/gesis-config/production.yaml#L11.

@arnim and I need to find a new value for the "launch quota equilibrium", i.e. the launch quota that allows us to keep operating while container images are being downloaded, without the number of pending / terminating pods getting out of control as in the following screenshot.

@arnim and I will also reduce the GESIS weight contribution to the federation, defined in

mybinder.org-deploy/config/prod.yaml

Line 233 in 1b125dd

weight: 100

during the search for the new "launch quota equilibrium".

"Launch Quota Equilibrium" Search Strategy

The following table will be updated as the search progresses.

Date	Time	Launch Quota	Federation Contribution	Notes
2024-08-29	21:00 UTC+2	40	60	To clean the pending / terminating pods
2024-08-30	09:00 UTC+2	100	60
2024-08-30	21:00 UTC+2	120	60
2024-08-31	09:00 UTC+2	120	60
2024-08-31	21:00 UTC+2	120	60
2024-09-01	09:00 UTC+2	120	60
2024-09-02	09:00 UTC+2	120	60
2024-09-02	21:00 UTC+2	??	60
2024-09-03	21:00 UTC+2	??	60
2024-09-03	21:00 UTC+2	??	60

From 2024-08-29 21:00 UTC+2 until 2024-08-30 09:00 UTC+2

The high number of pending pods was cleared.

From 2024-08-30 09:00 UTC+2 until 2024-08-30 21:00 UTC+2

Because I could not be online at 2024-08-30 21:00 UTC+2, the change was made a bit earlier.

The increase in pending pods at 2024-08-30 14:56:00 was caused by a stress test conducted by @arnim. The server still has problems when it needs to download many container images from Docker Hub.

From 2024-08-30 21:00 UTC+2 until 2024-08-31 09:00 UTC+2

I don't know what happened at 2024-08-30 22:42:00 to make the number of running pods drop.

I also don't know what happened at 2024-08-30 23:46:00 to make the number of pending pods increase.

I'm running the same configuration for another 12 hours.

From 2024-08-31 09:00 UTC+2 until 2024-08-31 21:00 UTC+2

From 2024-08-31 21:00 UTC+2 until 2024-09-01 09:00 UTC+2

Around 2024-08-31 23:22:00, the number of pending pods started to increase. This is correlated with an increase in the number of container image pulls. There was no increase in the number of image builds.

Is this correlated with containerd cleaning the local cache?

From 2024-09-01 09:00 UTC+2 until 2024-09-01 21:00 UTC+2

Around 2024-09-01 19:18:00, the number of pending pods started to increase. This is correlated with an increase in the number of container image pulls. There was no increase in the number of image builds.

Is this correlated with containerd cleaning the local cache?

From 2024-09-01 21:00 UTC+2 until 2024-09-02 09:00 UTC+2

Between 2024-09-01 22:00:00 and 2024-09-02 01:00:00, the number of running pods was very low. Probably something went wrong with the server that prevented it from downloading container images.

rgaiacs · 2024-09-02T15:13:14Z

At 2024-09-02 15:33:00, @arnim and I started a stress test on the GESIS server where we requested the build of 20 new container images.

The request to build the containers is visible in the second chart as a big spike. Around the same time, the number of pending pods increases, and it shows two plateaus. Around 2024-09-02 15:43:00, Kubernetes could not pull the image for any of the 20 new container images. Kubernetes terminated the pods and JupyterHub tried a new launch. This resulted in 20 Terminating pods plus 20 Pending pods. Later, Kubernetes terminates those pods as well because it cannot pull the images, and the cluster ends up with 40 Terminating pods.

Because the pods hold references to requests to pull the images, the Kubernetes garbage collector cannot delete the pods until those references are also deleted.

Around 2024-09-02 16:06:15, Kubernetes started to download the container images. From the Kubernetes event logs:
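The Pending and Terminating counts described above can be checked outside the dashboard. The following is a minimal sketch, assuming the pod list is exported with kubectl get pods -o json; the function name summarise_pods is hypothetical. Note that "Terminating" is not a pod phase: kubectl displays it for pods that have metadata.deletionTimestamp set, so the sketch checks that field first.

```python
from collections import Counter


def summarise_pods(pod_list):
    """Count pods by state from a parsed `kubectl get pods -o json` document.

    `pod_list` is the decoded JSON object (a dict with an "items" list).
    """
    counts = Counter()
    for pod in pod_list.get("items", []):
        if pod.get("metadata", {}).get("deletionTimestamp"):
            # Pods marked for deletion are displayed as "Terminating" by kubectl,
            # whatever their status.phase still says.
            counts["Terminating"] += 1
        else:
            counts[pod.get("status", {}).get("phase", "Unknown")] += 1
    return counts
```

One possible use is to pipe the kubectl output into a small script that calls summarise_pods(json.load(sys.stdin)) and prints the result every few minutes during a stress test.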

Container image	Download time	Waiting time
gesiscss/binder-r2d-g5b5b759-dgothrek-2dipyaggrid-aa11bb:94f3d74bce0a0a0434248824f1071998c9c105fa	11.419s	0
gesiscss/binder-r2d-g5b5b759-eichstaedtptb-2dmontecarlohandson-109259:6693feae52e23a19987787c7cc3761e258b35be1	3.941s	15min
gesiscss/binder-r2d-g5b5b759-eichstaedtptb-2dmontecarlohandson-109259:6693feae52e23a19987787c7cc3761e258b35be1	933ms	15min
gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:644b2e554af0e9ecc76752b61f0647eea09179e9	910ms	28min

@arnim the waiting time is the problem. It may be caused by some API limit, although we have the imagePullSecret configured. I will try to check the metadata of the pods tomorrow.
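One way to check the pod metadata is to look at spec.imagePullSecrets in each pod. The following is a minimal sketch, assuming the pods are exported with kubectl get pods -o json; the function name and the secret_name parameter are hypothetical. It only confirms that a pull secret is referenced by the pod. It says nothing about whether the registry is rate-limiting the account.

```python
def pods_missing_pull_secret(pod_list, secret_name=None):
    """Return the names of pods that do not reference an image pull secret.

    `pod_list` is the decoded output of `kubectl get pods -o json`.
    If `secret_name` is given, a pod counts as missing unless it references
    that exact secret; otherwise any referenced secret is accepted.
    """
    missing = []
    for pod in pod_list.get("items", []):
        # spec.imagePullSecrets is a list of {"name": "..."} objects, or absent.
        secrets = pod.get("spec", {}).get("imagePullSecrets") or []
        names = {entry.get("name") for entry in secrets}
        if secret_name is None:
            has_secret = bool(names)
        else:
            has_secret = secret_name in names
        if not has_secret:
            missing.append(pod.get("metadata", {}).get("name", "<unnamed>"))
    return missing
```

An empty result would suggest that the secret is attached to every pod and that the long waiting time has another cause.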

Kubernetes Events Logs

27m                 Normal    Pulled             Pod/jupyter-dgothrek-2dipyaggrid-2d5hn841ym                        Successfully pulled image "gesiscss/binder-r2d-g5b5b759-dgothrek-2dipyaggrid-aa11bb:94f3d74bce0a0a0434248824f1071998c9c105fa" in 11.419s (11.42s including waiting)
32m                 Normal    Pulled             Pod/jupyter-eichstaedtptb-2dmontecarlohandson-2d4msqp66x           Successfully pulled image "gesiscss/binder-r2d-g5b5b759-eichstaedtptb-2dmontecarlohandson-109259:6693feae52e23a19987787c7cc3761e258b35be1" in 3.941s (16m52.737s including waiting)
32m                 Normal    Pulled             Pod/jupyter-eichstaedtptb-2dmontecarlohandson-2dts191lx3           Successfully pulled image "gesiscss/binder-r2d-g5b5b759-eichstaedtptb-2dmontecarlohandson-109259:6693feae52e23a19987787c7cc3761e258b35be1" in 933ms (16m34.432s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d1yh4r76x              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:644b2e554af0e9ecc76752b61f0647eea09179e9" in 910ms (28m50.629s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d4jomepwv              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:854b5c4dc0a9d8702b651730dcb73dbf1824ebab" in 3.673s (31m21.81s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d6uvfy8gm              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:82f0c4db07f88ec6dc760e44acd004bff5735cf4" in 3.175s (31m24.983s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d86sercph              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:dc64e6925e035e4b37091d4c30ca8b0571e5ac17" in 4.791s (27m36.887s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d8s0ug3qu              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:a8e94343385da7cea277684fa73a9a8e05e9b116" in 3.499s (31m37.303s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d8wb2y3na              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:16f92ec3605f141e357474f6f48643e9466e88db" in 911ms (28m28.346s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d9ycn1fr2              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:82f0c4db07f88ec6dc760e44acd004bff5735cf4" in 924ms (28m47.51s including waiting)
38m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2daucf374h              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:644b2e554af0e9ecc76752b61f0647eea09179e9" in 915ms (36m32.659s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dc02g05lb              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:e287c8967d444edca30f6af4fbb60adfd02f72b4" in 3.091s (31m39.378s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2ddm1w12hm              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:82f0c4db07f88ec6dc760e44acd004bff5735cf4" in 1.002s (28m25.668s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2de9qli7tn              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:22648634e310715e1d34509e51a158a1ca9c94c5" in 990ms (28m50.25s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2deu3ebjvl              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:c3c3073488f9f0d3627daf989a4e0426ea8bc7f0" in 2.833s (31m45.505s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dfkdoj0av              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:4e1e19ea790364eb539e7a9e89f6fa8aaf30d281" in 876ms (28m51.123s including waiting)
38m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dlmbwcn06              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:16f92ec3605f141e357474f6f48643e9466e88db" in 979ms (36m31.747s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dmmysoi22              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:22648634e310715e1d34509e51a158a1ca9c94c5" in 929ms (28m24.669s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dmpp2elqx              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:bf354d4890a11183f6fb7d2bf8ab62e4f3773234" in 2.651s (30m27.059s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dmrrs413l              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:644b2e554af0e9ecc76752b61f0647eea09179e9" in 892ms (28m26.557s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dmrynnzv2              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:16f92ec3605f141e357474f6f48643e9466e88db" in 890ms (28m49.262s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dnmdrkpmc              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:4e1e19ea790364eb539e7a9e89f6fa8aaf30d281" in 2.643s (29m9.898s including waiting)
38m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dnxld8wi6              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:82f0c4db07f88ec6dc760e44acd004bff5735cf4" in 930ms (36m33.587s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dpayjz2ut              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:16f92ec3605f141e357474f6f48643e9466e88db" in 2.814s (31m31.799s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dpoh4qhom              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:f90079cb5d89633cbba57b5d65cd2c5df17df726" in 2.675s (31m42.045s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dpy15hb30              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:c3c3073488f9f0d3627daf989a4e0426ea8bc7f0" in 866ms (28m48.375s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dqq7imv0z              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:fb5289a410eb1e13f64aff806d0e0daa4b7bbb91" in 2.767s (29m7.258s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dsfecfjr3              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:f8ae18112210b0f84f826d78d7975cbe648e7416" in 2.734s (30m21.966s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dvuja3dqm              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:453f79ca0cdd64b1820d7b38e344b2d5b4a3fab8" in 2.451s (30m24.415s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dwnhz1pe6              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:22648634e310715e1d34509e51a158a1ca9c94c5" in 2.666s (31m42.678s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dx0o9ibpl              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:644b2e554af0e9ecc76752b61f0647eea09179e9" in 3.031s (31m33.809s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dx6pvsjpx              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:11d3e7147788113abe67dc0b05ac9ea3cced67be" in 2.873s (31m47.355s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dyt0ltbct              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:5cdaf52101076fa1d85b2852aba9e0b018c1fde1" in 2.657s (31m26.613s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dzn3qtbmn              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:6780067c979309d8aeb6dde8d6b7a138c831f38f" in 3.414s (31m29.011s including waiting)
24m                 Normal    Pulled             Pod/jupyter-giswqs-2dwhitebox-2drust-2dbinder-2d30nt0ab6           Successfully pulled image "gesiscss/binder-r2d-g5b5b759-giswqs-2dwhitebox-2drust-2dbinder-fd7fcc:5f82299166f6e41bbf13dafa18c91cf78cc66dd9" in 1m57.875s (1m57.875s including waiting)
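The waiting time for each pull can be derived from the events above by subtracting the download time from the total time "including waiting". The following is a minimal sketch, assuming the message format shown in these events stays the same; the function names are hypothetical.

```python
import re

# Matches the message format seen in the events above:
#   Successfully pulled image "<image>" in <download> (<total> including waiting)
PULLED_RE = re.compile(
    r'Successfully pulled image "(?P<image>[^"]+)" '
    r"in (?P<download>\S+) \((?P<total>\S+) including waiting\)"
)

# One component of a duration such as "16m52.737s" or "933ms".
# "ms" is listed first so that it is not read as "m" followed by "s".
PART_RE = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_duration(text):
    """Convert a duration string such as "28m50.629s" to seconds."""
    parts = PART_RE.findall(text)
    # Reject strings that contain anything besides number+unit components.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(number) * UNIT_SECONDS[unit] for number, unit in parts)


def waiting_times(log_text):
    """Return (image, download_seconds, waiting_seconds) for each Pulled event."""
    rows = []
    for match in PULLED_RE.finditer(log_text):
        download = parse_duration(match["download"])
        total = parse_duration(match["total"])
        rows.append((match["image"], download, total - download))
    return rows
```

Applied to the second event above (3.941s download, 16m52.737s including waiting), this gives a waiting time of roughly 1009 seconds.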

arnim · 2024-09-02T16:45:06Z

Thank you @rgaiacs for that very detailed description. We are closing in on the problem :)

rgaiacs self-assigned this Aug 29, 2024

rgaiacs added a commit to gesiscss/mybinder.org-deploy that referenced this issue Aug 29, 2024

Reduce GESIS contribution during during the search for the new "launc…

570c167

…h quota equilibrium". As outlined in jupyterhub#3080

rgaiacs mentioned this issue Aug 29, 2024

Reduce GESIS contribution during during the search for the new "launch quota equilibrium" #3081

Merged

rgaiacs added a commit to gesiscss/orc2 that referenced this issue Aug 30, 2024

Increase total quota

df5ce81

Related to 760d83e As described in jupyterhub/mybinder.org-deploy#3080.

rgaiacs added a commit to gesiscss/orc2 that referenced this issue Aug 30, 2024

Increase the number of users to 120

85e5ca1

Related to jupyterhub/mybinder.org-deploy#3080

rgaiacs mentioned this issue Sep 4, 2024

Re-think GESIS server #3087

Open
