OVH hub restarting ~2000 times in ~20 days - 5 consecutive failed startups #2642

Open
consideRatio opened this issue May 24, 2023 · 2 comments

Comments

@consideRatio
Member

[W 2023-05-24 18:26:21.361 JupyterHub base:1030] 4 consecutive spawns failed.  Hub will exit if failure count reaches 5 before succeeding
[E 2023-05-24 18:26:21.361 JupyterHub gen:630] Exception in Future <Task finished name='Task-11612' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.9/site-packages/jupyterhub/handlers/base.py:954> exception=TimeoutError('Timeout')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/tornado/gen.py", line 625, in error_callback
        future.result()
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/handlers/base.py", line 961, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/user.py", line 850, in spawn
        raise e
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/user.py", line 747, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    asyncio.exceptions.TimeoutError: Timeout
    
[I 2023-05-24 18:26:21.362 JupyterHub log:186] 200 GET /hub/api/users/eviljasi-arbeitspaket11-c0shamc0/server/progress ([email protected]) 430943.94ms
[C 2023-05-24 18:26:21.508 JupyterHub base:1037] Aborting due to 5 consecutive spawn failures
[E 2023-05-24 18:26:21.508 JupyterHub gen:630] Exception in Future <Task finished name='Task-12697' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.9/site-packages/jupyterhub/handlers/base.py:954> exception=TimeoutError('Timeout')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/tornado/gen.py", line 625, in error_callback
        future.result()
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/handlers/base.py", line 961, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/user.py", line 850, in spawn
        raise e
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/user.py", line 747, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    asyncio.exceptions.TimeoutError: Timeout
    
[I 2023-05-24 18:26:21.509 JupyterHub log:186] 200 GET /hub/api/users/jupyterlab-jupyterlab-demo-a6r1ksnc/server/progress ([email protected]) 344035.05ms
[I 2023-05-24 18:26:22.266 JupyterHub roles:238] Adding role user for User: jupyterlab-jupyterlab-demo-zz2ecopn
[I 2023-05-24 18:26:22.295 JupyterHub log:186] 201 POST /hub/api/users/jupyterlab-jupyterlab-demo-zz2ecopn ([email protected]) 43.89ms
[I 2023-05-24 18:26:22.349 JupyterHub provider:651] Creating oauth client jupyterhub-user-jupyterlab-jupyterlab-demo-zz2ecopn
[W 2023-05-24 18:26:22.391 JupyterHub spawner:3071] Ignoring unrecognized KubeSpawner user_options: binder_launch_host, binder_persistent_request, binder_ref_url, binder_request, image, repo_url, token
[W 2023-05-24 18:26:22.410 JupyterHub utils:77] 'pod.spec.restart_policy' current value: 'OnFailure' is overridden with 'Never', which is the value of 'extra_pod_config.restart_policy'.
[I 2023-05-24 18:26:22.411 JupyterHub log:186] 202 POST /hub/api/users/jupyterlab-jupyterlab-demo-zz2ecopn/servers/ ([email protected]) 106.82ms
[I 2023-05-24 18:26:22.411 JupyterHub spawner:2469] Attempting to create pod jupyter-jupyterlab-2djupyterlab-2ddemo-2dzz2ecopn, with timeout 3
Task was destroyed but it is pending!
task: <Task pending name='Task-3' coro=<shared_client.<locals>.close_client_task() running at /usr/local/lib/python3.9/site-packages/kubespawner/clients.py:58> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fac349b91f0>()]>>
Exception ignored in: <coroutine object shared_client.<locals>.close_client_task at 0x7fac35dab440>
RuntimeError: coroutine ignored GeneratorExit
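
The warning at 4 failures and the abort at 5 in the log above are consistent with a simple consecutive-failure counter that resets on any successful spawn. As far as I know JupyterHub exposes the threshold as `Spawner.consecutive_failure_limit`; the class below is only a minimal sketch of that idea, and `ConsecutiveFailureTracker` and its methods are hypothetical names, not JupyterHub's actual code.

```python
class ConsecutiveFailureTracker:
    """Hypothetical sketch of a consecutive-failure counter.

    Not JupyterHub's implementation. Assumes a limit of 0 or less
    means "never abort".
    """

    def __init__(self, limit=5):
        self.limit = limit
        self.count = 0

    def record_success(self):
        # Any successful spawn resets the streak.
        self.count = 0

    def record_failure(self):
        """Count one failed spawn; return True if the caller should abort."""
        self.count += 1
        return self.limit > 0 and self.count >= self.limit


tracker = ConsecutiveFailureTracker(limit=5)
# Four timeouts in a row only warn; the fifth would trigger the abort.
results = [tracker.record_failure() for _ in range(5)]
```

With a registry that hangs on every pull, each spawn times out, nothing ever resets the streak, and the hub exits as soon as it has seen 5 failures after each restart.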
@consideRatio
Member Author

It seems that a lot of pods are stuck pulling their image, neither erroring nor succeeding. Even pods in a terminating state aren't terminating because they are still stuck pulling.

kubectl describe pod jupyter-binderhub-2dci-2dre-2dimal-2ddockerfile-2di190dym9

Events:
  Type    Reason     Age    From                 Message
  ----    ------     ----   ----                 -------
  Normal  Scheduled  2m38s  ovh2-user-scheduler  Successfully assigned ovh2/jupyter-binderhub-2dci-2dre-2dimal-2ddockerfile-2di190dym9 to user-202211a-node-6f699a
  Normal  Pulled     2m38s  kubelet              Container image "jupyterhub/mybinder.org-tc-init:2020.12.4-0.dev.git.4289.h140cef52" already present on machine
  Normal  Created    2m38s  kubelet              Created container tc-init
  Normal  Started    2m37s  kubelet              Started container tc-init
  Normal  Pulling    2m37s  kubelet              Pulling image "2lmrrh8f.gra7.container-registry.ovh.net/mybinder-builds/r2d-g5b5b759binderhub-2dci-2drepos-2dcached-2dminimal-2ddockerfile-c90b2b:596b52f10efb0c9befc0c4ae850cc5175297d71c"
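
To spot this pattern across many pods, one option is to look for a `Pulling` event that is never followed by a `Pulled` or `Failed` event for the same image. The snippet below is a rough sketch that works on `(reason, message)` pairs copied from the event table above; `pending_pulls` is a hypothetical helper, and the assumption that the image name always appears as `image "..."` in the message is mine.

```python
import re

# Assumption: kubelet event messages quote the image as: image "<name>"
IMAGE_RE = re.compile(r'image "([^"]+)"')


def pending_pulls(events):
    """Return images with a Pulling event and no later Pulled/Failed event.

    `events` is a chronological list of (reason, message) tuples.
    """
    pending = []
    for reason, message in events:
        match = IMAGE_RE.search(message)
        if match is None:
            continue  # e.g. Scheduled / Created / Started events
        image = match.group(1)
        if reason == "Pulling" and image not in pending:
            pending.append(image)
        elif reason in ("Pulled", "Failed") and image in pending:
            pending.remove(image)
    return pending


# Events from the `kubectl describe pod` output above.
events = [
    ("Scheduled", "Successfully assigned ovh2/jupyter-binderhub-2dci-2dre-2dimal-2ddockerfile-2di190dym9 to user-202211a-node-6f699a"),
    ("Pulled", 'Container image "jupyterhub/mybinder.org-tc-init:2020.12.4-0.dev.git.4289.h140cef52" already present on machine'),
    ("Created", "Created container tc-init"),
    ("Started", "Started container tc-init"),
    ("Pulling", 'Pulling image "2lmrrh8f.gra7.container-registry.ovh.net/mybinder-builds/r2d-g5b5b759binderhub-2dci-2drepos-2dcached-2dminimal-2ddockerfile-c90b2b:596b52f10efb0c9befc0c4ae850cc5175297d71c"'),
]
stuck = pending_pulls(events)
```

For this pod the only pending image is the one hosted on the OVH registry, which matches the pull never completing.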

@minrk
Member

minrk commented May 25, 2023

OVH harbor registry appears to be having stability issues again, which I think is the ultimate cause. I've contacted OVH support about it.

I think we should consider moving OVH to an external registry, e.g. quay.io. Downside: images would be public, so we would need to be more proactive about cleaning up and better support deletion requests, because statements like "if you unpublish the ref, your files are inaccessible" are not true at all if the build cache is public.
