JupyterLab servers are being killed when JupyterHub is updated #104
Here is where I think I set it up properly: https://github.com/Quansight/qhub-hpc/blob/main/roles/jupyterhub/templates/jupyterhub_config.py#L37-L38. How would you recreate this?

How to run qhub-hpc: https://github.com/Quansight/qhub-hpc/blob/main/docs/installation.md
The fix mentioned in jupyterhub/jupyterhub#1156 and https://github.com/jupyterhub/jupyterhub/wiki/Run-jupyterhub-as-a-system-service/2b83a97882063c13456f09e81b7c5f7302ba5d33 worked. I added KillMode=process to the service file and redeployed, but I also needed to manually run
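Following the linked wiki page, the relevant change is adding `KillMode=process` to the `[Service]` section of the JupyterHub unit, so that systemd stops only the hub process itself on restart instead of killing the whole control group. A minimal sketch, assuming illustrative paths and an illustrative `ExecStart` (this is not the actual qhub-hpc unit file):

```ini
# /etc/systemd/system/jupyterhub.service (sketch; paths are illustrative)
[Unit]
Description=JupyterHub

[Service]
User=root
ExecStart=/opt/jupyterhub/bin/jupyterhub -f /etc/jupyterhub/jupyterhub_config.py
# On stop/restart, signal only the main hub process; leave child
# processes (the proxy, locally spawned single-user servers) running.
KillMode=process

[Install]
WantedBy=multi-user.target
```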
I thought I needed to set c.JupyterHub.cleanup_proxy = False as well, but it turns out I just wasn't waiting long enough before the server was killed. The user server is still killed after 40-45 seconds. It's also worth noting that if you run
Okay, so what I've seen is that the proxy needs to stay up or the user sessions die. Setting
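For reference, the two cleanup settings discussed in this thread are standard JupyterHub configuration options. A sketch of turning both off in `jupyterhub_config.py` (whether you want both depends on how the proxy is managed):

```python
# jupyterhub_config.py (fragment, sketch)
c = get_config()  # provided by JupyterHub when it loads this file

# Do not stop single-user servers when the hub shuts down.
c.JupyterHub.cleanup_servers = False

# Do not explicitly stop the proxy when the hub shuts down. Note this
# only helps if the proxy can outlive the hub process: a proxy running
# as a subprocess of the hub still dies with it unless the systemd unit
# uses KillMode=process (or the proxy runs as its own service).
c.JupyterHub.cleanup_proxy = False
```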
So I talked with @Adam-D-Lewis about this issue, and the problem is that the proxy currently runs as a subprocess of JupyterHub. Whenever the hub goes down, so do the HTTP connections, which then leads to the JupyterLab servers killing themselves. The proper fix is to run the proxy as a separate managed service under systemd. There are two routes to solve this:
Short term would be to use configurable-http-proxy (the standard way this has been done); see https://github.com/jupyterhub/the-littlest-jupyterhub/tree/125bd1dc186d541585426f7ebf041dd9abad1845/tljh/systemd-units. A few of the steps needed:
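The short-term route amounts to starting configurable-http-proxy under its own systemd unit and telling the hub not to manage it. A hedged sketch of the hub-side configuration (the token and URLs are placeholders; see the TLJH units linked above for the service side):

```python
# jupyterhub_config.py (fragment, sketch; values are placeholders)
c = get_config()

# The proxy is managed externally (e.g. its own systemd unit),
# so the hub must not start or stop it.
c.ConfigurableHTTPProxy.should_start = False

# Shared secret; also passed to configurable-http-proxy via the
# CONFIGPROXY_AUTH_TOKEN environment variable in its unit file.
c.ConfigurableHTTPProxy.auth_token = "change-me"

# Where the hub reaches the proxy's REST API.
c.ConfigurableHTTPProxy.api_url = "http://127.0.0.1:8001"
```

With this setup, restarting the jupyterhub service no longer tears down the proxy, so user HTTP connections survive hub upgrades.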
I would estimate that this is 10 hours of work.
Longer term: there are many clients and projects where we have seen a need for Traefik v2 support in JupyterHub, and it is a popular issue: jupyterhub/traefik-proxy#97. There are many partially completed PRs, and it needs someone to push it over the line. This is holding back several open source projects: The Littlest JupyterHub, Zero to JupyterHub, QHub, and qhub-hpc (now).
This issue unfortunately needs to be re-opened. Jobs running on worker nodes are killed every time JupyterHub restarts, while jobs running on the master node are not. The testing above seemingly used only a master-node setup. I'd consider this high priority. Thanks in advance for investigating.
For what it's worth, the error message shows:
It seems fishy that it's looking at
Follow-up to nebari-dev#106 and fixes nebari-dev#104 (again). We discovered in the JupyterHub logs that it was trying to contact the master node for jobs scheduled on worker nodes, which was incorrect and led to them getting killed:

```
Notebook server job 157 started at hpc-worker-02:52649
(JupyterHub restart)
server never showed up at http://hpc-master-node:52649
```

This fixes the problem by preserving `self.server.ip`, similar to `self.server.port`, in `QHubHPCSpawnerBase.poll()`.
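The shape of the fix can be illustrated with a small self-contained sketch. This is not the actual spawner code; `restore_server` and its `state` dict are hypothetical stand-ins for the spawner's persisted state:

```python
# Hypothetical sketch of the bug: after a hub restart, only the port was
# restored from persisted state, so the hub probed its own hostname
# instead of the worker node that actually ran the job.
def restore_server(state, default_ip="hpc-master-node"):
    """Rebuild the single-user server address from persisted state.

    Preserving state["ip"] alongside state["port"] is the essence of the
    fix; falling back to default_ip reproduces the old, broken behavior.
    """
    ip = state.get("ip") or default_ip
    port = state["port"]
    return f"http://{ip}:{port}"

# With the ip preserved, the hub polls the worker node, not the master:
print(restore_server({"ip": "hpc-worker-02", "port": 52649}))
# -> http://hpc-worker-02:52649
```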
Closes #104. Thanks @sjdemartini for catching this fix in https://jupyterhub.readthedocs.io/en/stable/changelog.html#bugs-fixed.