
JupyterLab servers are being killed when jupyterhub is updated #104

Closed
costrouc opened this issue Mar 2, 2022 · 9 comments · Fixed by #106, #124 or #128

@costrouc
Member

costrouc commented Mar 2, 2022

  • Jupyter notebook servers are killed when restarting JupyterHub
@costrouc costrouc added the bug Something isn't working label Mar 2, 2022
@costrouc
Member Author

costrouc commented Mar 2, 2022

Here is where I think I set it up properly: https://github.com/Quansight/qhub-hpc/blob/main/roles/jupyterhub/templates/jupyterhub_config.py#L37-L38.
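For context, the JupyterHub settings that are supposed to keep sessions alive across a hub restart are the cleanup flags below; whether the linked lines set exactly these is an assumption on my part.

```
# jupyterhub_config.py -- sketch of the relevant settings (assumed, not copied
# from the linked file):
c.JupyterHub.cleanup_servers = False  # leave single-user servers running when the hub exits
c.JupyterHub.cleanup_proxy = False    # leave the proxy running when the hub exits
```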

Here is how you would recreate this:

  1. Log in to JupyterHub, start a JupyterLab session, and run some calculations.
  2. Log in to the root node and run systemctl restart jupyterhub; this should kill the lab sessions.
  3. Another approach is to rerun the Ansible playbook and make a change that restarts the JupyterHub server (steps 2 and 3 are condensed into a shell sketch below).
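Condensed into shell (assuming the hub runs under systemd on that node):

```
# On the root/master node, while a JupyterLab session is active:
sudo systemctl restart jupyterhub   # reproduces the bug: the lab session gets killed
sudo journalctl -u jupyterhub -f    # watch the hub tear down the single-user servers
```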


@Adam-D-Lewis
Member

Adam-D-Lewis commented Mar 4, 2022

The fix mentioned in jupyterhub/jupyterhub#1156 and https://github.com/jupyterhub/jupyterhub/wiki/Run-jupyterhub-as-a-system-service/2b83a97882063c13456f09e81b7c5f7302ba5d33 worked. I added KillMode=process to the service file and redeployed, but I also needed to manually run sudo systemctl daemon-reload and sudo systemctl restart jupyterhub.
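For reference, a minimal sketch of the relevant part of the unit file; the ExecStart path is a placeholder, and the key line is KillMode=process:

```
# /etc/systemd/system/jupyterhub.service (sketch; paths are placeholders)
[Service]
ExecStart=/opt/jupyterhub/bin/jupyterhub -f /etc/jupyterhub/jupyterhub_config.py
# The default KillMode=control-group kills every process in the unit's cgroup
# (proxy and locally spawned servers included) on restart; "process" kills only
# the main jupyterhub process.
KillMode=process
Restart=always
```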

@Adam-D-Lewis
Member

Adam-D-Lewis commented Mar 11, 2022

I thought I needed to set c.JupyterHub.cleanup_proxy = False as well, but it turns out I just wasn't waiting long enough before the server was killed. The user server is still killed after 40-45 seconds.

It's also worth noting that if you run systemctl stop jupyterhub, the user server is not killed (unless you subsequently start the jupyterhub process again).

@Adam-D-Lewis
Member

Adam-D-Lewis commented Mar 11, 2022

Okay, so what I've seen is that the proxy needs to stay up or the user sessions die. Setting c.JupyterHub.cleanup_proxy = False will keep the proxy up when running systemctl stop jupyterhub, but when JupyterHub is restarted it checks whether an existing proxy is still up and, if so, kills it, so the user sessions are still being killed. The solution is to run the JupyterHub proxy externally. There are two options for doing so:

  1. Switch to TraefikTomlProxy and run Traefik as its own service (https://jupyterhub-traefik-proxy.readthedocs.io/en/latest/toml.html#example-setup)
  2. Keep configurable-http-proxy (the default proxy), but set it up as its own systemd service and configure JupyterHub to not start a proxy itself (https://github.com/jupyterhub/configurable-http-proxy); a config sketch for this option follows.
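Roughly, option 2 comes down to a few JupyterHub settings; the token and API URL below are placeholders that have to match however the external proxy is launched:

```
# jupyterhub_config.py -- sketch for an externally managed configurable-http-proxy
c.ConfigurableHTTPProxy.should_start = False                # don't launch a proxy subprocess
c.ConfigurableHTTPProxy.auth_token = "<shared secret, same as CONFIGPROXY_AUTH_TOKEN>"
c.ConfigurableHTTPProxy.api_url = "http://127.0.0.1:8001"   # REST API of the external proxy
```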

@costrouc
Member Author

I talked with @Adam-D-Lewis about this issue, and the problem is that the proxy is currently running as a subprocess of JupyterHub. Whenever the hub goes down, so do the HTTP connections, which then leads to the JupyterLab servers killing themselves. The proper way to do this is to ensure that the proxy runs as a separately managed systemd service.

There are two routes to solve this:

  1. Use configurable-http-proxy

The short-term fix would be to use configurable-http-proxy (the standard way this has been done); see https://github.com/jupyterhub/the-littlest-jupyterhub/tree/125bd1dc186d541585426f7ebf041dd9abad1845/tljh/systemd-units. A sketch of such a proxy unit is shown after this list.

A few of the steps needed:

I would estimate that this is 10 hours of work.

  2. Traefik v2 integration

There are many clients and projects where we have seen a need for Traefik v2 support in JupyterHub, and it is a popular issue: jupyterhub/traefik-proxy#97. There are several partially completed PRs, and it needs someone to push it over the line.

This is holding back several open source projects: The Littlest JupyterHub, Zero to JupyterHub, QHub, and (now) qhub-hpc.
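For route 1, the proxy would get a unit of its own along the lines of the TLJH units linked above. A sketch (paths, ports, and the token are placeholders, not the exact TLJH unit):

```
# /etc/systemd/system/configurable-http-proxy.service (sketch)
[Unit]
Description=configurable-http-proxy for JupyterHub
After=network.target

[Service]
# The proxy reads its REST API token from this variable; jupyterhub_config.py
# must set the same value in c.ConfigurableHTTPProxy.auth_token.
Environment=CONFIGPROXY_AUTH_TOKEN=<shared secret>
ExecStart=/usr/bin/configurable-http-proxy \
  --ip 0.0.0.0 --port 8000 \
  --api-ip 127.0.0.1 --api-port 8001 \
  --default-target http://127.0.0.1:8081
Restart=always

[Install]
WantedBy=multi-user.target
```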

@sjdemartini
Contributor

This issue unfortunately needs to be re-opened. Jobs running on worker nodes are killed every time JupyterHub restarts, while jobs running on the master node are not. The testing above seemingly only used a master-node setup. I'd consider this high priority; thanks in advance for investigating.

@sjdemartini
Contributor

For what it's worth, the error message shows:

[email protected]'s server never showed up at http://hpc-master-node:42479/user/[email protected]/test2/ after 30 seconds. Giving up. 

It seems fishy that it's looking at hpc-master-node instead of the worker node name, but I don't know if that's expected.

@costrouc costrouc reopened this Apr 11, 2022
ericdwang added a commit to ericdwang/qhub-hpc that referenced this issue Apr 11, 2022
Follow-up to nebari-dev#106 and fixes nebari-dev#104 (again)

We discovered in the JupyterHub logs that it was trying to contact the
master node for jobs scheduled on worker nodes which was incorrect and
led to them getting killed:

```
Notebook server job 157 started at hpc-worker-02:52649
(JupyterHub restart)
server never showed up at http://hpc-master-node:52649
```

This fixes the problem by preserving `self.server.ip` similar to
`self.server.port` in `QHubHPCSpawnerBase.poll()`.
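In other words, poll() has to report both the ip and the port that the batch job is actually using. A purely illustrative sketch of that pattern (the base class and the job_is_running/job_ip/job_port names are hypothetical, not the actual qhub-hpc code):

```
# Illustrative fragment of a batch-style Spawner subclass; helper names are
# hypothetical, the real change lives in QHubHPCSpawnerBase.poll().
class ExampleBatchSpawner(SomeBatchSpawnerBase):
    async def poll(self):
        if await self.job_is_running():
            # Point the hub at the node the job actually runs on (e.g. hpc-worker-02)
            # instead of letting server.ip fall back to the master node.
            self.server.ip = self.job_ip
            self.server.port = self.job_port
            return None   # None tells JupyterHub the server is still alive
        return 0          # an integer exit status means it has stopped
```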
costrouc pushed a commit that referenced this issue Apr 11, 2022
Follow-up to #106 and fixes #104 (again)
