When 1 remote kernel has stopped no files are displayed in the Files tab -> sessions REST API returns total failure as long as just 1 remote kernel API fails #5057

Closed
stevehaertel opened this issue Nov 15, 2019 · 3 comments


@stevehaertel

Environment:
Linux [hostname] 2.6.32-754.23.1.el6.x86_64 #1 SMP Tue Sep 17 09:46:55 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

notebook = 6.0.2 (but same thing happens on 6)
jupyter enterprise gateway = 2.0.0

Problem
When I use Jupyter to launch any number of Spark kernels and a Spark application is then stopped outside of Jupyter, no files are displayed in the Files tab when I log in. In my browser's Network tab, I can see that the "sessions" REST API call is failing. I'm not exactly sure what the sessions API is doing (hopefully you can help!), but based on my JEG log output, it looks like Jupyter is calling JEG REST APIs to get info for each of the kernels. If just 1 of those kernel API calls fails, then the entire sessions REST API call returns a 504 ({"message": "Error attempting to connect to Gateway server url 'https://[hostname]:8888'. Ensure gateway url is valid and the Gateway instance is running.", "reason": null})

Question
Would it be possible to return the partial list of kernels that it CAN find instead of an entire failure?
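
To make the question concrete, here is a minimal sketch of the behavior being asked for. It is not the Notebook's or JEG's actual code: `collect_kernel_models`, `KernelLookupError`, and `fake_fetch` are hypothetical names, and `fake_fetch` only stands in for a per-kernel gateway call.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class KernelLookupError(Exception):
    """Hypothetical error raised when a single kernel lookup fails."""


def collect_kernel_models(
    kernel_ids: Iterable[str],
    fetch_model: Callable[[str], Dict],
) -> Tuple[List[Dict], List[str]]:
    """Return (models that could be fetched, IDs whose lookup failed)."""
    found: List[Dict] = []
    missing: List[str] = []
    for kernel_id in kernel_ids:
        try:
            found.append(fetch_model(kernel_id))
        except KernelLookupError:
            # One dead kernel should not hide the others.
            missing.append(kernel_id)
    return found, missing


def fake_fetch(kernel_id: str) -> Dict:
    # Stand-in for a per-kernel gateway call; "dead" simulates a 404.
    if kernel_id == "dead":
        raise KernelLookupError(kernel_id)
    return {"id": kernel_id, "execution_state": "idle"}


models, failed = collect_kernel_models(["a", "dead", "b"], fake_fetch)
```

With this shape, the caller still gets the two reachable kernels and can separately log or clean up the one that failed, instead of the whole listing turning into an error.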

JEG log showing the calls that Jupyter makes for multiple kernels
Starting IPython kernel for Spark Cluster mode on behalf of user shaertel

[I 2019-11-15 12:50:35.284 EnterpriseGatewayApp] ApplicationID: 'app-20191115125034-0007-0cff530c-4325-4688-b204-c0229fd2869a' assigned for KernelID: '8cc66e44-8238-4454-9bb5-2a0cf0074ebe', state: WAITING, 14.0 seconds after starting.
[I 2019-11-15 12:50:35.341 EnterpriseGatewayApp] Kernel started: 8cc66e44-8238-4454-9bb5-2a0cf0074ebe
[I 191115 12:50:35 web:2246] 201 POST /api/kernels (9.21.58.126) 14017.63ms
[I 191115 12:50:35 web:2246] 200 GET /api/kernels/8cc66e44-8238-4454-9bb5-2a0cf0074ebe (9.21.58.126) 2.50ms
[I 191115 12:50:35 web:2246] 200 GET /api/kernels/8cc66e44-8238-4454-9bb5-2a0cf0074ebe (9.21.58.126) 0.72ms
[W 2019-11-15 12:50:35.456 EnterpriseGatewayApp] No session ID specified
[I 191115 12:50:35 web:2246] 101 GET /api/kernels/8cc66e44-8238-4454-9bb5-2a0cf0074ebe/channels (9.21.58.126) 14.12ms
[I 2019-11-15 12:50:42.620 EnterpriseGatewayApp] KernelRestarter: restarting kernel (1/5), keep random ports
[W 2019-11-15 12:50:42.621 EnterpriseGatewayApp] Remote kernel (d389b3c6-a72b-4865-821d-974a7bcccf06) will not be automatically restarted since there are no clients connected at this time.
[I 2019-11-15 12:50:42.746 EnterpriseGatewayApp] Kernel shutdown: d389b3c6-a72b-4865-821d-974a7bcccf06
[I 2019-11-15 12:50:46.326 EnterpriseGatewayApp] Starting buffering for 8cc66e44-8238-4454-9bb5-2a0cf0074ebe:4cf9dc54-4f65b5bdedd2ae520723a69c
[I 191115 12:50:49 web:2246] 200 GET /api/kernelspecs (9.21.58.126) 11.16ms
[W 191115 12:50:49 web:1782] 404 GET /api/kernels/d389b3c6-a72b-4865-821d-974a7bcccf06 (9.21.58.126): Kernel does not exist: d389b3c6-a72b-4865-821d-974a7bcccf06
[W 191115 12:50:49 web:2246] 404 GET /api/kernels/d389b3c6-a72b-4865-821d-974a7bcccf06 (9.21.58.126) 3.34ms
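
For anyone scanning a longer JEG log for the same symptom, the kernel IDs that came back as 404 can be pulled out with a small script. This is only a sketch that assumes the access-log line format shown above; `missing_kernel_ids` is a hypothetical helper, not part of JEG or Notebook.

```python
import re

# Matches access-log lines such as:
# [W 191115 12:50:49 web:2246] 404 GET /api/kernels/<uuid> (ip) 3.34ms
NOT_FOUND = re.compile(r"\] 404 GET /api/kernels/([0-9a-f-]{36})\b")


def missing_kernel_ids(log_text):
    """Return kernel IDs with a 404 lookup, in first-seen order, no repeats."""
    seen = []
    for line in log_text.splitlines():
        match = NOT_FOUND.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


sample_log = """\
[I 191115 12:50:35 web:2246] 200 GET /api/kernels/8cc66e44-8238-4454-9bb5-2a0cf0074ebe (9.21.58.126) 0.72ms
[W 191115 12:50:49 web:1782] 404 GET /api/kernels/d389b3c6-a72b-4865-821d-974a7bcccf06 (9.21.58.126): Kernel does not exist: d389b3c6-a72b-4865-821d-974a7bcccf06
[W 191115 12:50:49 web:2246] 404 GET /api/kernels/d389b3c6-a72b-4865-821d-974a7bcccf06 (9.21.58.126) 3.34ms
"""

missing = missing_kernel_ids(sample_log)
```

On the excerpt above, the only ID reported is the kernel that was shut down at 12:50:42, which is the one whose lookup then fails.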
@kevin-bates
Member

Hi @stevehaertel. This is a bizarre day, as it's the second occurrence (see #5055) of a gateway-related issue that should have been witnessed before, which leads me to believe that something has changed, or had a side effect, such that these issues are now surfacing. That said, I don't tend to have kernel issues or let culling occur very often, so perhaps this is just a humble reminder. 😄

On the bright side, if I run with the updated file in #5055, I don't see this issue on my Notebook.

I can reproduce your issue after a kernel has been culled (which may be a similar scenario to the failing cases you have). After culling, the /api/sessions request, which ultimately hits the EG server to collect the running kernel models, fails, and because of the error handling (fixed in #5055) that failure causes the request from the browser to fail (I presume; I'm not a front-end person). Since the directory listing always follows the /api/sessions request, the contents request is not satisfied and, thus, the Files tab is empty. Here are the two NB log entries from my system when /api/sessions succeeds ...

[D 12:09:41.057 NotebookApp] 200 GET /api/sessions?_=1573848580376 (::1) 634.28ms
[D 12:09:41.069 NotebookApp] 200 GET /api/contents/alice/YARN?type=directory&_=1573848580378 (::1) 6.42ms

I've gone ahead and attached a build from @shuichiro-makigaki's branch in hopes that you can take this for a spin in your configuration. notebook-7.0.0.dev0-py3-none-any.whl.zip

Based on the output in your EG log, it looks like your kernels are failing to start in your Spark cluster. I'd be happy to help with those issues in either the EG gitter channel or via an issue in the EG repo - if you like.

@stevehaertel
Author

@kevin-bates Hey, it worked! :D In my test, after I manually kill 1 running Spark app, I go back into my notebook and can see both files there. I can also go into the kernel that I had originally stopped and start another one with no problem :)
@shuichiro-makigaki It worked! Thank you very much.

@kevin-bates
Member

Fantastic! Thank you for the update. Nice fix @shuichiro-makigaki!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 29, 2021