Configure SparkSession with lazy initialization #813

Closed
jpugliesi opened this issue May 6, 2020 · 3 comments

@jpugliesi

Description

I'm running JEG with a PySpark YARN cluster mode kernel (on AWS EMR, for what it's worth).

The SparkSession is created automatically (via --RemoteProcessProxy.spark-context-initialization-mode), but in some cases our PySpark notebook users would like to define application-specific SparkSession configuration (e.g. spark.executor.instances).

What is the recommended way to configure SparkSession configuration for a lazily initialized SparkContext?

The Spark docs mention it's possible to set dynamic configuration at runtime. Is this the best approach? It seems less than ideal, given it would require users to configure settings outside of the notebook. That said, is it possible for notebook users to define additional spark-submit options at kernel launch?

One hacky approach I've tried is stopping and re-creating the SparkSession in the notebook, but that fails with the error shown in the Screenshots / Logs section below:
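For context on the runtime option, here is a minimal sketch of applying settings from inside the notebook. It assumes `spark` is the SparkSession the kernel created and that the PySpark version provides `spark.conf.isModifiable` (2.4 or later). The helper name `apply_runtime_conf` is hypothetical. Resource settings such as spark.executor.instances are generally fixed when the application is submitted, so they would likely end up in the skipped group rather than being applied.

```python
def apply_runtime_conf(spark, settings):
    """Apply only the settings the running session reports as modifiable.

    `spark` is assumed to be an existing SparkSession (or any object exposing
    the same `conf.isModifiable` / `conf.set` methods).
    Returns two dicts: (applied, skipped).
    """
    applied, skipped = {}, {}
    for key, value in settings.items():
        if spark.conf.isModifiable(key):
            # Runtime-modifiable settings (typically spark.sql.*) can be changed in place.
            spark.conf.set(key, str(value))
            applied[key] = str(value)
        else:
            # Launch-time settings must be passed to spark-submit instead.
            skipped[key] = str(value)
    return applied, skipped
```

In a notebook cell this might be called as `apply_runtime_conf(spark, {"spark.sql.shuffle.partitions": 64, "spark.executor.instances": 4})`; anything returned in the skipped dict still has to be supplied at kernel launch.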

Screenshots / Logs

image

Here are the recent logs on the EG - note the only interesting log is the error on the second to last line.

[D 2020-05-06 15:50:21.963 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/b300ef0f-048e-4282-b61b-dd15ef13c8f3/channels
[W 2020-05-06 15:50:21.964 EnterpriseGatewayApp] No session ID specified
[I 200506 15:50:21 web:2250] 101 GET /api/kernels/b300ef0f-048e-4282-b61b-dd15ef13c8f3/channels (172.21.60.214) 1.91ms
[D 2020-05-06 15:50:21.964 EnterpriseGatewayApp] Opening websocket /api/kernels/b300ef0f-048e-4282-b61b-dd15ef13c8f3/channels
[D 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Getting buffer for b300ef0f-048e-4282-b61b-dd15ef13c8f3
[D 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Clearing buffer for b300ef0f-048e-4282-b61b-dd15ef13c8f3
[I 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Discarding 3 buffered messages for b300ef0f-048e-4282-b61b-dd15ef13c8f3:21054be9-2992eb700a4ec12e97f12c6a
[D 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Connecting to: tcp://10.0.228.187:35979
[D 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Connecting to: tcp://10.0.228.187:49753
[D 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Connecting to: tcp://10.0.228.187:48209
[D 2020-05-06 15:50:21.966 EnterpriseGatewayApp] Connecting to: tcp://10.0.228.187:50889
[D 2020-05-06 15:50:21.998 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (busy)
[D 2020-05-06 15:50:21.999 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (idle)
[D 2020-05-06 15:52:51.169 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (busy)
[D 2020-05-06 15:52:51.170 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: execute_input
[D 2020-05-06 15:52:51.744 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: execute_result
[D 2020-05-06 15:52:51.745 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (idle)
[D 2020-05-06 15:53:32.429 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (busy)
[D 2020-05-06 15:53:32.429 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: execute_input
[D 2020-05-06 15:53:32.909 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (idle)
[D 2020-05-06 15:53:40.905 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (busy)
[D 2020-05-06 15:53:40.906 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: execute_input
[D 2020-05-06 15:53:41.389 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: error
[D 2020-05-06 15:53:41.393 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (idle)

Environment

  • Enterprise Gateway Version: 2.1.1
  • Platform: YARN
  • Others: Jupyter 6.0.3

@jpugliesi changed the title from "Configure SparkSession with lazy initialization." to "Configure SparkSession with lazy initialization" on May 6, 2020

@kevin-bates
Member

Thanks for opening the issue. I agree things aren't where they should be for something like this. What you want is the ability to parameterize kernel launches, but that requires changes throughout the stack (and has been proposed: jupyter/enhancement-proposals#46, but is a little ways off).

The recommended way to address this would be to create separate kernelspecs with the desired parameters baked into SPARK_OPTS, where the name of the directory (i.e., the kernel name) conveys the parameters.
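The kernelspec-per-config approach could be scripted along these lines. This is only a sketch: the helper name `write_kernelspec_variants`, the variant names, and the base spec contents are assumptions for illustration, and a real Enterprise Gateway kernelspec directory also contains launcher scripts that would need to be copied alongside each generated kernel.json.

```python
import copy
import json
from pathlib import Path


def write_kernelspec_variants(base_spec, variants, kernels_dir):
    """Write one kernelspec directory per named Spark configuration.

    base_spec:   dict holding the contents of an existing kernel.json
    variants:    {kernel_name: {spark_conf_key: value}}
    kernels_dir: directory that holds the kernelspec directories
    Returns the list of directories written.
    """
    written = []
    for name, spark_conf in variants.items():
        spec = copy.deepcopy(base_spec)  # leave the base spec untouched
        # Turn each setting into a spark-submit "--conf key=value" option.
        extra = " ".join(f"--conf {k}={v}" for k, v in sorted(spark_conf.items()))
        env = spec.setdefault("env", {})
        env["SPARK_OPTS"] = f"{env.get('SPARK_OPTS', '')} {extra}".strip()
        # Make the variant recognizable in the notebook's kernel picker.
        spec["display_name"] = f"{base_spec.get('display_name', 'PySpark')} ({name})"
        target = Path(kernels_dir) / name
        target.mkdir(parents=True, exist_ok=True)
        (target / "kernel.json").write_text(json.dumps(spec, indent=2))
        written.append(target)
    return written
```

For example, a variant named `spark_python_yarn_cluster_4exec` with `{"spark.executor.instances": 4}` would get `--conf spark.executor.instances=4` appended to whatever SPARK_OPTS the base spec already defines.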

Also note that ALL KERNEL_-prefixed environment variables will flow from the notebook client into the kernelspec. But that's more of a static thing that's only useful if your user wants the same parameters for each notebook/kernel-name combination. This approach would reduce the number of kernelspec instances - although I'd recommend the kernelspec-per-config approach since it's easier on the end-user.
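To illustrate the KERNEL_ variable flow, here is a rough sketch of the body a client might send when starting a kernel, assuming kernels are started with a POST to the gateway's /api/kernels endpoint and that the `env` field of the request carries the variables. The function name `build_kernel_request` and the variable `KERNEL_SPARK_EXECUTOR_INSTANCES` are hypothetical; the kernelspec would have to reference such a variable in its SPARK_OPTS for it to have any effect.

```python
import json


def build_kernel_request(kernel_name, env):
    """Build a JSON body for a kernel start request.

    Only KERNEL_-prefixed entries of `env` (for example, a copy of
    os.environ) are forwarded; everything else is dropped.
    """
    kernel_env = {k: str(v) for k, v in env.items() if k.startswith("KERNEL_")}
    return json.dumps({"name": kernel_name, "env": kernel_env}, sort_keys=True)
```

Calling `build_kernel_request("spark_python_yarn_cluster", dict(os.environ))` (after `import os`) would forward whatever KERNEL_ variables are set in the client's environment.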

@jpugliesi
Author

@kevin-bates Thank you for the quick response and for confirming the existing functionality. I'll check out the linked JEP.

@kevin-bates
Member

I'm going to close this issue. If you feel your question was not adequately addressed, please feel free to re-open it. Thanks.
