Description
I'm running JEG with a PySpark YARN cluster mode kernel (on AWS EMR, for what it's worth).
While the SparkSession is created automatically (via `--RemoteProcessProxy.spark-context-initialization-mode`), in some cases our PySpark notebook users would like to define application-specific SparkSession configuration (e.g. `spark.executor.instances`).
What is the recommended way to set SparkSession configuration for a lazily initialized SparkContext?
The Spark docs mention it's possible to set configuration dynamically at runtime - is this the best approach? It seems slightly less than ideal, given it would require users to configure settings outside of the notebook. That said, is it possible for notebook users to define additional `spark-submit` options at kernel launch?
One hacky approach I've tried is stopping and re-creating the SparkSession in the notebook, but that fails with an error, as shown in the Screenshots/Logs below.
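For reference, this is roughly what that attempt looks like (just a sketch - `spark` is the session created automatically by the kernel launcher, and the `spark.executor.instances` value is only an example):

```python
from pyspark.sql import SparkSession

# Stop the session/context that was created automatically at kernel launch
spark.stop()

# Try to re-create the session with application-specific settings
spark = (
    SparkSession.builder
    .config("spark.executor.instances", "4")  # example application-specific setting
    .getOrCreate()
)
```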
Screenshots / Logs
Here are the recent logs from the EG - note that the only interesting entry is the `error` on the second-to-last line.
[D 2020-05-06 15:50:21.963 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/b300ef0f-048e-4282-b61b-dd15ef13c8f3/channels
[W 2020-05-06 15:50:21.964 EnterpriseGatewayApp] No session ID specified
[I 200506 15:50:21 web:2250] 101 GET /api/kernels/b300ef0f-048e-4282-b61b-dd15ef13c8f3/channels (172.21.60.214) 1.91ms
[D 2020-05-06 15:50:21.964 EnterpriseGatewayApp] Opening websocket /api/kernels/b300ef0f-048e-4282-b61b-dd15ef13c8f3/channels
[D 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Getting buffer for b300ef0f-048e-4282-b61b-dd15ef13c8f3
[D 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Clearing buffer for b300ef0f-048e-4282-b61b-dd15ef13c8f3
[I 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Discarding 3 buffered messages for b300ef0f-048e-4282-b61b-dd15ef13c8f3:21054be9-2992eb700a4ec12e97f12c6a
[D 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Connecting to: tcp://10.0.228.187:35979
[D 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Connecting to: tcp://10.0.228.187:49753
[D 2020-05-06 15:50:21.965 EnterpriseGatewayApp] Connecting to: tcp://10.0.228.187:48209
[D 2020-05-06 15:50:21.966 EnterpriseGatewayApp] Connecting to: tcp://10.0.228.187:50889
[D 2020-05-06 15:50:21.998 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (busy)
[D 2020-05-06 15:50:21.999 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (idle)
[D 2020-05-06 15:52:51.169 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (busy)
[D 2020-05-06 15:52:51.170 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: execute_input
[D 2020-05-06 15:52:51.744 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: execute_result
[D 2020-05-06 15:52:51.745 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (idle)
[D 2020-05-06 15:53:32.429 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (busy)
[D 2020-05-06 15:53:32.429 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: execute_input
[D 2020-05-06 15:53:32.909 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (idle)
[D 2020-05-06 15:53:40.905 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (busy)
[D 2020-05-06 15:53:40.906 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: execute_input
[D 2020-05-06 15:53:41.389 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: error
[D 2020-05-06 15:53:41.393 EnterpriseGatewayApp] activity on b300ef0f-048e-4282-b61b-dd15ef13c8f3: status (idle)
Environment
Enterprise Gateway Version: 2.1.1
Platform: YARN
Others: Jupyter 6.0.3
Thanks for opening the issue. I agree things aren't where they should be for something like this. What you want is the ability to parameterize kernel launches, but that requires changes throughout the stack (this has been proposed in jupyter/enhancement-proposals#46, but is a little ways off).
The recommended way to address this would be to create separate kernelspecs with the desired parameters baked into `SPARK_OPTS`, where the name of the kernelspec directory (i.e., the kernel name) implies the parameters.
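For example, the `env` stanza of such a kernelspec's `kernel.json` might look like the following (an abridged sketch - `SPARK_HOME` and the `--conf` values are placeholders for your environment; only the baked-in `--conf` entries would differ between kernelspecs):

```json
{
  "env": {
    "SPARK_HOME": "/usr/lib/spark",
    "SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.executor.instances=4 --conf spark.yarn.submit.waitAppCompletion=false",
    "LAUNCH_OPTS": ""
  }
}
```

So you might have, say, a `spark_python_yarn_cluster_small` and a `spark_python_yarn_cluster_large` directory whose `kernel.json` files differ only in those `--conf` values.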
Also note that ALL `KERNEL_`-prefixed environment variables will flow from the notebook client into the kernel's launch environment, so they can be referenced from the kernelspec. But that's more of a static thing that's only useful if your user wants the same parameters for each notebook/kernel-name combination. This approach would reduce the number of kernelspec instances - although I'd recommend the kernelspec-per-config approach since it's easier on the end user.
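To illustrate that flow (again just a sketch - `KERNEL_EXECUTOR_INSTANCES` is a made-up variable name, not something EG defines): the client sets the variable in its environment before starting the kernel, and the kernelspec references it with a default. Because the sample launch script evaluates `SPARK_OPTS` in the shell, the reference is expanded at kernel start:

```json
{
  "env": {
    "SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.executor.instances=${KERNEL_EXECUTOR_INSTANCES:-2}"
  }
}
```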