Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Support for configure magic on Spark Python Kubernetes Kernels (WIP) #1105

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

rahul26goyal
Copy link
Contributor

@rahul26goyal rahul26goyal commented Jun 7, 2022

Problem Statement

With JEG running on a remote machine and handling the kernel life cycle, Notebook users can not longer change the Kernels specs / properties locally which would update the configuration with which Spark kernel comes up. There are various use cases where users want to play around and experiment with different spark configuration to arrive at the final configs which best suit their workload. These configs also might vary from one notebook to another based on the workload the notebook is doing. JEG is also used as we multi-tenant service where each user might want to tweak the kernel based on his/ her scenario.
Thus, there is a need for users to be able to update the kernel / spark properties at runtime from the notebook.

Feature Description

The changes proposed in this PR are to add support for a well known magic %%configure -f {} which allows Notebook users to change the spark properties at runtime without having to create / update any kernel spec file. This would allow users to change spark driver, executor resources (like cores, memory), enable / disable spark configuration etc.

Example: The below snipped can be copied into a notebook cell to update the various spark properties associated with the current kernel.

%%configure -f 
{
  "driverMemory": "3G",
  "driverCores" : "2",
  "executorMemory" : "3G",
  "executorCores" : "2",
  "numExecutors" : 5,
  "conf" : {
      "spark.kubernetes.driver.label.test": "test-label"
  }
}

Implementation Details

The below are the changes made at the high level:

  1. I have introduced a new API on JEG POST api/configure/<kernel_id> which accepts a payload similar to create kernel API. This API currently support updating the ["KERNEL_EXTRA_SPARK_OPTS", "KERNEL_LAUNCH_TIMEOUT"] env variables.
  2. The above API tries to restart the same Kernel with the updated configuration. This is done because we want to keep the kernel_id same and want to give a smooth end user experience.
  3. Once the old kernel goes away and a replacement comes up, we also need to refresh the ZMQ sockets to establish the connection with the new kernel so that existing active websocket connection from notebook / jupyterlab UI clients can continue to work. There hooks introduced to handle the same.
  4. Further, in order to complete the usual Jupyter Kernel messaging handshake, we fire the missing zmq messages from JEG to the websocket clients. Example: In order to mark the completion on the current cell, we need to send the exec_reply message and to mark the kernel idle, we need to kernel status=idle messages etc . These messages are pre-generated on the kernel and sent to JEG while making the API call to refresh the kernel.

I will update more details about the changes and add some diagrams.

Testing

  • Basic sanity testing done.

Note

Opening this PR for some early feedback and discussion on the changes.

Comment on lines +248 to +253
response = requests.post(
self.update_kernel_url,
data=json.dumps(payload_dict, default=str),
headers=self.headers,
verify=False,
)

Check failure

Code scanning / CodeQL

Request without certificate validation

Call to requests.post with verify=[False](1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should address this alert somehow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure...we can discuss the approach to resolve this.

logger.debug(f"Payload to refresh: {magic_payload}")
result = self.update_kernel(magic_payload)
return result
return "Done"

Check warning

Code scanning / CodeQL

Unreachable code

Unreachable statement.
else:
logger.error(f"Either key or value is not defined. {env_key}, {env_value}")

def update_kernel(self, payload_dict):

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns

Mixing implicit and explicit returns may indicate an error as implicit returns always return None.
@@ -0,0 +1,296 @@
import base64
import json

Check notice

Code scanning / CodeQL

Module is imported with 'import' and 'import from'

Module 'json' is imported with both 'import' and 'import from'
@@ -0,0 +1,296 @@
import base64

Check notice

Code scanning / CodeQL

Unused import

Import of 'base64' is not used.
Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really interesting @rahul26goyal - thank you. I'm not sure we should hold the 3.0 release for this as this seems additive. Most of the comments are little things.

One thing I'm not sure about is that this only applies to Python kernels, and only those configured for Spark (at the moment, although it could be more general).

I'd also like to look into incorporating the RemoteZMQChannelsHandler in a general way. I haven't given it a real close look, but do you think that could be useful even outside this particular "configure/restart" feature?

payload = self.get_json_body()
self.log.debug(f"Request payload: {payload}")
if payload is None:
self.log.info("Empty payload in the request body.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These info messages aren't necessary since the message returned to the client will indicate where it came from.

self.log.info("Empty payload in the request body.")
self.finish(
json.dumps(
{"message": "Empty payload received. No operation performed on the Kernel."},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{"message": "Empty payload received. No operation performed on the Kernel."},
{"message": f"Empty payload received. No operation performed on kernel: {kernel_id}"},

self.log.info("Empty payload in the request body.")
self.finish(
json.dumps(
{"message": "Empty payload received. No operation performed on the Kernel."},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{"message": "Empty payload received. No operation performed on the Kernel."},
{"message": f"Empty payload received. No operation performed on kernel: {kernel_id}"},

"An existing restart request is still in progress. Skipping this request."
)
raise web.HTTPError(
400, "Duplicate Kernel update request received for Id: %s." % kernel_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
400, "Duplicate Kernel update request received for Id: %s." % kernel_id
400, "Duplicate configure kernel request received for kernel: %s." % kernel_id

"An existing restart request is still in progress. Skipping this request."
)
raise web.HTTPError(
400, "Duplicate Kernel update request received for Id: %s." % kernel_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
400, "Duplicate Kernel update request received for Id: %s." % kernel_id
400, "Duplicate configure kernel request received for kernel: %s." % kernel_id

Comment on lines +248 to +253
response = requests.post(
self.update_kernel_url,
data=json.dumps(payload_dict, default=str),
headers=self.headers,
verify=False,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should address this alert somehow.

enterprise_gateway/services/kernels/remotemanager.py Outdated Show resolved Hide resolved
enterprise_gateway/services/kernels/remotemanager.py Outdated Show resolved Hide resolved
@rahul26goyal
Copy link
Contributor Author

@kevin-bates :
I am help is deciding the right terminology for the operation we are performing on the kernel using this new configure API:

  1. do we call it " refreshing kernel" or "re-configuring kernel" ?
  2. do we change the api to api/refresh/<kernel_id> and call this "refreshing kerne" operation?

we need use this term in both logs and response messages.
pls give this some thought

except ValueError as ve:
logger.exception(f"Could not parse JSON object from input {cell}: error: {ve}.")
return ConfigureMagic.INVALID_JSON_PAYLOAD
except JSONDecodeError as jde:

Check failure

Code scanning / CodeQL

Unreachable 'except' block

Except block for [JSONDecodeError](1) is unreachable; the more general [except block](2) for [ValueError](3) will always be executed in preference.
@kevin-bates
Copy link
Member

@kevin-bates : I am help is deciding the right terminology for the operation we are performing on the kernel using this new configure API:

  1. do we call it " refreshing kernel" or "re-configuring kernel" ?
  2. do we change the api to api/refresh/<kernel_id> and call this "refreshing kerne" operation?

we need use this term in both logs and response messages. pls give this some thought

I guess refresh seems a little easier to understand than reconfigure. Does this imply the magic name would change to %%refresh and does that conflict with existing magics? I think having the terminology match the magic name would be helpful.

I would also like to see the endpoint be under api/kernels rather than a sibling to api/kernels. Do you agree? If not, could you please help me understand why not? Is adding an endpoint under api/kernels violating some kind of convention?

@kevin-bates
Copy link
Member

Hi @rahul26goyal - what is the status of this PR since it's been about 6 weeks since its last update and it seems there are a few things still to work out?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants