Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature][Docs] Explain how to specify container command for head pod #912

Merged
merged 13 commits into from
Feb 22, 2023

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Feb 17, 2023

Why are these changes needed?

Users want to execute some commands at two timings:

  • (1) Before ray start: Take this slack thread as an example, the user wants to set up some environment variables that will be used by ray start.

  • (2) After ray start (RayCluster is ready): Take this slack thread as an example, the user wants to launch a Ray serve deployment when the RayCluster is ready.

Related issue number

Closes #651

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
  • Timing 1 example
    Screen Shot 2023-02-17 at 11 11 32 AM

  • Timing 2, Solution 1 example
    Screen Shot 2023-02-17 at 11 12 27 AM

  • Timing 2, Solution 2 example
    Screen Shot 2023-02-17 at 11 16 42 AM

@kevin85421 kevin85421 changed the title WIP: head command [Feature][Docs] Explain how to specify container command for head pod Feb 17, 2023
@kevin85421 kevin85421 marked this pull request as ready for review February 17, 2023 19:20
@kevin85421
Copy link
Member Author

cc @Yicheng-Lu-llll

@akanso
Copy link
Collaborator

akanso commented Feb 17, 2023

I wonder if Ray start should come first and be the prefix for the user command?

ray start -- .... && user command

I am thinking the user might want ray to start first, and then follow by running their Python application that runs on top of ray (after ray has started); e.g. ray start -- .... && python my_application.py

What is the dominant use case here from what we hear from the users?

@swaroopch
Copy link

Thank you @kevin85421 ! Based on the documentation here, was able to use "command": ["/home/ray/run-jupyter.sh"] and get Jupyter Lab notebook server running on the head node.

@kevin85421
Copy link
Member Author

I wonder if Ray start should come first and be the prefix for the user command?

ray start -- .... && user command

I am thinking the user might want ray to start first, and then follow by running their Python application that runs on top of ray (after ray has started); e.g. ray start -- .... && python my_application.py

Thank @akanso for the comments! The user-specified command comes first because the option --block will make the ray start command block forever (ref).

// If 'ray start' has --block specified, commands after it will not get executed.
// so we need to put cmd before cont.

What is the dominant use case here from what we hear from the users?

Users want to execute some commands at two timings:

  • Before ray start:

    • Take this slack thread as an example, the user wants to set up some environment variables that will be used by ray start.
    • @swaroopch wants to launch a Jupyter notebook.
  • After ray start (RayCluster is ready): Take this slack thread as an example, the user wants to launch a Ray serve deployment when the RayCluster is ready.

In addition, container commands can make RayJob become much more stable by reducing some non-idempotent operations. See #756 for more details.

@akanso
Copy link
Collaborator

akanso commented Feb 17, 2023

I wonder if Ray start should come first and be the prefix for the user command?
ray start -- .... && user command
I am thinking the user might want ray to start first, and then follow by running their Python application that runs on top of ray (after ray has started); e.g. ray start -- .... && python my_application.py

Thank @akanso for the comments! The user-specified command comes first because the option --block will make the ray start command block forever (ref).

// If 'ray start' has --block specified, commands after it will not get executed.
// so we need to put cmd before cont.

What is the dominant use case here from what we hear from the users?

Users want to execute some commands at two timings:

  • Before ray start:

    • Take this slack thread as an example, the user wants to set up some environment variables that will be used by ray start.
    • @swaroopch wants to launch a Jupyter notebook.
  • After ray start (RayCluster is ready): Take this slack thread as an example, the user wants to launch a Ray serve deployment when the RayCluster is ready.

In addition, container commands can make RayJob become much more stable by reducing some non-idempotent operations. See #756 for more details.

Should we have the --block option as a part of Ray-Start-Parameters in the head pod config?

@gvspraveen
Copy link
Contributor

gvspraveen commented Feb 19, 2023

Thank @akanso for the comments! The user-specified command comes first because the option --block will make the ray start command block forever (ref).

I am looking at following line which seems to indicate that blocking is effective after the end of all commands. Probably order doesnt matter? Am I missing something?

args = args + " && sleep infinity"

@kevin85421
Copy link
Member Author

kevin85421 commented Feb 20, 2023

Should we have the --block option as a part of Ray-Start-Parameters in the head pod config?
@akanso
Yes, --block has already been a part of Ray-Start-Parameters.

  • With --block in the Ray-Start-Parameters, the option --block will be appended after the ray start command.
# RayCluster
rayStartParams:
  ...
  block: 'true'

# kubectl describe pod $HEAD_POD 
Command:
  /bin/bash
  -lc
  --
Args:
  ulimit -n 65536; ray start --head  --block  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --num-cpus=1  --memory=2000000000
  • Without --block, sleep infinity will be appended to prevent the Docker container from getting into the exited status immediately (link).
# RayCluster
rayStartParams:
  ...
  # block: 'true'

Command:
  /bin/bash
  -lc
  --
Args:
  ulimit -n 65536; ray start --head  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --num-cpus=1  --memory=2000000000  && sleep infinity

I am looking at following line which seems to indicate that blocking is effective after the end of all commands. Probably order doesnt matter? Am I missing something?

@gvspraveen

There is still a little difference between --block and sleep infinity. See #675 for more details.

We did not encourage users to run ray start without --block.

  • Without --block, we need to append sleep infinity to the end of the ray start command to keep the container running.
  • With --block, when the ray process crashes, the KubeRay operator can detect the unhealthy condition in a short time because the container will exit immediately. Without --block, the unhealthy condition can still be detected by both readiness and liveness probes, but it may take more time to detect it.

In addition, from @DmitriGekhtman's comment #675 (comment), "Later, we can consider inject the --block automatically.". I opened an issue to track the progress #915.

@Yicheng-Lu-llll
Copy link
Contributor

LGTM!

To summarize based on my understanding:

  1. execute commands before ray start:

    • Set headGroupSpec.template.spec.containers.0.command and headGroupSpec.template.spec.containers.0.args
  2. execute commands after ray start:

    • Same as 1 except we need to add logic to wait for ray head to start and run the script in the background
    • Use postStart hook, but as it is not guaranteed to be executed after ENTRYPOINT, we still need to add logic to wait for ray head to start. Also, we are unable to see the log using kubectl logs {HeadNode} .
  3. The difficulty to execute commands after ray start is that block parameter of ray start will prevent executing the command after it. But the block parameter is recommended and may even be injected automatically in the future.

btw, I see the code here actually allows user to add ray start and rayStartParams in headGroupSpec.template.spec.containers.0.command and headGroupSpec.template.spec.containers.0.args. It may potentialy overwrite the spec.rayStartParams. So, how about add logic to forbide user to write ray start and rayStartParams in these places?

@kevin85421
Copy link
Member Author

@Yicheng-Lu-llll Thank you for your review!

To summarize based on my understanding: ...

Your understanding is correct.

btw, I see the code here actually allows user to add ray start and rayStartParams in headGroupSpec.template.spec.containers.0.command and headGroupSpec.template.spec.containers.0.args. It may potentialy overwrite the spec.rayStartParams. So, how about add logic to forbide user to write ray start and rayStartParams in these places?

Agree. The logic is very weird, but removing the logic will break the backward compatibility. I will open a new issue (#917) to track the progress and discuss whether we need to remove the logic.

Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just had some suggestions about readability.

If we plan to change the container command logic soon, no need to optimize the readability too much here; feel free to ignore the suggestions.

docs/guidance/head-command.md Outdated Show resolved Hide resolved
docs/guidance/head-command.md Outdated Show resolved Hide resolved
# `command` and `args` will become a part of `spec.containers.0.args` in the head Pod.
command: ["echo 123"]
```
* Running head Pod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also have a way to run on the worker pod? If not, we should state this explicitly somewhere in the beginning, and possibly we should get rid of this section title

Suggested change
* Running head Pod
* Running on the head Pod

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Yes, we can use the same method to specify container commands on the worker Pods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet! Then we can update the top of the section to be more general

# Specify container commands for head Pod
You can execute commands on the head pod at two timings:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. 642cbd6

Comment on lines 33 to 37
* `spec.containers.0.command` is hardcoded with `["/bin/bash", "-lc", "--"]`.
* `spec.containers.0.args` contains two parts:
* (Part 1) **user-specified command**: A string concatenates `headGroupSpec.template.spec.containers.0.command` from RayCluster and `headGroupSpec.template.spec.containers.0.args` from RayCluster together. In this example, the string will be `echo 123` because `headGroupSpec.template.spec.containers.0.args` is not defined here.
* (Part 2) **ray start command**: The command is created based on `rayStartParams` specified in RayCluster. The command will look like `ulimit -n 65536; ray start ...`.
* To summarize, `spec.containers.0.args` will be `$(user-specified command) && $(ray start command)`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section is logically complete, but it still took some time for me to understand. Is my understanding correct that the spec.containers.0.command and spec.containers.0.args provided by the user get modified and rewritten by the operator? That might be the confusing part, and it would help to start by stating that explicitly. To be precise:

spec.containers.0.args <-- spec.containers.0.command && ray start ...
spec.containers.0.command <-- ["/bin/bash", "-lc", "--"]

Also, what happens if the user specifies "args"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My overall impression is that this modification logic is an implementation detail that most users won't care about, and we can remove it from the docs altogether (and maybe just point to the relevant part of the code for advanced users, or bump it below to a section called "Advanced")

My guess is that all users really want to know from this doc is that if you put command: ["echo 123"], it will run before ray start.

I could be wrong about this though, I'd be interested what other reviewers think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is my understanding correct that the spec.containers.0.command and spec.containers.0.args provided by the user get modified and rewritten by the operator?

Yes, users provide headGroupSpec.template.spec.containers.0.command and headGroupSpec.template.spec.containers.0.args in RayCluster CR, and they will be modified and rewritten by the operator to create spec.containers.0.command and spec.containers.0.args.

Also, what happens if the user specifies "args"?

spec.containers.0.args <-- [headGroupSpec...command] [headGroupSpec...args] && [ray start ...]

# Example RayCluster
command: ["echo 123"]
args: ["456"]

# Pod args will be
echo 123  456  && [ray start ...]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added args to the RayCluster example. See f4a47ee for more details.

# {'object_store_memory': 539679129.0, 'node:10.244.0.26': 1.0, 'CPU': 1.0, 'memory': 2147483648.0}
# INFO: Print Ray cluster resources
```
* The main difference between these two solutions is users can check the logs via `kubectl logs` with Solution 2.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put the pros and cons of either approach at the beginning of this section? Also, since only one pro is listed, it sounds like Solution 2 is strictly better. Is there any reason to use Solution 1?

If we have a clear recommendation we could put (Recommended) for one of them to reduce the decisions the user has to make.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. 9f63c63

kevin85421 and others added 5 commits February 21, 2023 11:44
Co-authored-by: Archit Kulkarni <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Archit Kulkarni <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
@@ -0,0 +1,148 @@
# Specify container commands for Ray head/worker Pods
You can execute commands on the head/worker pods at two timings:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some real scenarios for users to better understand why they need to run commands along with the cluster setup?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Microsoft did similar way in the past. they like to run some code after cluster is running, and it has conflict with --block at that time. I think eventually, we determine to have the flag for different users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some real scenarios for users to better understand why they need to run commands along with the cluster setup?

https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1675378764037199
https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669647595429959

The document has already provided some real-world scenarios based on the two slack threads above.

I think Microsoft did similar way in the past. they like to run some code after cluster is running, and it has conflict with --block at that time. I think eventually, we determine to have the flag for different users.

#912 (comment)
Yes, --block has already been a part of Ray-Start-Parameters.

@Jeffwan
Copy link
Collaborator

Jeffwan commented Feb 21, 2023

Technically looks good to me. The lifecycle way is a little bit tricky and it's hard to debug the commands. user has to check events for more details. I would say this works technically but not a recommended way. I am thinking whether we should put it as a solution

@kevin85421
Copy link
Member Author

Technically looks good to me. The lifecycle way is a little bit tricky and it's hard to debug the commands. user has to check events for more details. I would say this works technically but not a recommended way. I am thinking whether we should put it as a solution

We currently say that Solution 1 (container command) is the recommended solution and compare the difference between Solution 1 and Solution 2 (postStart). The postStart solution is provided by a user, and I am not sure how does he choose between postStart and container command.

I am not familiar with postStart, but it still has its advantages based on Kubernetes's document.

Kubernetes sends the postStart event immediately after the Container is created. There is no guarantee, however, that the postStart handler is called before the Container's entrypoint is called. The postStart handler runs asynchronously relative to the Container's code, but Kubernetes' management of the container blocks until the postStart handler completes. The Container's status is not set to RUNNING until the postStart handler completes.

@Jeffwan
Copy link
Collaborator

Jeffwan commented Feb 22, 2023

/lgtm

@kevin85421 kevin85421 merged commit 0564748 into ray-project:master Feb 22, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…ray-project#912)

Users want to execute some commands at two timings: (1) Before `ray start` (2) After `ray start`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature][Docs] Explain how to specify container command for head pod
7 participants