[Feature][Docs] Explain how to specify container command for head pod #912

kevin85421 · 2023-02-17T18:03:38Z

Why are these changes needed?

Users want to run commands at two points in time:

(1) Before ray start: For example, in this Slack thread, the user wants to set environment variables that ray start will use.
(2) After ray start (RayCluster is ready): For example, in this Slack thread, the user wants to launch a Ray Serve deployment once the RayCluster is ready.

Related issue number

Closes #651

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

Timing 1 example
Timing 2, Solution 1 example
Timing 2, Solution 2 example

kevin85421 · 2023-02-17T19:22:26Z

cc @Yicheng-Lu-llll

akanso · 2023-02-17T20:02:48Z

I wonder if Ray start should come first and be the prefix for the user command?

ray start -- .... && user command

I am thinking the user might want Ray to start first and then run their Python application on top of it (after Ray has started), e.g. ray start -- .... && python my_application.py

What is the dominant use case here from what we hear from the users?

swaroopch · 2023-02-17T20:26:50Z

Thank you @kevin85421! Based on the documentation here, I was able to use "command": ["/home/ray/run-jupyter.sh"] and get a Jupyter Lab notebook server running on the head node.

kevin85421 · 2023-02-17T20:50:16Z

I wonder if Ray start should come first and be the prefix for the user command?

ray start -- .... && user command

I am thinking the user might want Ray to start first and then run their Python application on top of it (after Ray has started), e.g. ray start -- .... && python my_application.py

Thanks @akanso for the comments! The user-specified command comes first because the --block option makes the ray start command block forever (ref).

kuberay/ray-operator/controllers/ray/common/pod.go

Lines 313 to 314 in 4714892

    
    // If 'ray start' has --block specified, commands after it will not get executed.
    // so we need to put cmd before cont.
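
To make the ordering argument concrete, here is a toy Python model of an `a && b` chain. It is only an illustration, not the operator's Go code; `steps_that_start` and the `(name, blocks_forever)` tuples are hypothetical names, and the model assumes every non-blocking step exits with status 0.

```python
def steps_that_start(chain):
    """Return the names of the steps in an `a && b && ...` chain that get to start.

    `chain` is a list of (name, blocks_forever) tuples. This toy model assumes
    every non-blocking step exits successfully.
    """
    started = []
    for name, blocks_forever in chain:
        started.append(name)
        if blocks_forever:
            # A blocking step never returns, so the shell never reaches
            # whatever follows the next `&&`.
            break
    return started


ray_first = [("ray start --head --block", True), ("python my_application.py", False)]
user_first = [("echo 123", False), ("ray start --head --block", True)]

print(steps_that_start(ray_first))   # the user command is never started
print(steps_that_start(user_first))  # both steps start
```

With `--block`, anything placed after `ray start` is unreachable, which is why the user-specified command has to come first.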

What is the dominant use case here from what we hear from the users?

Users want to run commands at two points in time:

Before ray start:
- For example, in this Slack thread, the user wants to set environment variables that ray start will use.
- @swaroopch wants to launch a Jupyter notebook.
After ray start (RayCluster is ready):
- For example, in this Slack thread, the user wants to launch a Ray Serve deployment once the RayCluster is ready.

In addition, container commands can make RayJob much more stable by reducing non-idempotent operations. See #756 for more details.

akanso · 2023-02-17T23:24:40Z

Should we have the --block option as a part of Ray-Start-Parameters in the head pod config?

gvspraveen · 2023-02-19T22:07:36Z

Thanks @akanso for the comments! The user-specified command comes first because the --block option makes the ray start command block forever (ref).

I am looking at the following line, which seems to indicate that blocking takes effect after all the commands finish. So the order probably doesn't matter? Am I missing something?

kuberay/ray-operator/controllers/ray/common/pod.go

Line 322 in 4714892

args = args + " && sleep infinity"

kevin85421 · 2023-02-20T06:17:10Z

Should we have the --block option as a part of Ray-Start-Parameters in the head pod config?
@akanso
Yes, --block is already part of Ray-Start-Parameters.

With --block in the Ray-Start-Parameters, the --block option is appended to the ray start command.

# RayCluster
rayStartParams:
  ...
  block: 'true'

# kubectl describe pod $HEAD_POD 
Command:
  /bin/bash
  -lc
  --
Args:
  ulimit -n 65536; ray start --head  --block  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --num-cpus=1  --memory=2000000000

Without --block, sleep infinity is appended to keep the Docker container from entering the exited status immediately (link).

# RayCluster
rayStartParams:
  ...
  # block: 'true'

Command:
  /bin/bash
  -lc
  --
Args:
  ulimit -n 65536; ray start --head  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --num-cpus=1  --memory=2000000000  && sleep infinity
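
The two outputs above can be approximated with a short Python sketch. This is a simplified stand-in for the operator's Go logic, not a copy of it: `build_ray_start_command` is a hypothetical helper, and the alphabetical flag ordering and single-space separators are assumptions made for readability.

```python
def build_ray_start_command(ray_start_params):
    """Render a head-node `ray start` command from a rayStartParams-like dict."""
    flags = []
    for key, value in sorted(ray_start_params.items()):
        if key == "block":
            continue  # handled separately below
        flags.append(f"--{key}={value}")

    command = "ulimit -n 65536; ray start --head"
    block = str(ray_start_params.get("block", "false")).lower() == "true"
    if block:
        command += " --block"
    if flags:
        command += " " + " ".join(flags)
    if not block:
        # Without --block, `ray start` returns right away, so something else
        # has to keep the container in the foreground.
        command += " && sleep infinity"
    return command


params = {"dashboard-host": "0.0.0.0", "num-cpus": "1"}
print(build_ray_start_command({**params, "block": "true"}))
print(build_ray_start_command(params))
```

The only difference between the two printed commands is the `--block` flag versus the trailing `&& sleep infinity`.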

I am looking at the following line, which seems to indicate that blocking takes effect after all the commands finish. So the order probably doesn't matter? Am I missing something?

@gvspraveen

There is still a small difference between --block and sleep infinity. See #675 for more details.

We do not encourage users to run ray start without --block.

Without --block, we need to append sleep infinity to the end of the ray start command to keep the container running.
With --block, when the ray process crashes, the KubeRay operator can detect the unhealthy condition quickly because the container exits immediately. Without --block, the readiness and liveness probes can still detect the unhealthy condition, but detection may take longer.

In addition, @DmitriGekhtman commented in #675 (comment): "Later, we can consider inject the --block automatically." I opened #915 to track the progress.

Yicheng-Lu-llll · 2023-02-20T20:21:08Z

LGTM!

To summarize based on my understanding:

(1) Execute commands before ray start:
- Set headGroupSpec.template.spec.containers.0.command and headGroupSpec.template.spec.containers.0.args.
(2) Execute commands after ray start:
- Same as (1), except we need to add logic that waits for the Ray head to start and runs the script in the background.
- Use a postStart hook. Because it is not guaranteed to run after the ENTRYPOINT, we still need to add logic that waits for the Ray head to start. Also, we are unable to see the logs using kubectl logs {HeadNode}.

The difficulty with executing commands after ray start is that the block parameter of ray start prevents any command after it from running. But the block parameter is recommended and may even be injected automatically in the future.
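
The "wait for the Ray head to start" logic mentioned in both options could look roughly like the sketch below. It is one possible approach, not what the PR's docs necessarily use: it simply polls a TCP port until it accepts connections. `wait_for_port` is a hypothetical helper, and using port 6379 (Ray's default GCS port) as the readiness signal is an assumption.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Not listening yet (or not reachable); back off and retry.
            time.sleep(interval_s)
    return False
```

A background script launched before `ray start` could call `wait_for_port("127.0.0.1", 6379)` and only then run the follow-up command, so that the blocking `ray start` in the foreground does not prevent it from running.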

btw, I see the code here actually allows users to add ray start and rayStartParams in headGroupSpec.template.spec.containers.0.command and headGroupSpec.template.spec.containers.0.args. It may potentially overwrite spec.rayStartParams. So, how about adding logic to forbid users from writing ray start and rayStartParams in these places?
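
As a rough illustration of what such a check might look like (purely hypothetical; nothing like this is part of this PR, and `mentions_ray_start` is an invented name), a validator could scan the user-specified command and args for a `ray start` invocation:

```python
import re

# Matches `ray start` at the beginning of the string or right after
# whitespace or a shell separator (;, &, |).
RAY_START = re.compile(r"(^|[\s;&|])ray\s+start(\s|$)")


def mentions_ray_start(command, args):
    """Return True if the user-specified command or args already invoke `ray start`."""
    text = " ".join(list(command or []) + list(args or []))
    return bool(RAY_START.search(text))


print(mentions_ray_start(["echo 123"], ["456"]))
print(mentions_ray_start(["/bin/bash", "-c"], ["ulimit -n 65536; ray start --block"]))
```

A plain text match like this is only a heuristic; it cannot see a `ray start` hidden inside a script file such as /home/ray/run-jupyter.sh.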

kevin85421 · 2023-02-21T17:51:52Z

@Yicheng-Lu-llll Thank you for your review!

To summarize based on my understanding: ...

Your understanding is correct.

btw, I see the code here actually allows users to add ray start and rayStartParams in headGroupSpec.template.spec.containers.0.command and headGroupSpec.template.spec.containers.0.args. It may potentially overwrite spec.rayStartParams. So, how about adding logic to forbid users from writing ray start and rayStartParams in these places?

Agreed. The logic is very weird, but removing it would break backward compatibility. I will open a new issue (#917) to track the progress and discuss whether we need to remove the logic.

architkulkarni

Looks good! Just had some suggestions about readability.

If we plan to change the container command logic soon, no need to optimize the readability too much here; feel free to ignore the suggestions.

docs/guidance/head-command.md

architkulkarni · 2023-02-21T19:20:38Z

docs/guidance/head-command.md

+          # `command` and `args` will become a part of `spec.containers.0.args` in the head Pod.
+          command: ["echo 123"]
+```
+* Running head Pod


Do we also have a way to run on the worker pod? If not, we should state this explicitly somewhere in the beginning, and possibly we should get rid of this section title

Suggested change

* Running head Pod

* Running on the head Pod

Good catch! Yes, we can use the same method to specify container commands on the worker Pods.

Sweet! Then we can update the top of the section to be more general

# Specify container commands for head Pod
You can execute commands on the head pod at two timings:

Updated. 642cbd6

architkulkarni · 2023-02-21T19:27:35Z

docs/guidance/head-command.md

+  * `spec.containers.0.command` is hardcoded with `["/bin/bash", "-lc", "--"]`.
+  * `spec.containers.0.args` contains two parts:
+    * (Part 1) **user-specified command**: A string concatenates `headGroupSpec.template.spec.containers.0.command` from RayCluster and `headGroupSpec.template.spec.containers.0.args` from RayCluster together. In this example, the string will be `echo 123` because `headGroupSpec.template.spec.containers.0.args` is not defined here.
+    * (Part 2) **ray start command**: The command is created based on `rayStartParams` specified in RayCluster. The command will look like `ulimit -n 65536; ray start ...`.
+    * To summarize, `spec.containers.0.args` will be `$(user-specified command) && $(ray start command)`.


I think this section is logically complete, but it still took some time for me to understand. Is my understanding correct that the spec.containers.0.command and spec.containers.0.args provided by the user get modified and rewritten by the operator? That might be the confusing part, and it would help to start by stating that explicitly. To be precise:

spec.containers.0.args <-- spec.containers.0.command && ray start ...
spec.containers.0.command <-- ["/bin/bash", "-lc", "--"]

Also, what happens if the user specifies "args"?

My overall impression is that this modification logic is an implementation detail that most users won't care about, and we can remove it from the docs altogether (and maybe just point to the relevant part of the code for advanced users, or bump it below to a section called "Advanced")

My guess is that all users really want to know from this doc is that if you put command: ["echo 123"], it will run before ray start.

I could be wrong about this though, I'd be interested what other reviewers think.

Is my understanding correct that the spec.containers.0.command and spec.containers.0.args provided by the user get modified and rewritten by the operator?

Yes, users provide headGroupSpec.template.spec.containers.0.command and headGroupSpec.template.spec.containers.0.args in RayCluster CR, and they will be modified and rewritten by the operator to create spec.containers.0.command and spec.containers.0.args.

Also, what happens if the user specifies "args"?

spec.containers.0.args <-- [headGroupSpec...command] [headGroupSpec...args] && [ray start ...]

# Example RayCluster
command: ["echo 123"]
args: ["456"]

# Pod args will be
echo 123 456 && [ray start ...]
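
The rewrite rule above can be sketched in a few lines of Python. This is an illustrative approximation of the behavior described in this thread, not the operator's actual Go implementation; `compose_pod_args` is a hypothetical name, and returning the bare ray start command when the user specifies nothing is an assumption.

```python
def compose_pod_args(user_command, user_args, ray_start_command):
    """Join the user-specified command and args, then chain `ray start` after them."""
    user_part = " ".join(list(user_command or []) + list(user_args or []))
    if not user_part:
        # Assumption: with no user-specified command, only `ray start` remains.
        return ray_start_command
    return f"{user_part} && {ray_start_command}"


# Per the doc under review, the Pod's command is hardcoded by the operator.
POD_COMMAND = ["/bin/bash", "-lc", "--"]

pod_args = compose_pod_args(["echo 123"], ["456"], "ulimit -n 65536; ray start --head --block")
print(POD_COMMAND, pod_args)
```

Here `command: ["echo 123"]` and `args: ["456"]` from the RayCluster end up as a single shell string that runs before `ray start`.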

I added args to the RayCluster example. See f4a47ee for more details.

architkulkarni · 2023-02-21T19:33:31Z

docs/guidance/head-command.md

+    # {'object_store_memory': 539679129.0, 'node:10.244.0.26': 1.0, 'CPU': 1.0, 'memory': 2147483648.0}
+    # INFO: Print Ray cluster resources
+    ```
+    * The main difference between these two solutions is users can check the logs via `kubectl logs` with Solution 2.


Can we put the pros and cons of either approach at the beginning of this section? Also, since only one pro is listed, it sounds like Solution 2 is strictly better. Is there any reason to use Solution 1?

If we have a clear recommendation we could put (Recommended) for one of them to reduce the decisions the user has to make.

Updated. 9f63c63

Co-authored-by: Archit Kulkarni <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]>

Jeffwan · 2023-02-21T23:24:42Z

docs/guidance/pod-command.md

@@ -0,0 +1,148 @@
+# Specify container commands for Ray head/worker Pods
+You can execute commands on the head/worker pods at two timings:


can you add some real scenarios for users to better understand why they need to run commands along with the cluster setup?

I think Microsoft did something similar in the past. They wanted to run some code after the cluster was running, and it conflicted with --block at the time. I think we eventually decided to have the flag for different users.

can you add some real scenarios for users to better understand why they need to run commands along with the cluster setup?

https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1675378764037199
https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669647595429959

The document has already provided some real-world scenarios based on the two slack threads above.

I think Microsoft did something similar in the past. They wanted to run some code after the cluster was running, and it conflicted with --block at the time. I think we eventually decided to have the flag for different users.

#912 (comment)
Yes, --block is already part of Ray-Start-Parameters.

Jeffwan · 2023-02-21T23:28:30Z

Technically looks good to me. The lifecycle approach is a little tricky, and the commands are hard to debug: the user has to check events for more details. I would say it works technically but is not a recommended approach. I am wondering whether we should present it as a solution.

kevin85421 · 2023-02-22T01:24:48Z

Technically looks good to me. The lifecycle approach is a little tricky, and the commands are hard to debug: the user has to check events for more details. I would say it works technically but is not a recommended approach. I am wondering whether we should present it as a solution.

We currently say that Solution 1 (container command) is the recommended solution and compare Solution 1 with Solution 2 (postStart). The postStart solution was provided by a user, and I am not sure how they chose between postStart and the container command.

I am not familiar with postStart, but it still has its advantages according to the Kubernetes documentation.

Kubernetes sends the postStart event immediately after the Container is created. There is no guarantee, however, that the postStart handler is called before the Container's entrypoint is called. The postStart handler runs asynchronously relative to the Container's code, but Kubernetes' management of the container blocks until the postStart handler completes. The Container's status is not set to RUNNING until the postStart handler completes.

Jeffwan · 2023-02-22T22:43:23Z

/lgtm

kevin85421 added 3 commits February 17, 2023 02:50

update

2684d05

update

77e326a

add a newline

3998c55

kevin85421 changed the title ~~WIP: head command~~ [Feature][Docs] Explain how to specify container command for head pod Feb 17, 2023

kevin85421 marked this pull request as ready for review February 17, 2023 19:20

kevin85421 requested review from architkulkarni, gvspraveen, DmitriGekhtman, Jeffwan and tgaddair February 17, 2023 19:20

kevin85421 requested a review from akanso February 17, 2023 21:05

kevin85421 mentioned this pull request Feb 20, 2023

[Feature] Inject the --block option to ray start command automatically #915

Closed

2 tasks

helm chart support

f3d9748

kevin85421 mentioned this pull request Feb 21, 2023

[Feature] Update the logic of specifying container commands for head Pod #917

Closed

2 tasks

fix chart lint error

88f1316

architkulkarni approved these changes Feb 21, 2023

kevin85421 and others added 5 commits February 21, 2023 11:44

Update docs/guidance/head-command.md

65715da

Co-authored-by: Archit Kulkarni <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]>

Update docs/guidance/head-command.md

63bb47e

Co-authored-by: Archit Kulkarni <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]>

update

9f63c63

update

642cbd6

update

f4a47ee

kevin85421 added 3 commits February 21, 2023 21:32

rename

909c75a

update

c43e1f4

update

685723d

Jeffwan reviewed Feb 21, 2023

Jeffwan approved these changes Feb 22, 2023

kevin85421 merged commit 0564748 into ray-project:master Feb 22, 2023

Yicheng-Lu-llll mentioned this pull request Feb 26, 2023

Inject the --block option to ray start command automatically #932

Merged

Yicheng-Lu-llll mentioned this pull request Mar 24, 2023

remove ray-cluster.getting-started.yaml #987

Merged

4 tasks

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023

[Feature][Docs] Explain how to specify container command for head pod (…

bcb6b3e

…ray-project#912) Users want to execute some commands at two timings: (1) Before `ray start` (2) After `ray start`
