-
Notifications
You must be signed in to change notification settings - Fork 402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature][Docs] Explain how to specify container command for head pod #912
Conversation
I wonder if Ray start should come first and be the prefix for the user command?
I am thinking the user might want ray to start first, and then follow by running their Python application that runs on top of ray (after ray has started); e.g. What is the dominant use case here from what we hear from the users? |
Thank you @kevin85421 ! Based on the documentation here, was able to use |
Thank @akanso for the comments! The user-specified command comes first because the option kuberay/ray-operator/controllers/ray/common/pod.go Lines 313 to 314 in 4714892
Users want to execute some commands at two timings:
In addition, container commands can make RayJob become much more stable by reducing some non-idempotent operations. See #756 for more details. |
Should we have the --block option as a part of Ray-Start-Parameters in the head pod config? |
I am looking at following line which seems to indicate that blocking is effective after the end of all commands. Probably order doesnt matter? Am I missing something?
|
# RayCluster
rayStartParams:
...
block: 'true'
# kubectl describe pod $HEAD_POD
Command:
/bin/bash
-lc
--
Args:
ulimit -n 65536; ray start --head --block --dashboard-host=0.0.0.0 --metrics-export-port=8080 --num-cpus=1 --memory=2000000000
# RayCluster
rayStartParams:
...
# block: 'true'
Command:
/bin/bash
-lc
--
Args:
ulimit -n 65536; ray start --head --dashboard-host=0.0.0.0 --metrics-export-port=8080 --num-cpus=1 --memory=2000000000 && sleep infinity
There is still a little difference between We did not encourage users to run ray start without
In addition, from @DmitriGekhtman's comment #675 (comment), "Later, we can consider inject the |
LGTM! To summarize based on my understanding:
btw, I see the code here actually allows user to add |
@Yicheng-Lu-llll Thank you for your review!
Your understanding is correct.
Agree. The logic is very weird, but removing the logic will break the backward compatibility. I will open a new issue (#917) to track the progress and discuss whether we need to remove the logic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just had some suggestions about readability.
If we plan to change the container command logic soon, no need to optimize the readability too much here; feel free to ignore the suggestions.
docs/guidance/head-command.md
Outdated
# `command` and `args` will become a part of `spec.containers.0.args` in the head Pod. | ||
command: ["echo 123"] | ||
``` | ||
* Running head Pod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also have a way to run on the worker pod? If not, we should state this explicitly somewhere in the beginning, and possibly we should get rid of this section title
* Running head Pod | |
* Running on the head Pod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Yes, we can use the same method to specify container commands on the worker Pods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet! Then we can update the top of the section to be more general
# Specify container commands for head Pod
You can execute commands on the head pod at two timings:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. 642cbd6
docs/guidance/head-command.md
Outdated
* `spec.containers.0.command` is hardcoded with `["/bin/bash", "-lc", "--"]`. | ||
* `spec.containers.0.args` contains two parts: | ||
* (Part 1) **user-specified command**: A string concatenates `headGroupSpec.template.spec.containers.0.command` from RayCluster and `headGroupSpec.template.spec.containers.0.args` from RayCluster together. In this example, the string will be `echo 123` because `headGroupSpec.template.spec.containers.0.args` is not defined here. | ||
* (Part 2) **ray start command**: The command is created based on `rayStartParams` specified in RayCluster. The command will look like `ulimit -n 65536; ray start ...`. | ||
* To summarize, `spec.containers.0.args` will be `$(user-specified command) && $(ray start command)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this section is logically complete, but it still took some time for me to understand. Is my understanding correct that the spec.containers.0.command
and spec.containers.0.args
provided by the user get modified and rewritten by the operator? That might be the confusing part, and it would help to start by stating that explicitly. To be precise:
spec.containers.0.args <-- spec.containers.0.command && ray start ...
spec.containers.0.command <-- ["/bin/bash", "-lc", "--"]
Also, what happens if the user specifies "args"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My overall impression is that this modification logic is an implementation detail that most users won't care about, and we can remove it from the docs altogether (and maybe just point to the relevant part of the code for advanced users, or bump it below to a section called "Advanced")
My guess is that all users really want to know from this doc is that if you put command: ["echo 123"]
, it will run before ray start
.
I could be wrong about this though, I'd be interested what other reviewers think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is my understanding correct that the
spec.containers.0.command
andspec.containers.0.args
provided by the user get modified and rewritten by the operator?
Yes, users provide headGroupSpec.template.spec.containers.0.command
and headGroupSpec.template.spec.containers.0.args
in RayCluster CR, and they will be modified and rewritten by the operator to create spec.containers.0.command
and spec.containers.0.args
.
Also, what happens if the user specifies "args"?
spec.containers.0.args <-- [headGroupSpec...command] [headGroupSpec...args] && [ray start ...]
# Example RayCluster
command: ["echo 123"]
args: ["456"]
# Pod args will be
echo 123 456 && [ray start ...]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added args
to the RayCluster example. See f4a47ee for more details.
docs/guidance/head-command.md
Outdated
# {'object_store_memory': 539679129.0, 'node:10.244.0.26': 1.0, 'CPU': 1.0, 'memory': 2147483648.0} | ||
# INFO: Print Ray cluster resources | ||
``` | ||
* The main difference between these two solutions is users can check the logs via `kubectl logs` with Solution 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put the pros and cons of either approach at the beginning of this section? Also, since only one pro is listed, it sounds like Solution 2 is strictly better. Is there any reason to use Solution 1?
If we have a clear recommendation we could put (Recommended) for one of them to reduce the decisions the user has to make.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. 9f63c63
Co-authored-by: Archit Kulkarni <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Archit Kulkarni <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]>
@@ -0,0 +1,148 @@ | |||
# Specify container commands for Ray head/worker Pods | |||
You can execute commands on the head/worker pods at two timings: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some real scenarios for users to better understand why they need to run commands along with the cluster setup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Microsoft did similar way in the past. they like to run some code after cluster is running, and it has conflict with --block
at that time. I think eventually, we determine to have the flag for different users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some real scenarios for users to better understand why they need to run commands along with the cluster setup?
https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1675378764037199
https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669647595429959
The document has already provided some real-world scenarios based on the two slack threads above.
I think Microsoft did similar way in the past. they like to run some code after cluster is running, and it has conflict with --block at that time. I think eventually, we determine to have the flag for different users.
#912 (comment)
Yes, --block
has already been a part of Ray-Start-Parameters.
Technically looks good to me. The lifecycle way is a little bit tricky and it's hard to debug the commands. user has to check events for more details. I would say this works technically but not a recommended way. I am thinking whether we should put it as a |
We currently say that Solution 1 (container command) is the recommended solution and compare the difference between Solution 1 and Solution 2 ( I am not familiar with
|
/lgtm |
…ray-project#912) Users want to execute some commands at two timings: (1) Before `ray start` (2) After `ray start`
Why are these changes needed?
Users want to execute some commands at two timings:
(1) Before
ray start
: Take this slack thread as an example, the user wants to set up some environment variables that will be used byray start
.(2) After
ray start
(RayCluster is ready): Take this slack thread as an example, the user wants to launch a Ray serve deployment when the RayCluster is ready.Related issue number
Closes #651
Checks
Timing 1 example
Timing 2, Solution 1 example
Timing 2, Solution 2 example