[RayJob] Add default CPU and memory for job submitter pod #1319
Signed-off-by: Archit Kulkarni <[email protected]>
@@ -112,6 +113,16 @@ func GetDefaultSubmitterTemplate(rayJobInstance *rayv1alpha1.RayJob) v1.PodTempl
{
Name: "ray-job-submitter",
Image: image,
Resources: v1.ResourceRequirements{
Limits: v1.ResourceList{
v1.ResourceCPU: resource.MustParse("1"),
Why do we select 1 CPU and 1 GB of memory rather than other values?
It's a guess. Let me try to benchmark it
Updated using the benchmark result.
@@ -67,23 +67,3 @@ runs:
run: |
python tests/test_sample_raycluster_yamls.py
shell: bash

- name: Run tests for sample RayJob YAML files with the nightly operator.
Why do we remove this one? Has this one been moved to Buildkite?
Let's merge #1321 first; it adds this test to Buildkite.
It appears this has already been removed in some other PR.
This is necessary for Flyte. Let's merge this PR before v1.0.0-rc.1. cc @pingsutw
Got it, I'll prioritize the benchmarking to figure out how much CPU and memory to use. Until then, I guess Flyte can configure the submitterPodTemplate themselves, because we expose this in the RayJob YAML.
Flyte provides a Python SDK for its Ray users. However, the current SDK does not expose
Signed-off-by: Archit Kulkarni <[email protected]>
This reverts commit d56f803.
…eray into add-resources
…add-resources Signed-off-by: Archit Kulkarni <[email protected]>
Tests are ok,
@kevin85421 any other concerns about this PR?
LGTM. Could you also update the doc in the Ray repo?
Adds a default CPU and memory resource requirement for the pod that submits the RayJob.
Here's the memory and CPU usage of the job submitter pod with a job entrypoint script that prints "hello world" in a tight loop from Python. I tested it on a local kind cluster. Since the job submitter pod just calls ray job submit, which runs the job remotely and streams the job output to stdout on the submitter pod, this kind of log streaming job should be the most stressful from the perspective of the job submitter pod.
Based on this, we set the requests to 500m CPU and 200MiB memory, and set the limits to 1 CPU and 1GiB memory.
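To make the units concrete, here is a minimal stdlib-only sketch of the defaults described above. It is not the operator's code: the actual change sets v1.ResourceRequirements with resource.MustParse, as the diff shows. The type submitterResources and the helpers parseMilliCPU and parseMemoryBytes are hypothetical stand-ins, and the quantity strings assume the usual Kubernetes spellings ("500m", "200Mi", "1Gi") for the values quoted in this description.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// submitterResources is a hypothetical, simplified stand-in for
// v1.ResourceRequirements. It holds Kubernetes-style quantity strings.
type submitterResources struct {
	RequestCPU, RequestMemory string
	LimitCPU, LimitMemory     string
}

// defaultSubmitterResources returns the defaults quoted in the PR
// description: requests of 500m CPU / 200Mi, limits of 1 CPU / 1Gi.
func defaultSubmitterResources() submitterResources {
	return submitterResources{
		RequestCPU: "500m", RequestMemory: "200Mi",
		LimitCPU: "1", LimitMemory: "1Gi",
	}
}

// parseMilliCPU converts whole cores ("1") or millicores ("500m") to
// millicores. It covers only these two forms, unlike the real parser.
func parseMilliCPU(q string) (int64, error) {
	if strings.HasSuffix(q, "m") {
		return strconv.ParseInt(strings.TrimSuffix(q, "m"), 10, 64)
	}
	cores, err := strconv.ParseInt(q, 10, 64)
	if err != nil {
		return 0, err
	}
	return cores * 1000, nil
}

// parseMemoryBytes converts "Mi" and "Gi" quantities (or plain bytes)
// to bytes. Other suffixes are intentionally not handled in this sketch.
func parseMemoryBytes(q string) (int64, error) {
	units := []struct {
		suffix string
		factor int64
	}{
		{"Gi", 1 << 30},
		{"Mi", 1 << 20},
	}
	for _, u := range units {
		if strings.HasSuffix(q, u.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(q, u.suffix), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * u.factor, nil
		}
	}
	return strconv.ParseInt(q, 10, 64)
}

func main() {
	r := defaultSubmitterResources()
	reqCPU, err := parseMilliCPU(r.RequestCPU)
	if err != nil {
		panic(err)
	}
	limCPU, err := parseMilliCPU(r.LimitCPU)
	if err != nil {
		panic(err)
	}
	reqMem, err := parseMemoryBytes(r.RequestMemory)
	if err != nil {
		panic(err)
	}
	limMem, err := parseMemoryBytes(r.LimitMemory)
	if err != nil {
		panic(err)
	}
	fmt.Printf("CPU request %dm, limit %dm\n", reqCPU, limCPU)
	fmt.Printf("memory request %d bytes, limit %d bytes\n", reqMem, limMem)
	// Kubernetes requires each request to be no larger than its limit.
	fmt.Println("requests within limits:", reqCPU <= limCPU && reqMem <= limMem)
}
```

The gap between request and limit is the point of the design: the scheduler reserves only the request, while the limit leaves headroom for bursts of log streaming.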
Why are these changes needed?
It's good to provide reasonable defaults to reduce the burden on the user.
Related issue number
Closes #1198
Checks