Revise sample configs, increase memory requests, update Ray versions #761
Conversation
Thanks @DmitriGekhtman for this contribution! Would you mind updating:

(1) rayVersion and image for all sample YAML files in ray-operator/config/samples

(2) The following line, bumping it to the latest Ray release (i.e. 2.1.0); a sketch of the bumped line appears after this list:

```python
os.getenv('RAY_IMAGE', default='rayproject/ray:2.0.0'),
```

(3) Run configuration tests on your local machine. Currently, only 3 small configuration YAMLs are tested in KubeRay CI. We can test ray-cluster.autoscaler.yaml and ray-cluster.complete.yaml by running the following command in your local environment:

```sh
python3 test_sample_raycluster_yamls.py 2>&1 | tee log
```

The YAMLs currently covered by CI (kuberay/tests/framework/test_sample_raycluster_yamls.py, lines 29 to 31 in 49c44bf):

```python
"ray-cluster.getting-started.yaml",
"ray-cluster.ingress.yaml",
"ray-cluster.mini.yaml"
```
By the way, would you mind describing more about the relationship between increasing memory requests and OOM? Without the context of GKE, it is hard to understand. My understanding is that a Pod gets OOM-killed when (1) its memory usage exceeds its memory limit, and (2) when a node runs out of resources, Pods whose usage is above their memory request (but below their limit) are evicted before Pods whose usage is below their request. In that case, it seems that avoiding OOM by increasing memory requests only makes sense in the context of GKE Autopilot. Thanks!
Right, there are two related motivations. In particular, if you don't set requests=limits, you're prone to unexpected results due to the K8s behavior you've described.
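To make the pattern concrete, here is a minimal sketch of head-container resources with requests set equal to limits (the sizes are illustrative, not necessarily the exact values in this PR's configs):

```yaml
# When every container in a Pod sets requests equal to limits, the Pod gets
# the Guaranteed QoS class and is evicted last under node memory pressure.
# Per this PR, the Ray head is prone to OOM below roughly 2GB of memory.
resources:
  requests:
    cpu: "1"
    memory: "2G"
  limits:
    cpu: "1"
    memory: "2G"
```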
TODO: Run configuration tests. I'll also add a blurb to the development docs about these tests (if that's not there already).
The update looks good to me. Feel free to merge it if the configuration tests pass.
- Update the version of Ray from 2.0.0 to 2.1.0 in configuration test CI (kuberay/.github/workflows/test-job.yaml, lines 343 to 360 in 49c44bf):

```yaml
sample-yaml-config-test-2_0_0:
  needs:
    - build_operator
    - build_apiserver
    - lint
  runs-on: ubuntu-latest
  name: Sample YAML Config Test - 2.0.0
  steps:
    - name: Check out code into the Go module directory
      uses: actions/checkout@v2
      with:
        # When checking out the repository that
        # triggered a workflow, this defaults to the reference or SHA for that event.
        # Default value should work for both pull_request and merge(push) event.
        ref: ${{github.event.pull_request.head.sha}}
    - uses: ./.github/workflows/actions/configuration
      with:
        ray_version: 2.0.0
```

- Update the default Ray image from 2.0.0 to 2.1.0 in configuration test scripts.
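Applied to the workflow snippet above, the bump would look roughly like this (the renamed job key and display name follow the existing naming scheme, but are assumptions):

```yaml
sample-yaml-config-test-2_1_0:
  needs:
    - build_operator
    - build_apiserver
    - lint
  runs-on: ubuntu-latest
  name: Sample YAML Config Test - 2.1.0
  steps:
    - name: Check out code into the Go module directory
      uses: actions/checkout@v2
      with:
        ref: ${{github.event.pull_request.head.sha}}
    - uses: ./.github/workflows/actions/configuration
      with:
        ray_version: 2.1.0
```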
Need to debug some of the tests -- great to see the tests catching misconfigurations.
Woooooo, the CI is green after #759
Most parts look good to me. I didn't check the test framework part; leaving it to Kevin.
It looks good to me, but it is hard for me to verify this big PR. The good thing is that I will check all sample YAML files [1] before the v0.4.0 release, except ray-cluster.complete.large.yaml and ray-cluster.autoscaler.large.yaml. Hence, feel free to merge this PR if you double-check the two large YAML files. Maybe we can update their memory requests, CPU requests, and replicas to enable them to be tested on your laptop; a sketch of what that might look like follows.
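For example, a scaled-down worker group for laptop testing might look like this (the group name and resource sizes are illustrative, not values taken from the actual large YAMLs):

```yaml
workerGroupSpecs:
  - groupName: small-group   # hypothetical group name
    replicas: 1              # scaled down so the cluster fits on a laptop
    minReplicas: 1
    maxReplicas: 2
    template:
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:2.1.0
            resources:
              requests:
                cpu: "500m"
                memory: "1G"
              limits:
                cpu: "500m"
                memory: "1G"
```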
```diff
@@ -44,30 +44,13 @@ spec:
   headGroupSpec:
     # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
     serviceType: ClusterIP
-    # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
```
This file should be deleted.
Thanks for pointing that out!
```diff
@@ -44,30 +44,18 @@ spec:
   headGroupSpec:
     # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
     serviceType: ClusterIP
-    # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
```
This file should be deleted.
Will do!
I'll double-check those.
I tested raycluster.complete.large and raycluster.autoscaler.large on GKE -- everything LGTM.
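For reference, manually verifying a sample config on a live cluster amounts to something like the following (standard kubectl workflow; this assumes the KubeRay operator and CRDs are already installed):

```sh
kubectl apply -f ray-operator/config/samples/ray-cluster.autoscaler.large.yaml
kubectl get rayclusters   # the RayCluster should reach a ready state
kubectl get pods          # head and worker pods should be Running
kubectl delete -f ray-operator/config/samples/ray-cluster.autoscaler.large.yaml
```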
CI looks nice and green. Manual testing does not reveal any issues. Merging!
Closes #805, which was caused by incomplete config cleanup in #761.
…ay-project#761)

This PR
- Cleans up and updates sample configs and test files
- Increments Ray versions to the latest Ray release (2.1.0)
- Memory resource requests are increased in some sample configs.
- The Ray head pod has a high risk of running out of memory if it is allocated less than 2GB of memory, so that's fixed.
Signed-off-by: Dmitri Gekhtman [email protected]
Why are these changes needed?

This PR
- Cleans up and updates sample configs and test files
- Increments Ray versions to the latest Ray release (2.1.0)
- Memory resource requests are increased in some sample configs.
- The Ray head pod has a high risk of running out of memory if it is allocated less than 2GB of memory, so that's fixed.
Related issue number
Checks