Revise sample configs, increase memory requests, update Ray versions #761
Conversation
Thanks @DmitriGekhtman for this contribution! Would you mind updating:

(1) rayVersion and image for all sample YAML files in ray-operator/config/samples

(2) The following line, bumping it to the latest Ray release (i.e. 2.1.0); a sketch of the bumped line appears after this list:

```python
os.getenv('RAY_IMAGE', default='rayproject/ray:2.0.0'),
```

(3) Run configuration tests on your local machine. Currently, only 3 small configuration YAMLs are tested in KubeRay CI. We can test ray-cluster.autoscaler.yaml and ray-cluster.complete.yaml by running the following command in your local environment:

```sh
python3 test_sample_raycluster_yamls.py 2>&1 | tee log
```

The YAMLs currently covered by CI (kuberay/tests/framework/test_sample_raycluster_yamls.py, lines 29 to 31 in 49c44bf):

```python
"ray-cluster.getting-started.yaml",
"ray-cluster.ingress.yaml",
"ray-cluster.mini.yaml"
```
By the way, would you mind describing more about the relationship between increasing memory requests and OOM? Without the context of GKE, it is hard to understand. My understanding is that a Pod gets OOM-killed when (1) its memory usage exceeds its memory limit, and (2) when a node runs out of resources, Pods whose usage is above their memory request (but below their limit) are evicted before Pods whose usage is below their request. In that case, it seems that avoiding OOM by increasing memory requests only makes sense in the context of GKE Autopilot. Thanks!
Right, there are two related motivations. In particular, if you don't set requests=limits, you're prone to unexpected results due to the K8s behavior you've described.
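To make the pattern concrete, here is a minimal sketch of head-container resources with requests set equal to limits (the sizes are illustrative, not necessarily the exact values in this PR's configs):

```yaml
# When every container in a Pod sets requests equal to limits, the Pod gets
# the Guaranteed QoS class and is evicted last under node memory pressure.
# Per this PR, the Ray head is prone to OOM below roughly 2GB of memory.
resources:
  requests:
    cpu: "1"
    memory: "2G"
  limits:
    cpu: "1"
    memory: "2G"
```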
TODO: Run configuration tests. I'll also add a blurb to the development docs about these tests (if that's not there already).
The update looks good to me. Feel free to merge it if the configuration tests pass.
- Update the version of Ray from 2.0.0 to 2.1.0 in configuration test CI (kuberay/.github/workflows/test-job.yaml, lines 343 to 360 in 49c44bf):

```yaml
sample-yaml-config-test-2_0_0:
  needs:
    - build_operator
    - build_apiserver
    - lint
  runs-on: ubuntu-latest
  name: Sample YAML Config Test - 2.0.0
  steps:
    - name: Check out code into the Go module directory
      uses: actions/checkout@v2
      with:
        # When checking out the repository that
        # triggered a workflow, this defaults to the reference or SHA for that event.
        # Default value should work for both pull_request and merge(push) event.
        ref: ${{github.event.pull_request.head.sha}}
    - uses: ./.github/workflows/actions/configuration
      with:
        ray_version: 2.0.0
```

- Update the default Ray image from 2.0.0 to 2.1.0 in configuration test scripts.
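Applied to the workflow snippet above, the bump would look roughly like this (the renamed job key and display name follow the existing naming scheme, but are assumptions):

```yaml
sample-yaml-config-test-2_1_0:
  needs:
    - build_operator
    - build_apiserver
    - lint
  runs-on: ubuntu-latest
  name: Sample YAML Config Test - 2.1.0
  steps:
    - name: Check out code into the Go module directory
      uses: actions/checkout@v2
      with:
        ref: ${{github.event.pull_request.head.sha}}
    - uses: ./.github/workflows/actions/configuration
      with:
        ray_version: 2.1.0
```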
Need to debug some of the tests -- great to see the tests catching misconfigurations.
Woooooo, the CI is green after #759
Most parts look good to me. I didn't check the test framework part; leaving it to Kevin.
It looks good to me, but it is hard for me to verify this big PR. The good thing is that I will check all sample YAML files [1] before the v0.4.0 release, except ray-cluster.complete.large.yaml and ray-cluster.autoscaler.large.yaml. Hence, feel free to merge this PR if you double-check the two large YAML files. Maybe we can update their memory requests, CPU requests, and replicas to enable them to be tested on your laptop; a sketch of what that might look like follows.
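For example, a scaled-down worker group for laptop testing might look like this (the group name and resource sizes are illustrative, not values taken from the actual large YAMLs):

```yaml
workerGroupSpecs:
  - groupName: small-group   # hypothetical group name
    replicas: 1              # scaled down so the cluster fits on a laptop
    minReplicas: 1
    maxReplicas: 2
    template:
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:2.1.0
            resources:
              requests:
                cpu: "500m"
                memory: "1G"
              limits:
                cpu: "500m"
                memory: "1G"
```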
```diff
@@ -44,30 +44,13 @@ spec:
   headGroupSpec:
     # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
     serviceType: ClusterIP
-    # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
```
This file should be deleted.
Thanks for pointing that out!
```diff
@@ -44,30 +44,18 @@ spec:
   headGroupSpec:
     # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
     serviceType: ClusterIP
-    # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
```
This file should be deleted.
Will do!
I'll double-check those.
I tested raycluster.complete.large and raycluster.autoscaler.large on GKE -- everything LGTM.
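For reference, manually verifying a sample config on a live cluster amounts to something like the following (standard kubectl workflow; this assumes the KubeRay operator and CRDs are already installed):

```sh
kubectl apply -f ray-operator/config/samples/ray-cluster.autoscaler.large.yaml
kubectl get rayclusters   # the RayCluster should reach a ready state
kubectl get pods          # head and worker pods should be Running
kubectl delete -f ray-operator/config/samples/ray-cluster.autoscaler.large.yaml
```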
CI looks nice and green. Manual testing does not reveal any issues. Merging!
Closes #805, which was caused by incomplete config cleanup in #761.
…ay-project#761)

This PR
- Cleans up and updates sample configs and test files
- Increments Ray versions to the latest Ray release (2.1.0)
- Memory resource requests are increased in some sample configs.
- The Ray head pod has a high risk of running out of memory if it is allocated less than 2GB of memory, so that's fixed.
Signed-off-by: Dmitri Gekhtman [email protected]
Why are these changes needed?

This PR
- Cleans up and updates sample configs and test files
- Increments Ray versions to the latest Ray release (2.1.0)
- Memory resource requests are increased in some sample configs.
- The Ray head pod has a high risk of running out of memory if it is allocated less than 2GB of memory, so that's fixed.
Related issue number
Checks