
Revise sample configs, increase memory requests, update Ray versions #761

Conversation

@DmitriGekhtman (Collaborator) commented Nov 25, 2022

Signed-off-by: Dmitri Gekhtman [email protected]

Why are these changes needed?

This PR

  • Cleans up and updates sample configs and test files
  • Increments Ray versions to the latest Ray release (2.1.0)

Memory resource requests are increased in some sample configs.
The Ray head pod is at high risk of running out of memory if it is allocated less than 2 GB of memory, so those requests have been raised accordingly.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 (Member) left a comment

Thanks @DmitriGekhtman for this contribution! Would you mind updating:

(1) rayVersion and image for all sample YAML files in ray-operator/config/samples (a sketch of the relevant fields appears at the end of this comment)

(2) the following line to the latest Ray release (i.e. 2.1.0):

os.getenv('RAY_IMAGE', default='rayproject/ray:2.0.0'),

(3) Run configuration tests on your local machine. Currently, only 3 small configuration YAMLs can be tested in KubeRay CI. We can test ray-cluster.autoscaler.yaml and ray-cluster.complete.yaml by running the following command in your local environment:

python3 test_sample_raycluster_yamls.py 2>&1 | tee log

"ray-cluster.getting-started.yaml",
"ray-cluster.ingress.yaml",
"ray-cluster.mini.yaml"

@kevin85421 (Member) commented

By the way, would you mind describing in more detail the relationship between increasing memory requests and OOM? Thanks!

Without the GKE context, this is hard to understand. My understanding was that a Pod is OOM-killed when:

(1) Memory usage exceeds the memory limit. => In this case, making the memory request equal to the memory limit cannot resolve the OOM.

(2) The cluster runs out of resources; Pods whose memory usage is above their memory request (but below their memory limit) will be killed before Pods whose usage is below their request. => In this case, making the memory request equal to the memory limit will make other Pods OOM or leave Pods unschedulable.

Only in the context of GKE Autopilot does it make sense to avoid OOM by increasing memory requests. Thanks!

@DmitriGekhtman (Collaborator, Author) commented Nov 25, 2022

Right, there are two related motivations:

  1. GKE Autopilot has a tendency to automatically decrease memory limits to match requests, which causes unexpected OOMs.

  2. Ray has no understanding of limits vs. requests, as Ray was designed with VMs in mind. KubeRay advertises the memory limit to Ray as the Ray node's memory capacity and that's all that Ray sees. Thus, for consistency of Ray's behavior, it's best to simply line up limits and requests.
    Otherwise, whether your Ray pod OOMs or not is sensitive to the K8s environment in ways that are difficult to understand.
    [K8s container limit] == [K8s container request] == [Ray node resource capacity] is much easier to work with.
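
For concreteness, a minimal sketch of a head container's resource stanza under this scheme (values are illustrative, assuming a 2 GB head as mentioned in the description):

    containers:
    - name: ray-head
      image: rayproject/ray:2.1.0
      resources:
        requests:
          cpu: "1"
          memory: 2Gi   # request == limit, so K8s scheduling matches what Ray is told
        limits:
          cpu: "1"
          memory: 2Gi   # KubeRay advertises this limit to Ray as the node's memory capacity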

@DmitriGekhtman (Collaborator, Author) commented Nov 25, 2022

(2) The cluster runs out of resources; Pods whose memory usage is above their memory request (but below their memory limit) will be killed before Pods whose usage is below their request. => In this case, making the memory request equal to the memory limit will make other Pods OOM or leave Pods unschedulable.

In particular, if you don't set requests=limits, you're prone to unexpected results due to the K8s behavior you've described.
You are more prone to noisy-neighbor issues that depend on what your K8s nodes look like and what other workloads are running on those nodes. So when a Ray worker OOMs, you won't know whether you made a mistake setting up your Ray workload or whether another user was running a resource-intensive application that landed on the same physical K8s nodes.

@DmitriGekhtman (Collaborator, Author) commented Nov 25, 2022

TODO: Run configuration tests.

I'll also add a blurb to the development docs about these tests (if that's not already there).

@kevin85421 (Member) commented

The update looks good to me. Feel free to merge it if the configuration tests pass.

@kevin85421 (Member) left a comment

  • Update the version of Ray from 2.0.0 to 2.1.0 in the configuration test CI (a sketch of the updated job follows this list):

    sample-yaml-config-test-2_0_0:
      needs:
      - build_operator
      - build_apiserver
      - lint
      runs-on: ubuntu-latest
      name: Sample YAML Config Test - 2.0.0
      steps:
      - name: Check out code into the Go module directory
        uses: actions/checkout@v2
        with:
          # When checking out the repository that
          # triggered a workflow, this defaults to the reference or SHA for that event.
          # Default value should work for both pull_request and merge(push) event.
          ref: ${{github.event.pull_request.head.sha}}
      - uses: ./.github/workflows/actions/configuration
        with:
          ray_version: 2.0.0

  • Update the default Ray image from 2.0.0 to 2.1.0 in configuration test scripts.
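
For the first bullet, a minimal sketch of the updated job, assuming only the job key, the display name, and ray_version change (everything else stays as quoted above):

    sample-yaml-config-test-2_1_0:
      needs:
      - build_operator
      - build_apiserver
      - lint
      runs-on: ubuntu-latest
      name: Sample YAML Config Test - 2.1.0
      steps:
      - name: Check out code into the Go module directory
        uses: actions/checkout@v2
        with:
          ref: ${{github.event.pull_request.head.sha}}
      - uses: ./.github/workflows/actions/configuration
        with:
          ray_version: 2.1.0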

@DmitriGekhtman changed the title from "Revise sample configs, increase memory requests." to "Revise sample configs, increase memory requests, update Ray versions" on Nov 28, 2022
@DmitriGekhtman (Collaborator, Author) commented

Need to debug some of the tests -- great to see the tests catching misconfigurations.

@DmitriGekhtman (Collaborator, Author) commented

Woooooo, the CI is green after #759
Nice work @kevin85421. Would you mind giving this PR a final review?

@Jeffwan (Collaborator) left a comment

Most parts look good to me. I didn't check the test framework part; I'll leave that to Kevin.

@kevin85421 (Member) left a comment

It looks good to me, but it is hard for me to verify this big PR. The good news is that I will check all sample YAML files [1] before the v0.4.0 release, except ray-cluster.complete.large.yaml and ray-cluster.autoscaler.large.yaml. Hence, feel free to merge this PR if you double-check those two large YAML files. Maybe we can update their memory requests, cpu requests, and replicas so that they can be tested on your laptop.

[1] https://docs.google.com/spreadsheets/d/1wlTXCWNtQxCUENa0fP2-dV6UYNUhUCix4exiGsep5GQ/edit#gid=1603824512
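
A hedged sketch of what a scaled-down worker group for laptop testing might look like (field names follow the RayCluster workerGroupSpecs schema; the group name and resource values are illustrative, not taken from the actual large samples):

    workerGroupSpecs:
    - groupName: small-group         # hypothetical group name
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.1.0
            resources:
              requests:
                cpu: 500m
                memory: 1Gi
              limits:
                cpu: 500m
                memory: 1Gi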

@@ -44,30 +44,13 @@ spec:
headGroupSpec:
# Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
serviceType: ClusterIP
# the pod replicas in this group typed head (assuming there could be more than 1 in the future)
(Member) left a comment

This file should be deleted.

@DmitriGekhtman (Collaborator, Author) replied

Thanks for pointing that out!

@@ -44,30 +44,18 @@ spec:
headGroupSpec:
# Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
serviceType: ClusterIP
# the pod replicas in this group typed head (assuming there could be more than 1 in the future)
(Member) left a comment

This file should be deleted.

@DmitriGekhtman (Collaborator, Author) replied

Will do!

@DmitriGekhtman (Collaborator, Author) commented

It looks good to me, but it is hard for me to verify this big PR. The good news is that I will check all sample YAML files [1] before the v0.4.0 release, except ray-cluster.complete.large.yaml and ray-cluster.autoscaler.large.yaml. Hence, feel free to merge this PR if you double-check those two large YAML files. Maybe we can update their memory requests, cpu requests, and replicas so that they can be tested on your laptop.

I'll double-check those.
The configuration test passes!

@DmitriGekhtman (Collaborator, Author) commented

I tested ray-cluster.complete.large.yaml and ray-cluster.autoscaler.large.yaml on GKE -- everything LGTM.

@DmitriGekhtman (Collaborator, Author) commented

CI looks nice and green. Manual testing does not reveal any issues. Merging!

@DmitriGekhtman DmitriGekhtman merged commit 5d38eda into ray-project:master Dec 1, 2022
DmitriGekhtman added a commit that referenced this pull request Dec 7, 2022
Closes #805, which was caused by incomplete config cleanup in #761.

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…ay-project#761)

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Closes ray-project#805, which was caused by incomplete config cleanup in ray-project#761.
