Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Test sample RayCluster YAMLs to catch invalid or out of date ones #678

Merged

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Nov 3, 2022

Why are these changes needed?

Use #605 to test sample RayCluster YAMLs. This PR found:

Related issue number

Closes #642

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
cd tests/framework
python3 test_sample_raycluster_yamls.py 2>&1 | tee log

Screen Shot 2022-11-03 at 4 29 53 PM

@kevin85421 kevin85421 marked this pull request as ready for review November 4, 2022 01:04
@DmitriGekhtman
Copy link
Collaborator

Will lgtm once you get the new test to pass consistently.

@kevin85421
Copy link
Member Author

@DmitriGekhtman There are 4 tests hit the CPU constraint ("a standard Linux runner has 2-core CPU (x86_64), 7GB of RAM, and 14GB of SSD space.") of GitHub Actions as we discussed in the Design: KubeRay E2E Configuration tests.

I used os.system(f'kubectl describe pods -n={self.namespace}') to gather more information when RayClusterAddCREvent fails to converge. You can search "Insufficient cpu" in CI result.

Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  8s (x3 over 92s)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu.

@DmitriGekhtman
Copy link
Collaborator

DmitriGekhtman commented Nov 4, 2022

Ok, it seems that to automate test execution, we will have to run these tests in the Ray CI.

Here's a proposed sequence of actions.

For this PR, make sure the tests are passing manually when you run them. You can remove the build step.

After merging this PR, the main priority is manual release testing:

The Ray 2.1.0 release is in progress and a more-or-less final rayproject/ray:2.1.0 image is available.
Run the tests with the 2.1.0 images manually and make sure they pass.
(It is possible that some minor commits will be picked in and the image will be modified before release... just ignore that for now. Or if you like repeat the tests when the Ray team is approaching the final 2.1.0 release -- should be within the next few days; you can check the relevant Anyscale-internal slack channel for details.)

Prior to the KubeRay 0.4.0 release, run the tests manually again with a KubeRay 0.4.0 candidate and Ray 2.1.0.

The next priority is establishing automated pipelines for running these tests in the Ray CI. It's important to track this and do it eventually, but it's fine if we don't put a deadline on it right now.

@DmitriGekhtman
Copy link
Collaborator

If there's a way to fit some subset of the test into CI, perhaps with modifications to the configs, that would also be good.

@kevin85421
Copy link
Member Author

If there's a way to fit some subset of the test into CI, perhaps with modifications to the configs, that would also be good.

This PR runs 3 tests on KubeRay CI, and I will open issue #695 to track the progress of running tests on Ray CI.

Although each standard Linux runner has 2 cores, I always failed to schedule Pods when CPU usage is higher than 800m (0.8 CPU). Maybe there are some CPU fragmentation problems.

Copy link
Collaborator

@DmitriGekhtman DmitriGekhtman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

raycluster.complete.yaml and raycluster.autoscaler.yaml are important ones to test in the Ray CI.

@DmitriGekhtman DmitriGekhtman merged commit b4b1ce7 into ray-project:master Nov 7, 2022
@davidxia
Copy link
Contributor

davidxia commented Nov 8, 2022

Thanks, just wondering if this would've caught the bug fixed by #501?

@DmitriGekhtman
Copy link
Collaborator

Probably not in its current form, but it wouldn't be too hard to extend the framework to validate log volume mounts.

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…e ones (ray-project#678)

Use ray-project#605 to test sample RayCluster YAMLs.

Signed-off-by: Kai-Hsun Chen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test sample RayCluster YAMLs to catch invalid or out of date ones
3 participants