[Feature] Test sample RayCluster YAMLs to catch invalid or out of date ones #678

kevin85421 · 2022-11-03T22:04:24Z

Why are these changes needed?

Use #605 to test sample RayCluster YAMLs. This PR found:

- Invalid YAML: Remove ray-cluster.without-block.yaml #675
- Outdated YAML: ray-cluster.getting-started.yaml
- Bug: [Bug] label rayNodeType is useless #679

Related issue number

Closes #642

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

cd tests/framework
python3 test_sample_raycluster_yamls.py 2>&1 | tee log
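
For readers new to this kind of test, here is a minimal sketch of how a script might gather sample YAML files before exercising each one. The `collect_sample_yamls` helper, the directory argument, and the skip list are hypothetical and are not taken from `test_sample_raycluster_yamls.py`, which may discover files differently.

```python
from pathlib import Path


def collect_sample_yamls(sample_dir, skip=()):
    """Return sorted *.yaml paths directly under sample_dir, minus skipped names.

    Hypothetical helper for illustration only.
    """
    skip_names = set(skip)
    return sorted(
        path
        for path in Path(sample_dir).glob("*.yaml")
        if path.name not in skip_names
    )
```

A skip list like this is one way to temporarily exclude a sample that is known to be invalid while the fix is tracked elsewhere.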

ray-operator/README.md

DmitriGekhtman · 2022-11-04T01:28:08Z

Will lgtm once you get the new test to pass consistently.

Co-authored-by: Dmitri Gekhtman <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]>

kevin85421 · 2022-11-04T16:14:21Z

@DmitriGekhtman Four tests hit the CPU constraint of GitHub Actions ("a standard Linux runner has 2-core CPU (x86_64), 7GB of RAM, and 14GB of SSD space."), as we discussed in the Design: KubeRay E2E Configuration tests.

I used os.system(f'kubectl describe pods -n={self.namespace}') to gather more information when RayClusterAddCREvent fails to converge. You can search for "Insufficient cpu" in the CI results.

Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  8s (x3 over 92s)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu.
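
The diagnostic step described above could also be written with `subprocess`, which captures the output so it can be searched programmatically instead of by hand. This is a sketch, not the code in this PR: the function names `describe_pods_command`, `describe_pods`, and `scheduling_failures` are hypothetical, and `describe_pods` assumes `kubectl` is on the PATH and configured for the test cluster.

```python
import subprocess


def describe_pods_command(namespace):
    """Build the kubectl command as an argument list (no shell involved)."""
    return ["kubectl", "describe", "pods", f"-n={namespace}"]


def scheduling_failures(describe_output, needle="Insufficient cpu"):
    """Return the lines of `kubectl describe` output that contain needle."""
    return [line.strip() for line in describe_output.splitlines() if needle in line]


def describe_pods(namespace):
    """Run kubectl and return stdout and stderr combined as one string."""
    result = subprocess.run(
        describe_pods_command(namespace),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # a failing kubectl call should not hide the original test failure
    )
    return result.stdout
```

A test could call `scheduling_failures(describe_pods(namespace))` when a cluster fails to converge and include the matching lines in its failure message.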

DmitriGekhtman · 2022-11-04T18:37:25Z

Ok, it seems that to automate test execution, we will have to run these tests in the Ray CI.

Here's a proposed sequence of actions.

For this PR, make sure the tests are passing manually when you run them. You can remove the build step.

After merging this PR, the main priority is manual release testing:

The Ray 2.1.0 release is in progress and a more-or-less final rayproject/ray:2.1.0 image is available.
Run the tests with the 2.1.0 images manually and make sure they pass.
(It is possible that some minor commits will be picked in and the image will be modified before release... just ignore that for now. Or, if you like, repeat the tests when the Ray team is approaching the final 2.1.0 release, which should be within the next few days; you can check the relevant Anyscale-internal Slack channel for details.)

Prior to the KubeRay 0.4.0 release, run the tests manually again with a KubeRay 0.4.0 candidate and Ray 2.1.0.

The next priority is establishing automated pipelines for running these tests in the Ray CI. It's important to track this and do it eventually, but it's fine if we don't put a deadline on it right now.

DmitriGekhtman · 2022-11-04T19:33:40Z

If there's a way to fit some subset of the test into CI, perhaps with modifications to the configs, that would also be good.

kevin85421 · 2022-11-06T00:23:00Z

> If there's a way to fit some subset of the test into CI, perhaps with modifications to the configs, that would also be good.

This PR runs 3 tests on KubeRay CI, and I will open issue #695 to track the progress of running tests on Ray CI.

Although each standard Linux runner has 2 cores, Pods always failed to schedule for me when CPU usage was higher than 800m (0.8 CPU). Maybe there are some CPU fragmentation problems.
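
To make the 800m observation concrete, here is a small sketch that totals Kubernetes CPU quantities and compares them with a budget. The helpers `cpu_to_millicores` and `fits_cpu_budget` are hypothetical, and the 800m default is only the empirical threshold reported in this comment, not a documented GitHub Actions limit.

```python
def cpu_to_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as "500m", "1", or 0.5 to millicores."""
    text = str(quantity).strip()
    if text.endswith("m"):
        return int(text[:-1])
    return round(float(text) * 1000)


def fits_cpu_budget(requests, budget_millicores=800):
    """Return (fits, total_millicores) for an iterable of CPU request quantities."""
    total = sum(cpu_to_millicores(q) for q in requests)
    return total <= budget_millicores, total
```

For example, `fits_cpu_budget(["500m", "300m"])` returns `(True, 800)`, while `fits_cpu_budget(["500m", "0.5"])` returns `(False, 1000)`.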

DmitriGekhtman

Looks good!

raycluster.complete.yaml and raycluster.autoscaler.yaml are important ones to test in the Ray CI.

davidxia · 2022-11-08T02:04:25Z

Thanks, just wondering if this would've caught the bug fixed by #501?

DmitriGekhtman · 2022-11-08T02:26:31Z

Probably not in its current form, but it wouldn't be too hard to extend the framework to validate log volume mounts.

…e ones (ray-project#678) Use ray-project#605 to test sample RayCluster YAMLs. Signed-off-by: Kai-Hsun Chen <[email protected]>

kevin85421 added 6 commits November 3, 2022 21:44

init

a6872f8

update

f23a485

update doc

62b7729

update

a07d36d

fix

8f1ff6e

github actions

b882d63

kevin85421 marked this pull request as ready for review November 4, 2022 01:04

DmitriGekhtman reviewed Nov 4, 2022

kevin85421 and others added 7 commits November 4, 2022 03:45

update

e59892e

update

d502835

add requirements.txt

2a389da

Update ray-operator/README.md

0397529

Co-authored-by: Dmitri Gekhtman <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]>

update

d004673

fix

2de7525

update

1e7d66a

kevin85421 added 2 commits November 5, 2022 17:53

reduce CPU requests

29e7710

update

beda055

kevin85421 mentioned this pull request Nov 6, 2022

[Feature] Migrate CI infra from GitHub Actions to Buildkite #695

Closed

2 tasks

DmitriGekhtman approved these changes Nov 7, 2022

DmitriGekhtman merged commit b4b1ce7 into ray-project:master Nov 7, 2022

kevin85421 mentioned this pull request Nov 14, 2022

[Feature] Test sample RayService YAMLs to catch invalid or out of date ones #719

Closed

2 tasks

kevin85421 mentioned this pull request Dec 27, 2022

[Feature] Define a general-purpose cleanup method for CREvent #849

Merged

4 tasks
