Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayJob] Add e2e sample yaml test for shutdownAfterJobFinishes #1269

Merged
merged 16 commits into from
Aug 8, 2023

Conversation

architkulkarni
Copy link
Contributor

@architkulkarni architkulkarni commented Jul 25, 2023

Why are these changes needed?

This PR adds an end to end test for shutdownAfterJobFinishes. It tests that the RayCluster is deleted after the job completes.

Potential followups:

  • Test that the underlying Kubernetes job (the submitterPod which runs ray job submit) is also deleted

Related issue number

Closes #1200

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
# }
#}'
# ShutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
shutdownAfterJobFinishes: true
Copy link
Contributor Author

@architkulkarni architkulkarni Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change in the new sample YAML is setting shutdownAfterJobFinishes to true and setting ttlSecondsAfterFinished.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not ideal to copy the entire YAML file and only change one line, because if this YAML file needs to be updated then it's hard to remember to update the other one. Another approach would be to update shutdownAfterJobFinishes in the original YAML file at runtime, but it doesn't seem to fit in well with the current test framework (which accepts a list of sample YAML files to test.)

@architkulkarni architkulkarni marked this pull request as ready for review July 25, 2023 18:34
@architkulkarni
Copy link
Contributor Author

@kevin85421
Copy link
Member

Is this necessary to be included in v0.6.0?

@architkulkarni
Copy link
Contributor Author

Is this necessary to be included in v0.6.0?

No, it just adds a test. No code changes were necessary to make the test pass.

Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
@architkulkarni
Copy link
Contributor Author

apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
name: rayjob-sample
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could come up with a new name for this CR.

name: dashboard
- containerPort: 10001
name: client
- containerPort: 8000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this port?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I think we can remove it. I think it's the default Ray Serve port, but it's not needed for Ray Job.

class ShutdownJobRule(Rule):
"""Check the Ray cluster is shutdown when setting `spec.shutdownAfterJobFinishes` to true."""
def assert_rule(self, custom_resource=None, cr_namespace='default'):
custom_api = client.CustomObjectsApi()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple K8s API client instances will introduce unstability. You can reuse K8S_CLUSTER_MANAGER.k8s_client_dict[CONST.K8S_CR_CLIENT_KEY] instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't know this, thanks for the heads up!

converge = False
k8s_v1_api = K8S_CLUSTER_MANAGER.k8s_client_dict[CONST.K8S_V1_CLIENT_KEY]
custom_api = client.CustomObjectsApi()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

steps = "spec.shutdownAfterJobFinishes".split('.')
value = search_path(custom_resource, steps)
logger.info("ShutdownJobRule trigger_condition(): %s", value)
return value is not None and value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value seems to be a string here. In my understanding, value is not None and value will return a string.

Screen Shot 2023-08-01 at 11 17 12 AM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, great catch!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually value seems to be a bool. I added an assertion just to be safe: https://github.com/ray-project/kuberay/pull/1269/files#diff-665dbdd284a6d0222ec3ddb2710957546d6c3974808765d71fb73d119d4cbef8R277

assert isinstance(value, bool) or value is None

class Counter:
def __init__(self):
# Used to verify runtimeEnv
self.name = os.getenv("counter_name")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add an assertion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I can also add one in the requests version check

@architkulkarni
Copy link
Contributor Author

@kevin85421 Thanks for the review. The new CR is just a copy of the existing RayJob sample CR, the only change is the shutdownAfterJobFinishes field.

In general what's our philosophy for how to write tests for these fields? Maintaining a whole new CR with just one field changed doesn't seem like the best approach. But the e2e test framework requires a list of sample YAMLs as input. Is it acceptable to create the modified sample YAML file at runtime?

@kevin85421
Copy link
Member

Is it acceptable to create the modified sample YAML file at runtime?

Yes.
https://github.com/ray-project/kuberay/blob/master/tests/test_sample_rayservice_yamls.py#L223-L227

@architkulkarni
Copy link
Contributor Author

architkulkarni commented Aug 2, 2023

@kevin85421 Thanks for the review! Comments have been addressed, please take another look.

For now I just included two sample YAML files, but in the future when we add more tests I can refactor the RayJob test runner to look more like the RayService test.

The test passed with the expected behavior:

Test 0 (no shutdown): https://github.com/ray-project/kuberay/actions/runs/5742686996/job/15566967892#step:3:1571

2023-08-02:20:12:49,990 INFO     [prototype.py:276] ShutdownJobRule trigger_condition(): None

Test 1 (shutdown): https://github.com/ray-project/kuberay/actions/runs/5742686996/job/15566967892#step:3:1596

2023-08-02:20:13:54,239 INFO     [prototype.py:276] ShutdownJobRule trigger_condition(): True
2023-08-02:20:13:54,239 INFO     [prototype.py:253] Waiting for RayCluster to be deleted...
[...]
2023-08-02:20:14:00,286 INFO     [prototype.py:260] ShutdownJobRule wait() hasn't converged yet.
2023-08-02:20:14:00,287 INFO     [prototype.py:261] Number of RayClusters: 1
2023-08-02:20:14:01,294 INFO     [prototype.py:260] ShutdownJobRule wait() hasn't converged yet.
2023-08-02:20:14:01,294 INFO     [prototype.py:261] Number of RayClusters: 0
2023-08-02:20:14:01,294 INFO     [prototype.py:268] RayCluster has been deleted.

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Would you mind opening a follow-up PR to link the YAML files as examples in the RayJob doc?

@architkulkarni architkulkarni merged commit 7386427 into ray-project:master Aug 8, 2023
18 checks passed
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…roject#1269)

This PR adds an end to end test for shutdownAfterJobFinishes. It tests that the RayCluster is deleted after the job completes.

Potential followups:

Test that the underlying Kubernetes job (the submitterPod which runs ray job submit) is also deleted
Related issue number
Closes ray-project#1200

---------

Signed-off-by: Archit Kulkarni <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RayJob] Add testing for shutdownAfterJobFinishes
2 participants