-
Notifications
You must be signed in to change notification settings - Fork 402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RayJob] Add e2e sample yaml test for shutdownAfterJobFinishes #1269
[RayJob] Add e2e sample yaml test for shutdownAfterJobFinishes #1269
Conversation
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
# } | ||
#}' | ||
# ShutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false. | ||
shutdownAfterJobFinishes: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only change in the new sample YAML is setting shutdownAfterJobFinishes
to true and setting ttlSecondsAfterFinished
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not ideal to copy the entire YAML file and only change one line, because if this YAML file needs to be updated then it's hard to remember to update the other one. Another approach would be to update shutdownAfterJobFinishes
in the original YAML file at runtime, but it doesn't seem to fit in well with the current test framework (which accepts a list of sample YAML files to test.)
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
I also tried to add a test in https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/rayjob_controller_test.go but I couldn't get it to work. I think it might be related to https://book.kubebuilder.io/reference/envtest.html#testing-considerations. |
Is this necessary to be included in v0.6.0? |
No, it just adds a test. No code changes were necessary to make the test pass. |
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
apiVersion: ray.io/v1alpha1 | ||
kind: RayJob | ||
metadata: | ||
name: rayjob-sample |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could come up with a new name for this CR.
name: dashboard | ||
- containerPort: 10001 | ||
name: client | ||
- containerPort: 8000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this port?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I think we can remove it. I think it's the default Ray Serve port, but it's not needed for Ray Job.
tests/framework/prototype.py
Outdated
class ShutdownJobRule(Rule): | ||
"""Check the Ray cluster is shutdown when setting `spec.shutdownAfterJobFinishes` to true.""" | ||
def assert_rule(self, custom_resource=None, cr_namespace='default'): | ||
custom_api = client.CustomObjectsApi() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple K8s API client instances will introduce unstability. You can reuse K8S_CLUSTER_MANAGER.k8s_client_dict[CONST.K8S_CR_CLIENT_KEY]
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't know this, thanks for the heads up!
tests/framework/prototype.py
Outdated
converge = False | ||
k8s_v1_api = K8S_CLUSTER_MANAGER.k8s_client_dict[CONST.K8S_V1_CLIENT_KEY] | ||
custom_api = client.CustomObjectsApi() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
steps = "spec.shutdownAfterJobFinishes".split('.') | ||
value = search_path(custom_resource, steps) | ||
logger.info("ShutdownJobRule trigger_condition(): %s", value) | ||
return value is not None and value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, great catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually value
seems to be a bool
. I added an assertion just to be safe: https://github.com/ray-project/kuberay/pull/1269/files#diff-665dbdd284a6d0222ec3ddb2710957546d6c3974808765d71fb73d119d4cbef8R277
assert isinstance(value, bool) or value is None
class Counter: | ||
def __init__(self): | ||
# Used to verify runtimeEnv | ||
self.name = os.getenv("counter_name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add an assertion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I can also add one in the requests
version check
@kevin85421 Thanks for the review. The new CR is just a copy of the existing RayJob sample CR, the only change is the In general what's our philosophy for how to write tests for these fields? Maintaining a whole new CR with just one field changed doesn't seem like the best approach. But the e2e test framework requires a list of sample YAMLs as input. Is it acceptable to create the modified sample YAML file at runtime? |
Co-authored-by: Kai-Hsun Chen <[email protected]> Signed-off-by: Archit Kulkarni <[email protected]>
|
Signed-off-by: Archit Kulkarni <[email protected]>
…/kuberay into add-shutdown-test
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
@kevin85421 Thanks for the review! Comments have been addressed, please take another look. For now I just included two sample YAML files, but in the future when we add more tests I can refactor the RayJob test runner to look more like the The test passed with the expected behavior: Test 0 (no shutdown): https://github.com/ray-project/kuberay/actions/runs/5742686996/job/15566967892#step:3:1571
Test 1 (shutdown): https://github.com/ray-project/kuberay/actions/runs/5742686996/job/15566967892#step:3:1596
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Would you mind opening a follow-up PR to link the YAML files as examples in the RayJob doc?
…roject#1269) This PR adds an end to end test for shutdownAfterJobFinishes. It tests that the RayCluster is deleted after the job completes. Potential followups: Test that the underlying Kubernetes job (the submitterPod which runs ray job submit) is also deleted Related issue number Closes ray-project#1200 --------- Signed-off-by: Archit Kulkarni <[email protected]> Co-authored-by: Kai-Hsun Chen <[email protected]>
Why are these changes needed?
This PR adds an end to end test for
shutdownAfterJobFinishes
. It tests that the RayCluster is deleted after the job completes.Potential followups:
ray job submit
) is also deletedRelated issue number
Closes #1200
Checks