-
Notifications
You must be signed in to change notification settings - Fork 402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature][GCS FT] Clean up Redis once a GCS FT-Enabled RayCluster is deleted #1412
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, just a few minor questions. Feel free to address them in a followup PR if needed
@@ -51,17 +51,10 @@ func GetHeadPort(headStartParams map[string]string) string { | |||
return headPort | |||
} | |||
|
|||
// rayClusterHAEnabled check if RayCluster enabled FT in annotations | |||
func rayClusterHAEnabled(instance rayv1alpha1.RayCluster) bool { | |||
if instance.Annotations == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we need this check anymore? Won't we get an error if Annotations is nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nil map in Golang can be a bit tricky. It won't throw an exception if you look up a nil map, but it will throw an exception when you attempt to write to a nil map.
A nil map behaves like an empty map when reading, but attempts to write to a nil map will cause a runtime panic;
-
example:
package main import "fmt" func main() { var rect map[string]int val, ok := rect["hi"] fmt.Println(val, ok) // 0, false rect["height"] = 10 // panic: assignment to entry in nil map fmt.Println(rect["height"]) }
pod.Spec.Containers[common.RayContainerIndex].Command = []string{"/bin/bash", "-lc", "--"} | ||
pod.Spec.Containers[common.RayContainerIndex].Args = []string{ | ||
"python -c " + | ||
"\"from ray._private.gcs_utils import cleanup_redis_storage; " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate that we have to use a private API for this. Do you know if we test this codepath in the Kuberay repo against the Ray nightly wheels? That will help us catch the error if the private function changes its behavior or name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open a good first issue.
#1422
if len(redisCleanupJobs.Items) != 0 { | ||
// Check whether the Redis cleanup Job has been completed. | ||
redisCleanupJob := redisCleanupJobs.Items[0] | ||
if redisCleanupJob.Status.Succeeded > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do something if the status is Failed? (Maybe log a message at least?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated aca1b0a
Co-authored-by: Archit Kulkarni <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]>
…deleted (ray-project#1412) Clean up Redis once a GCS FT-Enabled RayCluster is deleted.
My understanding is that the RayCluster sample YAML test framework only adds RayCluster CRs, but doesn't add the other resources in the sample YAML file (for example the external redis deployment). In the case of the external redis sample YAML, the sample YAML test started failing after #1412 and my tentative hypothesis is that the cleanup job added by the PR hangs if there's no external redis. For now, we should merge this PR to unbreak CI. Later, we can decide whether to properly support an end-to-end external redis test. Related issue number Closes #1459 Signed-off-by: Archit Kulkarni <[email protected]>
My understanding is that the RayCluster sample YAML test framework only adds RayCluster CRs, but doesn't add the other resources in the sample YAML file (for example the external redis deployment). In the case of the external redis sample YAML, the sample YAML test started failing after ray-project#1412 and my tentative hypothesis is that the cleanup job added by the PR hangs if there's no external redis. For now, we should merge this PR to unbreak CI. Later, we can decide whether to properly support an end-to-end external redis test. Related issue number Closes ray-project#1459 Signed-off-by: Archit Kulkarni <[email protected]>
Add a section to explain ray-project/kuberay#1412. --------- Signed-off-by: Kai-Hsun Chen <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]> Co-authored-by: Hongchao Deng <[email protected]>
Why are these changes needed?
TODO
backoffLimit
,completions
, ... etc.Related issue number
Closes #1286
Checks