[Feature][GCS FT] Clean up Redis once a GCS FT-Enabled RayCluster is deleted #1412

Merged: 7 commits, Sep 15, 2023

@kevin85421 (Member) commented Sep 11, 2023

Why are these changes needed?

TODO

  • Kubernetes Job configurations, e.g. backoffLimit, completions, etc.
  • Documentation (including how to delete it if it fails)
  • E2E tests
  • Improve observability
  • Test if the Job fails to connect to Redis.

Related issue number

Closes #1286

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
# Step 0: Create a Kubernetes cluster
# Step 1: Install a KubeRay operator with this PR
# Step 2: Create a GCS FT-enabled RayCluster
# (path: ray-operator/config/samples)
kubectl apply -f ray-cluster.external-redis.yaml

# Step 3: Wait until head Pod is running and ready.

# Step 4: Log in to the Redis Pod
kubectl exec -it $REDIS_POD -- bash

# Step 5: Check Redis data (in the Redis Pod)
redis-cli -a "5241590000000000"
KEYS *
# [Example output]: 1) "4c31680c-2a27-4788-92d8-6867188ecdd8"
HGETALL  "4c31680c-2a27-4788-92d8-6867188ecdd8"

# [Example output]
# 57) "4c31680c-2a27-4788-92d8-6867188ecdd8@KV:@namespace_dashboard:DASHBOARD_AGENT_PORT_PREFIX:19ca12675b387fb9e69ba6613b6c47ebc38ce441176b931fb052e09e"
# 58) "[52365, 43258]"
# 59) "4c31680c-2a27-4788-92d8-6867188ecdd8@KV:@namespace_usage_stats:extra_usage_tag_dashboard_used"
# 60) "False"

# Step 6: Delete the RayCluster
kubectl delete rayclusters.ray.io raycluster-external-redis 

# Step 7: Repeat Step 4 and Step 5
KEYS *
# [Expected output]: (empty list or set)

@kevin85421 changed the title from "[WIP][Feature][GCS FT] Clean up Redis once a GCS FT-Enabled RayCluster is deleted" to "[Feature][GCS FT] Clean up Redis once a GCS FT-Enabled RayCluster is deleted" Sep 12, 2023
@kevin85421 kevin85421 marked this pull request as ready for review September 12, 2023 07:22
@kevin85421 (Member, Author) commented:

cc @smit-kiri @JoshKarpel

@architkulkarni (Contributor) left a comment:

Looks good overall, just a few minor questions. Feel free to address them in a follow-up PR if needed.

@@ -51,17 +51,10 @@ func GetHeadPort(headStartParams map[string]string) string {
return headPort
}

// rayClusterHAEnabled check if RayCluster enabled FT in annotations
func rayClusterHAEnabled(instance rayv1alpha1.RayCluster) bool {
if instance.Annotations == nil {

@architkulkarni (Contributor) commented:

Why don't we need this check anymore? Won't we get an error if Annotations is nil?

@kevin85421 (Member, Author) commented:

Nil maps in Go can be a bit tricky. Reading from a nil map does not panic; the lookup simply returns the zero value. Writing to a nil map, however, causes a runtime panic. As the Go blog puts it:

"A nil map behaves like an empty map when reading, but attempts to write to a nil map will cause a runtime panic."

  • ref: https://go.dev/blog/maps

  • example:

    package main
    
    import "fmt"
    
    func main() {
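       var rect map[string]int     // declared but never initialized, so rect is nil
       val, ok := rect["hi"]       // reading from a nil map is safe
       fmt.Println(val, ok)        // prints "0 false"
       rect["height"] = 10         // panic: assignment to entry in nil map
       fmt.Println(rect["height"]) // never reached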
    }
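
To connect the nil-map behavior back to the removed `Annotations == nil` check, here is a minimal sketch of an annotation lookup that works on a nil map. The function names `isFTEnabled` and `setAnnotation` are hypothetical, and the annotation key `ray.io/ft-enabled` is an assumption made for illustration, not something taken from this PR's diff.

```go
package main

import (
	"fmt"
	"strings"
)

// ftAnnotationKey is an assumed key used only for illustration; check the
// KubeRay source for the actual annotation name.
const ftAnnotationKey = "ray.io/ft-enabled"

// isFTEnabled (hypothetical helper) reports whether the annotations mark
// fault tolerance as enabled. No nil check is needed: a lookup on a nil map
// returns the zero value and ok == false.
func isFTEnabled(annotations map[string]string) bool {
	v, ok := annotations[ftAnnotationKey]
	return ok && strings.ToLower(v) == "true"
}

// setAnnotation (hypothetical helper) shows the write side, where the nil
// check is required because assigning into a nil map panics.
func setAnnotation(annotations map[string]string, key, value string) map[string]string {
	if annotations == nil {
		annotations = make(map[string]string)
	}
	annotations[key] = value
	return annotations
}

func main() {
	var annotations map[string]string // nil map

	fmt.Println(isFTEnabled(annotations)) // read on nil map: no panic

	annotations = setAnnotation(annotations, ftAnnotationKey, "true")
	fmt.Println(isFTEnabled(annotations))
}
```

The read path needs no guard, while the write path allocates the map first. This is why a read-only helper can drop the nil check safely.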

ray-operator/controllers/ray/raycluster_controller.go (outdated)
pod.Spec.Containers[common.RayContainerIndex].Command = []string{"/bin/bash", "-lc", "--"}
pod.Spec.Containers[common.RayContainerIndex].Args = []string{
"python -c " +
"\"from ray._private.gcs_utils import cleanup_redis_storage; " +

@architkulkarni (Contributor) commented:


It's unfortunate that we have to use a private API for this. Do you know if we test this codepath in the KubeRay repo against the Ray nightly wheels? That would help us catch the error if the private function changes its behavior or name.

@kevin85421 (Member, Author) commented:

Opened a good first issue: #1422

if len(redisCleanupJobs.Items) != 0 {
// Check whether the Redis cleanup Job has been completed.
redisCleanupJob := redisCleanupJobs.Items[0]
if redisCleanupJob.Status.Succeeded > 0 {

@architkulkarni (Contributor) commented:


Should we do something if the status is Failed? (Maybe log a message at least?)

@kevin85421 (Member, Author) commented:

Updated in aca1b0a.
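
As an illustration of the question raised in this thread, the following is a minimal sketch of distinguishing a succeeded, failed, and still-running cleanup Job. It is not the code from aca1b0a. The `jobStatus` struct is a hypothetical stand-in for the `Succeeded` and `Failed` counters of a Kubernetes Job status so that the example compiles without client libraries, and `classifyCleanup` is an invented name.

```go
package main

import "fmt"

// jobStatus is a hypothetical stand-in for the Succeeded and Failed Pod
// counters found in a Kubernetes Job status.
type jobStatus struct {
	Succeeded int32
	Failed    int32
}

type cleanupState string

const (
	cleanupSucceeded cleanupState = "succeeded"
	cleanupFailed    cleanupState = "failed"
	cleanupRunning   cleanupState = "running"
)

// classifyCleanup (hypothetical helper) maps the counters to a coarse state.
// Success takes precedence: one succeeded Pod means the cleanup ran.
func classifyCleanup(s jobStatus) cleanupState {
	switch {
	case s.Succeeded > 0:
		return cleanupSucceeded
	case s.Failed > 0:
		return cleanupFailed
	default:
		return cleanupRunning
	}
}

func main() {
	statuses := []jobStatus{{Succeeded: 1}, {Failed: 1}, {}}
	for _, s := range statuses {
		switch classifyCleanup(s) {
		case cleanupSucceeded:
			fmt.Println("Redis cleanup Job succeeded; the finalizer can be removed")
		case cleanupFailed:
			// At minimum, surface the failure instead of silently waiting.
			fmt.Println("Redis cleanup Job reported failed Pods; manual cleanup may be needed")
		default:
			fmt.Println("Redis cleanup Job still running; requeue and check again")
		}
	}
}
```

In the Kubernetes Job API, the failed counter tracks Pods that reached the Failed phase, so a nonzero value does not by itself mean the Job has exhausted its retries. A stricter check would look at the Job's conditions as well.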

Co-authored-by: Archit Kulkarni <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
@kevin85421 kevin85421 merged commit 72ba3a3 into ray-project:master Sep 15, 2023
15 checks passed
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…deleted (ray-project#1412)

Clean up Redis once a GCS FT-Enabled RayCluster is deleted.
architkulkarni added a commit that referenced this pull request Oct 2, 2023
My understanding is that the RayCluster sample YAML test framework only adds RayCluster CRs, but doesn't add the other resources in the sample YAML file (for example the external redis deployment). In the case of the external redis sample YAML, the sample YAML test started failing after #1412 and my tentative hypothesis is that the cleanup job added by the PR hangs if there's no external redis.

For now, we should merge this PR to unbreak CI. Later, we can decide whether to properly support an end-to-end external redis test.

Related issue number
Closes #1459

Signed-off-by: Archit Kulkarni <[email protected]>
kevin85421 pushed a commit to kevin85421/kuberay that referenced this pull request Oct 17, 2023
architkulkarni pushed a commit to ray-project/ray that referenced this pull request Oct 17, 2023
Add a section to explain ray-project/kuberay#1412.
---------

Signed-off-by: Kai-Hsun Chen <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Hongchao Deng <[email protected]>
Successfully merging this pull request may close these issues.

[Bug] Deleting RayService does not clear Redis cache
2 participants