
[Bug] Deleting RayService does not clear Redis cache #1286

Closed
1 of 2 tasks
Tracked by #1033
smit-kiri opened this issue Aug 2, 2023 · 11 comments · Fixed by #1412
Labels: bug, gcs ft, rayservice

Comments

@smit-kiri

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When deleting a RayService that has GCS fault tolerance enabled with the kubectl delete rayservice xxxx command, the Redis cache isn't cleared. So if we later deploy a new RayService with a different config, the older RayService is restored, ignoring the current config.

Reproduction script

Deploy any RayService with RAY_REDIS_ADDRESS set. Delete the RayService using kubectl delete rayservice rayservice_sample.

Change serveConfigV2 to completely new deployments / applications, apply the RayService with the same RAY_REDIS_ADDRESS, and you'll notice the old RayService being deployed.

Anything else

This is a slight inconvenience: we only delete and re-create the RayService in a dev environment for testing purposes, but because of this issue we cannot use Redis there. To keep using Redis, we need to reboot the Redis node whenever we delete the RayService so the cache is cleared.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@smit-kiri
Author

This also leads to a situation where the Redis database memory usage keeps increasing as we update our RayService. Ideally, when KubeRay switches traffic over to the new RayCluster and deletes the old one, it should also clear the GCS cache for the old cluster.

[Screenshot attached (2023-08-17): Redis memory usage]
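A minimal redis-py sketch for inspecting what gets left behind; the Redis address and the blanket key scan are placeholders for illustration, not details from this issue:

```python
# Hypothetical diagnostic: list keys still stored in the external Redis and
# report overall memory usage. Connection details are placeholders.
import redis

r = redis.Redis(host="my-redis", port=6379)

for key in r.scan_iter(match="*"):
    # Each RayCluster's GCS state lives under its external storage namespace,
    # so stale namespaces from deleted clusters show up here.
    print(key, r.type(key), r.memory_usage(key))

print("used_memory_human:", r.info("memory")["used_memory_human"])
```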

@JoshKarpel
Contributor

This also leads to a situation where the Redis database memory usage keeps increasing as we update our RayService. Ideally, when KubeRay switches traffic over to the new RayCluster and deletes the old one, it should also clear the GCS cache for the old cluster.

Big plus one on that from me - this seems like it will be a common problem across all users of KubeRay + GCS FT that everyone will otherwise have to solve themselves.

Redis key expiry might work here too, if the GCS key has a (long) expiration that is refreshed regularly (by the head node maybe?).
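A minimal sketch of that idea with redis-py, assuming the GCS state sits under a single key named after the external storage namespace; both the key name and the TTL below are placeholders:

```python
# Put a long, refreshable expiration on the (assumed) GCS storage key.
import redis

r = redis.Redis(host="my-redis", port=6379)   # placeholder address
gcs_key = "my-external-storage-namespace"     # placeholder key name

ONE_WEEK = 7 * 24 * 3600
r.expire(gcs_key, ONE_WEEK)   # dropped automatically if nothing refreshes it
print("TTL (s):", r.ttl(gcs_key))
```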

@kevin85421 kevin85421 self-assigned this Aug 23, 2023
@kevin85421
Member

Change serveConfigV2 to completely new deployments / applications, apply the RayService with the same RAY_REDIS_ADDRESS, and you'll notice the old RayService being deployed.

Is this also related to using the same value of ray.io/external-storage-namespace for both the old and new RayService custom resources?

#1286 (comment)

cc @iycheng @edoakes

Who should take responsibility for clearing the Redis cache: KubeRay, users, or Ray? Thanks!

@smit-kiri
Author

Is this also related to using the same value of ray.io/external-storage-namespace for both the old and new RayService custom resources?

Yes, but even if you change the namespace, the data in the old namespace does not go away. The memory usage just keeps increasing.

@kevin85421
Member

I discussed this with @iycheng from the Ray Core team. Ray provides a private util function, cleanup_redis_storage, to delete the storage namespace in Redis. However, it cannot fully delete the storage namespace if the GCS process on the head Pod is still running. We discussed some possible solutions:

  1. KubeRay sends requests to delete the storage namespace in Redis: This might not be effective, as KubeRay might lack access to Redis.
  2. KubeRay submits a job to the RayCluster to call the function cleanup_redis_storage: While this approach can remove some data from the storage namespace, it can't completely delete the namespace because GCS is still running.
  3. Kill the GCS process and call cleanup_redis_storage: If everything goes well, the storage namespace can be fully deleted. However, the Ray head Pod will crash and restart if it cannot connect to GCS for more than RAY_gcs_rpc_server_reconnect_timeout_s seconds (60 by default). Typically, the cleanup process is pretty fast, but we still cannot guarantee that the storage namespace is always deleted.
  4. Pod preStop hook: In my understanding, the hook is triggered before the Pod receives the TERM signal, and there seems to be a 30-second timeout for the hook. In addition, the hook needs to know whether it was triggered by an accidental crash or an intentional cluster deletion. For the former, we should not clean up Redis; for the latter, we need to clean up Redis.
  5. Create a Kubernetes Job for the RayCluster to clean up Redis: This seems to be the best KubeRay-side solution, although it requires creating an additional Kubernetes Job whose lifecycle is asynchronous with the Pods in the Ray cluster.
  6. Update Ray Core: Ideally, when cleanup_redis_storage is called to clean up the storage namespace, GCS should stop writing data to the external Redis. However, this does not seem easy to implement.

My current thought is to implement "Create a Kubernetes Job for the RayCluster to clean up Redis". To elaborate,

  • Add a finalizer to the RayCluster CR if GCS FT is enabled.
  • If KubeRay receives the CR deletion event, delete all Pods (head / worker) belonging to the RayCluster.
  • After all Pods are gone, create a Kubernetes Job to clean up the storage namespace in Redis (a sketch of what that Job could run follows below).
  • If the job succeeds, remove the finalizer. Otherwise, leave the RayCluster there.
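For illustration, a sketch of what such a cleanup Job could run, built around the private cleanup_redis_storage util mentioned above. Since the helper is private, its module path and signature can change between Ray versions, and the environment variables here are placeholders the Job template would have to fill in:

```python
# Hypothetical entrypoint for the Redis-cleanup Kubernetes Job.
import os

# Private Ray util (see python/ray/_private/gcs_utils.py); the exact signature
# may differ across Ray versions, so verify against the version you deploy.
from ray._private.gcs_utils import cleanup_redis_storage

cleanup_redis_storage(
    host=os.environ["REDIS_HOST"],                     # placeholder env vars
    port=int(os.environ.get("REDIS_PORT", "6379")),
    password=os.environ.get("REDIS_PASSWORD", ""),
    use_ssl=os.environ.get("REDIS_USE_SSL", "0") == "1",
    storage_namespace=os.environ["EXTERNAL_STORAGE_NAMESPACE"],
)
```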

cc @smit-kiri @JoshKarpel does this make sense to you? Thanks!

@JoshKarpel
Contributor

The finalizer job does seem like the safest option of those presented.

That being said, another (backup?) option would be to put an expiration (https://redis.io/commands/expire/) on the single Redis key that the GCS state is stored under when it is created, and refresh that duration regularly from the head pod (per this comment https://sourcegraph.com/github.com/ray-project/ray@4788e4fb50a961015c6a23a92ef70facb0f6ba66/-/blob/python/ray/_private/gcs_utils.py?L149-150). The expiration should probably be user-configurable and would be long enough that an ephemeral head pod failure wouldn't let the key actually expire (since it would come back up and refresh the expiration time) - depending on someone's needs it could be an hour, or a day, or a week, or whatever. Something like that would help in cases where the finalizer job fails (or could be the only solution, in principle). This seems elegant to me since it uses only Redis built-ins and doesn't need to answer questions about e.g. retrying the finalizer job on failure.
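To make the refresh half concrete, a rough sketch of a periodic TTL refresh; the key name, TTL, and interval are all placeholders, and in this proposal the loop would run on the head pod:

```python
# Keep re-arming the expiration while the cluster is alive; if the cluster is
# deleted and nothing refreshes the key, Redis drops it on its own.
import threading
import redis

r = redis.Redis(host="my-redis", port=6379)   # placeholder address
GCS_KEY = "my-external-storage-namespace"     # placeholder key name
TTL_SECONDS = 7 * 24 * 3600                   # user-configurable, e.g. a week
REFRESH_EVERY = 15 * 60                       # refresh well before expiry

def refresh_ttl():
    r.expire(GCS_KEY, TTL_SECONDS)
    threading.Timer(REFRESH_EVERY, refresh_ttl).start()

refresh_ttl()
```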

@smit-kiri
Author

I like the finalizer job, but also agree with @JoshKarpel that the key expiration would be a more elegant solution.

@edoakes
Contributor

edoakes commented Sep 5, 2023

@iycheng any concerns from your end on the suggestion to put an expiration on the Redis key?

@scv119

scv119 commented Sep 5, 2023

The finalizer job does seem like the safest option of those presented.

Looks like this is the optimal solution.

@kevin85421
Member

Looks like this is the optimal solution.

@scv119 Do you mean (1) Finalizer Job + Expiration, (2) Expiration only, or (3) Finalizer Job only? If there is no concern about the key expiration, it seems to be the better solution (i.e., option (2)). cc @iycheng

@kevin85421
Member

Follow-up @scv119 @edoakes @iycheng

Do we have any plan for key expiration? Some users just followed up with me about this.
