
[Bug] Deleting RayService does not clear Redis cache #1286

Closed
1 of 2 tasks
Tracked by #1033
smit-kiri opened this issue Aug 2, 2023 · 11 comments · Fixed by #1412
Labels: bug, gcs ft, rayservice

Comments

@smit-kiri

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When deleting a RayService that has GCS fault tolerance enabled with the kubectl delete rayservice xxxx command, the Redis cache isn't cleared. So if we later deploy a new RayService with a different config, the older RayService is restored, ignoring the current config.

Reproduction script

Deploy any RayService with RAY_REDIS_ADDRESS set. Delete the RayService using kubectl delete rayservice rayservice_sample.

Change serveConfigV2 to completely new deployments / applications, apply the RayService with the same RAY_REDIS_ADDRESS, and you'll notice the old RayService being deployed.

Anything else

This is a slight inconvenience: we only delete and re-create the RayService in a dev environment for testing purposes, but because of this issue we cannot use Redis there. To keep using Redis, we need to reboot the Redis node whenever we delete the RayService so the cache is cleared.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@smit-kiri
Author

This also leads to a situation where the Redis database memory usage keeps increasing as we update our RayService. Ideally, when KubeRay switches traffic over to the new RayCluster and deletes the old one, it should also clear the GCS cache for the old cluster.

[Screenshot attached (2023-08-17): Redis memory usage]
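A minimal redis-py sketch for inspecting what gets left behind; the Redis address and the blanket key scan are placeholders for illustration, not details from this issue:

```python
# Hypothetical diagnostic: list keys still stored in the external Redis and
# report overall memory usage. Connection details are placeholders.
import redis

r = redis.Redis(host="my-redis", port=6379)

for key in r.scan_iter(match="*"):
    # Each RayCluster's GCS state lives under its external storage namespace,
    # so stale namespaces from deleted clusters show up here.
    print(key, r.type(key), r.memory_usage(key))

print("used_memory_human:", r.info("memory")["used_memory_human"])
```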

@JoshKarpel
Contributor

This also leads to a situation where the Redis database memory usage keeps increasing as we update our RayService. Ideally, when KubeRay switches traffic over to the new RayCluster and deletes the old one, it should also clear the GCS cache for the old cluster.

Big plus one on that from me - this seems like it will be a common problem across all users of KubeRay + GCS FT that everyone will otherwise have to solve themselves.

Redis key expiry might work here too, if the GCS key has a (long) expiration that is refreshed regularly (by the head node maybe?).
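A minimal sketch of that idea with redis-py, assuming the GCS state sits under a single key named after the external storage namespace; both the key name and the TTL below are placeholders:

```python
# Put a long, refreshable expiration on the (assumed) GCS storage key.
import redis

r = redis.Redis(host="my-redis", port=6379)   # placeholder address
gcs_key = "my-external-storage-namespace"     # placeholder key name

ONE_WEEK = 7 * 24 * 3600
r.expire(gcs_key, ONE_WEEK)   # dropped automatically if nothing refreshes it
print("TTL (s):", r.ttl(gcs_key))
```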

@kevin85421 kevin85421 self-assigned this Aug 23, 2023
@kevin85421
Member

Change serveConfigV2 to completely new deployments / applications, apply the RayService with the same RAY_REDIS_ADDRESS, and you'll notice the old RayService being deployed.

Is this also related to using the same value of ray.io/external-storage-namespace for both the old and new RayService custom resources?

#1286 (comment)

cc @iycheng @edoakes

Who should take responsibility for clearing the Redis cache: KubeRay, users, or Ray? Thanks!

@smit-kiri
Author

Is this also related to using the same value of ray.io/external-storage-namespace for both the old and new RayService custom resources?

Yes, but even if you change the namespace, the data in the old namespace does not go away. The memory usage just keeps increasing.

@kevin85421
Member

I discussed this with @iycheng from the Ray Core team. Ray provides a private util function, cleanup_redis_storage, to delete the storage namespace in Redis. However, it cannot fully delete the storage namespace if the GCS process on the head Pod is still running. We discussed some possible solutions:

  1. KubeRay sends requests to delete the storage namespace in Redis: This might not be effective, as KubeRay might lack access to Redis.
  2. KubeRay submits a job to the RayCluster to call the function cleanup_redis_storage: While this approach can remove some data from the storage namespace, it can't completely delete the namespace because GCS is still running.
  3. Kill the GCS process and call cleanup_redis_storage: If everything goes well, the storage namespace can be fully deleted. However, the Ray head Pod will crash and restart if it cannot connect to GCS for more than RAY_gcs_rpc_server_reconnect_timeout_s seconds (60 by default). Typically, the cleanup process is pretty fast, but we still cannot guarantee that the storage namespace is always deleted.
  4. Pod preStop hook: In my understanding, the hook is triggered before the Pod receives the TERM signal, and there seems to be a 30-second timeout for the hook. In addition, the hook needs to know whether it was triggered by an accidental crash or an intentional cluster deletion. For the former, we should not clean up Redis; for the latter, we need to clean up Redis.
  5. Create a Kubernetes Job for the RayCluster to clean up Redis: This seems to be the best KubeRay-side solution, although it requires creating an additional Kubernetes Job whose lifecycle is asynchronous with the Pods in the Ray cluster.
  6. Update Ray Core: Ideally, when cleanup_redis_storage is called to clean up the storage namespace, GCS should stop writing data to the external Redis. However, this does not seem easy to implement.

My current thought is to implement "Create a Kubernetes Job for the RayCluster to clean up Redis". To elaborate,

  • Add a finalizer to the RayCluster CR if GCS FT is enabled.
  • If KubeRay receives the CR deletion event, delete all Pods (head / worker) belonging to the RayCluster.
  • After all Pods are gone, create a Kubernetes Job to clean up the storage namespace in Redis (a sketch of what that Job could run follows below).
  • If the job succeeds, remove the finalizer. Otherwise, leave the RayCluster there.
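For illustration, a sketch of what such a cleanup Job could run, built around the private cleanup_redis_storage util mentioned above. Since the helper is private, its module path and signature can change between Ray versions, and the environment variables here are placeholders the Job template would have to fill in:

```python
# Hypothetical entrypoint for the Redis-cleanup Kubernetes Job.
import os

# Private Ray util (see python/ray/_private/gcs_utils.py); the exact signature
# may differ across Ray versions, so verify against the version you deploy.
from ray._private.gcs_utils import cleanup_redis_storage

cleanup_redis_storage(
    host=os.environ["REDIS_HOST"],                     # placeholder env vars
    port=int(os.environ.get("REDIS_PORT", "6379")),
    password=os.environ.get("REDIS_PASSWORD", ""),
    use_ssl=os.environ.get("REDIS_USE_SSL", "0") == "1",
    storage_namespace=os.environ["EXTERNAL_STORAGE_NAMESPACE"],
)
```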

cc @smit-kiri @JoshKarpel does this make sense to you? Thanks!

@JoshKarpel
Contributor

The finalizer job does seem like the safest option of those presented.

That being said, another (backup?) option would be to put an expiration (https://redis.io/commands/expire/) on the single Redis key that the GCS state is stored under when it is created, and refresh that duration regularly from the head pod (per this comment https://sourcegraph.com/github.com/ray-project/ray@4788e4fb50a961015c6a23a92ef70facb0f6ba66/-/blob/python/ray/_private/gcs_utils.py?L149-150). The expiration should probably be user-configurable and would be long enough that an ephemeral head pod failure wouldn't let the key actually expire (since it would come back up and refresh the expiration time) - depending on someone's needs it could be an hour, or a day, or a week, or whatever. Something like that would help in cases where the finalizer job fails (or could be the only solution, in principle). This seems elegant to me since it uses only Redis built-ins and doesn't need to answer questions about e.g. retrying the finalizer job on failure.
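To make the refresh half concrete, a rough sketch of a periodic TTL refresh; the key name, TTL, and interval are all placeholders, and in this proposal the loop would run on the head pod:

```python
# Keep re-arming the expiration while the cluster is alive; if the cluster is
# deleted and nothing refreshes the key, Redis drops it on its own.
import threading
import redis

r = redis.Redis(host="my-redis", port=6379)   # placeholder address
GCS_KEY = "my-external-storage-namespace"     # placeholder key name
TTL_SECONDS = 7 * 24 * 3600                   # user-configurable, e.g. a week
REFRESH_EVERY = 15 * 60                       # refresh well before expiry

def refresh_ttl():
    r.expire(GCS_KEY, TTL_SECONDS)
    threading.Timer(REFRESH_EVERY, refresh_ttl).start()

refresh_ttl()
```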

@smit-kiri
Author

I like the finalizer job, but also agree with @JoshKarpel that the key expiration would be a more elegant solution.

@edoakes
Contributor

edoakes commented Sep 5, 2023

@iycheng any concerns from your end on the suggestion to put an expiration on the Redis key?

@scv119

scv119 commented Sep 5, 2023

The finalizer job does seem like the safest option of those presented.

Looks like this is the optimal solution.

@kevin85421
Member

Looks like this is the optimal solution.

@scv119 Do you mean (1) Finalizer Job + Expiration, (2) Expiration only, or (3) Finalizer Job only? If there is no concern about the key expiration, it seems to be the better solution (i.e., option (2)). cc @iycheng

@kevin85421
Member

Follow-up @scv119 @edoakes @iycheng

Do we have any plan for key expiration? Some users just followed up with me about this.
