[Bug] Deleting RayService does not clear Redis cache #1286
Comments
Big plus one on that from me - this seems like it will be a common problem across all users of KubeRay + GCS FT that everyone will otherwise have to solve themselves. Redis key expiry might work here too, if the GCS key has a (long) expiration that is refreshed regularly (by the head node maybe?).
Is it also related to using the same value of the storage namespace? cc @iycheng @edoakes Who should take responsibility for clearing the Redis cache: KubeRay, users, or Ray? Thanks!
Yes, but even if you change the namespace, the data in the old namespace does not go away. The memory usage just keeps increasing.
I discussed this with the Ray Core folks (@iycheng). Ray provides a private util function, `cleanup_redis_storage`, to delete the storage namespace in Redis. However, it cannot fully delete the storage namespace if the GCS process on the head Pod is still running. We discussed some possible solutions.
My current thought is to implement "Create a Kubernetes Job for the RayCluster to clean up Redis" (a rough sketch of such a cleanup script follows this comment).
cc @smit-kiri @JoshKarpel Does this make sense to you? Thanks!
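A minimal sketch of the script such a cleanup Job could run, based on the private helper `ray._private.gcs_utils.cleanup_redis_storage` mentioned above. The exact import path and signature may vary across Ray versions, and the environment-variable names and connection settings below are placeholders, not values defined by KubeRay.

```python
# Hypothetical entrypoint for a Redis-cleanup Kubernetes Job. It assumes Ray's
# private helper cleanup_redis_storage is importable; the import path and
# signature may differ between Ray versions. Host, port, password, and the
# storage namespace are supplied via placeholder environment variables.
import os

from ray._private.gcs_utils import cleanup_redis_storage  # private Ray API


def main() -> None:
    host = os.environ["REDIS_HOST"]                      # e.g. a Redis Service DNS name
    port = int(os.environ.get("REDIS_PORT", "6379"))
    password = os.environ.get("REDIS_PASSWORD", "")
    # The external storage namespace the deleted RayCluster wrote its GCS state under.
    namespace = os.environ["RAY_EXTERNAL_STORAGE_NAMESPACE"]

    # This should only run after the head Pod (and its GCS process) has terminated,
    # otherwise the namespace cannot be fully deleted (see the comment above).
    cleanup_redis_storage(
        host=host,
        port=port,
        password=password,
        use_ssl=False,
        storage_namespace=namespace,
    )


if __name__ == "__main__":
    main()
```

Wired up behind a finalizer, a Job like this would run once per RayCluster deletion; how to retry it on failure is the open question raised in the next comment.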
The finalizer job does seem like the safest option of those presented. That being said, another (backup?) option would be to put an expiration (https://redis.io/commands/expire/) on the single Redis key that the GCS state is stored under when it is created, and refresh that duration regularly from the head pod (per this comment https://sourcegraph.com/github.com/ray-project/ray@4788e4fb50a961015c6a23a92ef70facb0f6ba66/-/blob/python/ray/_private/gcs_utils.py?L149-150). The expiration should probably be user-configurable and would be long enough that an ephemeral head pod failure wouldn't let the key actually expire (since it would come back up and refresh the expiration time) - depending on someone's needs it could be an hour, or a day, or a week, or whatever. Something like that would help in cases where the finalizer job fails (or could be the only solution, in principle). This seems elegant to me since it uses only Redis built-ins and doesn't need to answer questions about e.g. retrying the finalizer job on failure.
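A minimal sketch of this key-expiration idea using redis-py, under the assumption that the GCS state lives under a single Redis key. The key name, TTL, and refresh interval below are illustrative placeholders, not values taken from Ray.

```python
# Minimal sketch of the key-expiration idea using redis-py.
# GCS_STATE_KEY, TTL_SECONDS, and REFRESH_INTERVAL are placeholders; the real
# key name and sensible durations depend on the Ray / deployment setup.
import time

import redis

GCS_STATE_KEY = "example-gcs-storage-namespace"  # placeholder key name
TTL_SECONDS = 24 * 60 * 60        # long enough to survive head-pod restarts
REFRESH_INTERVAL = 10 * 60        # refresh well before the TTL elapses


def refresh_expiration_forever(client: redis.Redis) -> None:
    """Keep pushing the key's expiration forward while the head node is alive.

    If the head Pod disappears for good, the refresh stops and Redis drops the
    key once the TTL runs out, so stale GCS state cleans itself up.
    """
    while True:
        client.expire(GCS_STATE_KEY, TTL_SECONDS)
        time.sleep(REFRESH_INTERVAL)


if __name__ == "__main__":
    refresh_expiration_forever(redis.Redis(host="redis", port=6379))
```

As noted above, the appeal is that only the Redis EXPIRE built-in is involved: if the refresher stops for good, the key simply ages out on its own.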
I like the finalizer job, but also agree with @JoshKarpel that the key expiration would be a more elegant solution.
@iycheng any concerns from your end on the suggestion to put an expiration on the Redis key? |
Looks like this is the optimal solution.
@scv119 Do you mean (1) Finalizer Job + Expiration or (2) Expiration or (3) Finalizer Job? If there is no concern about the key expiration, it seems to be a better solution (i.e., (2)). cc @iycheng |
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
When deleting a RayService with GCS fault tolerance using the `kubectl delete rayservice xxxx` command, the Redis cache isn't cleared. So if we deploy a new RayService later with a different config, the older RayService is restored, ignoring the current config.
Reproduction script
1. Deploy any RayService with `RAY_REDIS_ADDRESS` set.
2. Delete the RayService using `kubectl delete rayservice rayservice_sample`.
3. Change the `serveConfigV2` with completely new deployments / applications, apply the RayService with the same `RAY_REDIS_ADDRESS`, and you'll notice the old RayService being deployed.
Anything else
This is a slight inconvenience, since we're only deleting and re-creating the RayService in a dev environment for testing purposes, and we cannot use Redis there as a result. To keep using Redis, we need to reboot the Redis node whenever we delete the RayService so the cache is cleared.
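As a lighter-weight workaround than rebooting the Redis node, the stale state could in principle be deleted directly. Below is a minimal sketch with redis-py, assuming the old RayService's storage namespace is known and that its keys share that namespace as a prefix; the prefix-based layout is an assumption for illustration, not confirmed Ray behavior.

```python
# Minimal sketch of clearing stale GCS state without rebooting Redis, assuming
# the old RayService's external storage namespace is known and its keys share
# that namespace as a prefix (the prefix below is a placeholder).
import redis


def delete_namespace_keys(client: redis.Redis, namespace: str) -> int:
    """Delete all keys whose names start with the given namespace prefix."""
    deleted = 0
    # SCAN is incremental, so this avoids blocking Redis the way KEYS would.
    for key in client.scan_iter(match=f"{namespace}*"):
        client.delete(key)
        deleted += 1
    return deleted


if __name__ == "__main__":
    client = redis.Redis(host="redis", port=6379)
    print(delete_namespace_keys(client, "old-rayservice-namespace"))
```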
Are you willing to submit a PR?