[Dashboard] The agent.py process leaks memory #29199
Comments
@architkulkarni Are you able to follow up and investigate based on the current info we have?
@vakker Thanks for the details! Are you able to share …
Is this related to #26568?
I think it's possible but unlikely; here it's … Another piece of data here is that @vakker mentioned …
Hey @vakker, how often do you see this? We have another user who reported similar issues, but from their response it seems to not happen all the time. We are trying to reproduce the issue, but no luck so far. If I merge the PR to allow memory profiling of the agent, would you be open to running some commands for us?
@vakker can you share your Tune/RLlib config dictionary?
I'll try to reproduce it with a simple RLlib config, e.g. something along the lines of the sketch below. @architkulkarni But I've only seen it once, when I ran an experiment with a lot of workers. Then I switched off the dashboard.
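A minimal sketch of what I have in mind (not my actual experiment; the environment, worker count, and stop criterion are placeholders):

```python
# Minimal repro sketch, not the original experiment: a plain Tune + RLlib PPO run
# with many rollout workers, left running long enough to watch agent.py memory.
# "CartPole-v1", num_workers=32, and the stop criterion are placeholder choices.
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    config={
        "env": "CartPole-v1",
        "num_workers": 32,   # many workers -> many periodic RPCs to the agent
        "framework": "torch",
    },
    stop={"timesteps_total": 50_000_000},  # run for hours so the leak has time to show
)
```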
@vakker Thanks for the additional info. In case you remember the date and time of the session that failed, you might still be able to find …
Thank you, @vakker! And thanks for taking the time to create a repro. We are also trying hard to find a repro script ourselves (but no luck so far). A couple of additional questions: …
Okay, I managed to reproduce this. I can send the repro info on Slack instead of posting it all here.
Hey @vakker, can you try running Ray with the env var RAY_metrics_report_interval_ms=30000 and see if you can still repro this? In our repro this fixed the issue, and I am making a PR to fix it. It'd be really great if you could double-check and verify this.
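For reference, a sketch of one way to apply that setting from a Python driver (assuming a cluster started locally by ray.init; on a multi-node cluster the variable would need to be exported before `ray start` on each node):

```python
# Sketch: apply the suggested setting in a Python driver. The variable has to be
# in the environment before Ray starts, so it is set before ray.init() here.
# On a multi-node cluster, export it in the shell before `ray start` on each node.
import os

os.environ["RAY_metrics_report_interval_ms"] = "30000"  # value suggested above

import ray

ray.init()  # the locally started cluster picks up the env var
```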
Sure, I submitted a job yesterday; it's been running for 13.5 h now, and the memory usage of the agent … Correct me if I'm wrong, but changing the interval only changes how fast the process leaks memory; it doesn't address the leak itself?
State: short-term mitigation PR merged.
Yeah @vakker, you are right. I think the PR I merged will drastically reduce the memory leak, but it is not the fundamental fix, and we have follow-up work that will be merged by the next release (2.2, around December). I'd also highly encourage you to try this commit (218f9ba) to see the pace of the leak (it should be even slower).
The memory is leaked from gRPC in the agent, and this seems to happen when there are lots of workers (all workers periodically send RPCs to the agent at the given interval). The initial theory is that when there are too many requests, they stay pending (because the agent CPU is at 100%) and are not GC'ed, since gRPC doesn't GC a request until it has been replied to. I will do additional investigation and have some fixes in mind to reduce the CPU usage of the agent.
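If you want to watch whether the growth actually slows down after the mitigation, here is a rough monitoring sketch (not part of Ray; it assumes the agent shows up in the process table with dashboard/agent.py in its command line):

```python
# Sketch: periodically print the RSS of dashboard agent processes using psutil,
# to compare the pace of the leak before and after the mitigation.
import time
import psutil

def agent_rss_mb():
    """Return {pid: RSS in MB} for processes whose cmdline mentions dashboard/agent.py."""
    out = {}
    for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        mem = proc.info["memory_info"]
        if "dashboard/agent.py" in cmdline and mem is not None:
            out[proc.info["pid"]] = mem.rss / 1e6
    return out

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), agent_rss_mb())
        time.sleep(60)  # sample once a minute
```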
Should be fixed in master!
The change will be included in Ray 2.2 |
What happened + What you expected to happen
I raised this issue on the discussion forum.
I’m running some RLlib + Tune workloads on multiple nodes.
After a day or so I’m getting:
That doesn’t look healthy.
I can start the worker with
ray start <...> --include-dashboard false
(according to the docs), but this might indicate some deeper issue.
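For completeness, the same workaround when starting Ray from Python rather than with `ray start` (include_dashboard is the corresponding ray.init argument):

```python
# Sketch: start Ray without the dashboard; include_dashboard is the ray.init
# counterpart of the `--include-dashboard false` flag mentioned above.
import ray

ray.init(include_dashboard=False)
```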
Versions / Dependencies
Ray: 9f1ea30
Python: 3.9.7
Ubuntu: 20.04 (Docker base: nvidia/cuda:11.6.1-cudnn8-devel-ubuntu20.04)

Reproduction script
I'm not sure how to reproduce this with a small script, I'm running quite large experiments at the moment on a Slurm cluster.
Currently the simplest setup that shows this issue is the following:
- the trainer (ray::PPO.train()) on one node
- the rollout workers (ray::RolloutWorker) and the Ray head on another node

The OOM happens on the CPU node.
I think a simple Tune + RLlib setup should be able to reproduce this on a single node; I don't have that much custom stuff in my code. But maybe it's the node-to-node communication that causes the leak.
Issue Severity
Medium: It is a significant difficulty but I can work around it.