[Dashboard] The agent.py process leaks memory #29199

Closed
vakker opened this issue Oct 8, 2022 · 15 comments · Fixed by #29451
Labels
bug Something that is supposed to be working; but isn't dashboard Issues specific to the Ray Dashboard P1 Issue that should be fixed within a few weeks

Comments

@vakker
Contributor

vakker commented Oct 8, 2022

What happened + What you expected to happen

I raised this issue on the discussion forum.

I’m running some RLlib + Tune workloads on multiple nodes.

After a day or so I’m getting:

(_PackActor pid=233066, ip=10.10.4.2) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ***** is used (357.77 / 376.36 GB). The top 10 memory consu
mers are:                  
(_PackActor pid=233066, ip=10.10.4.2)   
(_PackActor pid=233066, ip=10.10.4.2) PID       MEM     COMMAND
(_PackActor pid=233066, ip=10.10.4.2) 256207    127.58GiB       /usr/local/bin/python3 -u /usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-ad
(_PackActor pid=233066, ip=10.10.4.2) 256148    3.14GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/
(_PackActor pid=233066, ip=10.10.4.2) 256060    1.71GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20
(_PackActor pid=233066, ip=10.10.4.2) 256351    1.51GiB python3 -u scripts/train.py --logdir logs/hyperion/exp-set-04/exp-01 --exp-name exp-gnn/exp-set-04/e
(_PackActor pid=233066, ip=10.10.4.2) 260405    1.25GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 258394    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 260391    1.24GiB ray::RolloutWorker                                                                                                                                         
(_PackActor pid=233066, ip=10.10.4.2) 258389    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 260392    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 258382    1.24GiB ray::RolloutWorker  

That doesn’t look healthy.
I can start the worker with ray start <...> --include-dashboard false (according to the doc), but this might indicate a deeper issue.
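For a single-node run the same workaround can be applied at init time; a minimal sketch, assuming the job is launched through ray.init rather than ray start:

```python
import ray

# Hypothetical single-node launch with the dashboard disabled.
# Note: as discussed later in this thread, the dashboard agent process
# is still started even when the dashboard itself is turned off.
ray.init(include_dashboard=False)
```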

Versions / Dependencies

Ray: 9f1ea30
Python: 3.9.7
Ubuntu: 20.04 (Docker base: nvidia/cuda:11.6.1-cudnn8-devel-ubuntu20.04)

Reproduction script

I'm not sure how to reproduce this with a small script; I'm running quite large experiments on a Slurm cluster at the moment.
Currently the simplest setup that shows this issue is the following:

  1. 1 GPU node that runs the PPO trainers (ray::PPO.train())
  2. 1 CPU node that only runs environment samplers (ray::RolloutWorker) and the Ray head.

The OOM happens on the CPU node.
I think a simple Tune + RLlib setup should be able to reproduce this on a single node, since I don't have much custom code. But maybe it's the node-to-node communication that causes the leak.
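For reference, a minimal single-node Tune + RLlib setup along these lines might look like the sketch below (a hypothetical config against the Ray 2.x API of the time, not the reporter's actual experiment):

```python
from ray import tune

# Hypothetical minimal repro: many rollout workers reporting to a single
# node's dashboard agent, loosely mirroring the CPU-node setup above.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v1",
        "framework": "torch",
        "num_workers": 32,   # lots of ray::RolloutWorker processes
        "num_gpus": 0,
    },
    stop={"timesteps_total": 50_000_000},
)
```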

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@vakker vakker added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 8, 2022
@scottsun94 scottsun94 added the dashboard Issues specific to the Ray Dashboard label Oct 11, 2022
@scottsun94
Contributor

@architkulkarni Are you able to follow up and investigate based on the current info we have?

@architkulkarni
Contributor

@vakker Thanks for the details! Are you able to share dashboard.log and dashboard_agent.log from /tmp/ray/session_latest/logs on the node that's OOMing? That might help us investigate.

@alanwguo
Contributor

Is this related to #26568 ?

@architkulkarni
Contributor

I think it's possible but unlikely: here it's dashboard_agent.py that is running out of memory, whereas in the other issue it's dashboard.py.

Another piece of data here is that @vakker mentioned runtime_env isn't being used anywhere in the code (unless it's being used internally by Tune or RLlib somehow; the dashboard_agent.log logs will confirm this).

@rkooo567 rkooo567 self-assigned this Oct 14, 2022
@rkooo567 rkooo567 added P0 Issues that should be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 14, 2022
@rkooo567 rkooo567 added this to the General Observability milestone Oct 14, 2022
@rkooo567
Contributor

Hey @vakker, how often do you see this? We have another user who reported similar issues, but from his response it seems to not happen all the time.

We are trying to reproduce the issue, but no luck so far. If I merge the PR to allow memory profiling of an agent, would you be open to running some commands I ask you to?

@gjoliver
Member

@vakker can you share your Tune/RLlib config dictionary?
We are trying to reproduce this on our end.

@vakker
Contributor Author

vakker commented Oct 14, 2022

I'll try to reproduce it with a simple RLlib config, e.g. rllib train -f atari-ppo.yaml.
Let me get back to you in a bit on the findings.

@architkulkarni But I've only seen it once, when I ran an experiment with a lot of workers. Then I switched off the dashboard.
@gjoliver I can share my Slurm scripts; that should be enough to reproduce.

@architkulkarni
Contributor

@vakker Thanks for the additional info. In case you remember the date and time of the session that failed, you might still be able to find dashboard_agent.log in /tmp/ray/session_2022_<...>/logs.

@rkooo567
Contributor

Thank you, @vakker, and thanks for taking the time to create a repro. We are also trying hard to find a repro script ourselves (but no luck so far...).

A couple of additional questions:

  1. When you turn off the dashboard (include_dashboard=False), do you no longer see this? I am asking because Ray still starts the agent even when include_dashboard=False, so that would mean some dashboard-related operation is the root cause.
  2. How many rollout workers do you have?

@vakker
Contributor Author

vakker commented Oct 17, 2022

Okay, I managed to reproduce this. I can send the repro info on Slack instead of posting it all here.

@rkooo567
Contributor

Hey @vakker, can you try running Ray with the env var RAY_metrics_report_interval_ms=30000 and see if you can still repro this? In our repro this fixed the issue, and I am making a PR to fix it. It would be really great if you could double-check and verify this.
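For reference, one way to set that variable (a sketch assuming Ray is started from the same Python process; on a Slurm/ray start setup the variable would instead be exported in the shell before ray start on each node):

```python
import os

# The variable must be visible to the Ray processes, so set it before
# starting Ray (or export it in the shell before `ray start`).
os.environ["RAY_metrics_report_interval_ms"] = "30000"

import ray

ray.init()
```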

@vakker
Contributor Author

vakker commented Oct 20, 2022

Sure. I submitted a job yesterday; it's been running for 13.5 h now, and the memory usage of agent.py is ~10 GB, which I think is better than before.

Correct me if I'm wrong, but changing the interval only changes how fast the process leaks memory; it doesn't address the leak itself?

@rkooo567 rkooo567 reopened this Oct 20, 2022
@rkooo567 rkooo567 removed the release-blocker P0 Issue that blocks the release label Oct 20, 2022
@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks and removed P0 Issues that should be fixed in short order labels Oct 20, 2022
@rkooo567
Contributor

rkooo567 commented Oct 20, 2022

State: short-term mitigation PR merged.

Yeah @vakker, you are right. I think the PR I merged will drastically reduce the memory leak, but it is not the fundamental fix, and we have follow-up work that we will merge by the next release (2.2, around December). I'd also highly encourage you to try this commit, 218f9ba, to see the pace of the leak (it should be even slower).

The memory is leaked by gRPC in the agent, and this seems to happen when there are lots of workers (all workers periodically send RPCs to the agent at the given interval). The initial theory is that when there are too many requests, they stay pending (because the agent's CPU is at 100%) and are not GC'ed, since gRPC doesn't release a request until it has been replied to.

I will do additional investigation and have some fixes to reduce the CPU usage of the agent.
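For intuition only, a toy asyncio sketch (not Ray's actual agent code) of that theory: when requests arrive faster than an overloaded handler can reply, the unreplied payloads stay referenced and the backlog, and therefore memory, keeps growing.

```python
import asyncio

pending = []  # stand-in for unreplied requests held by an overloaded server

async def handle(payload):
    pending.append(payload)        # held until a reply is produced
    await asyncio.sleep(1.0)       # overloaded handler: ~1 s per reply
    pending.remove(payload)        # only now can the payload be collected

async def main():
    tasks = []
    for _ in range(200):
        tasks.append(asyncio.create_task(handle(b"x" * 100_000)))
        await asyncio.sleep(0.01)  # a new request every 10 ms
    print(f"unreplied requests still held in memory: {len(pending)}")
    await asyncio.gather(*tasks)

asyncio.run(main())
```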

@rkooo567
Contributor

Should be fixed in master!

@rkooo567
Contributor

The change will be included in Ray 2.2

7 participants