[Dashboard] The agent.py process leaks memory #29199

Closed
vakker opened this issue Oct 8, 2022 · 15 comments · Fixed by #29451
Labels
bug Something that is supposed to be working; but isn't dashboard Issues specific to the Ray Dashboard P1 Issue that should be fixed within a few weeks

Comments

@vakker
Contributor

vakker commented Oct 8, 2022

What happened + What you expected to happen

I raised this issue on the discussion forum.

I’m running some RLlib + Tune workloads on multiple nodes.

After a day or so I’m getting:

(_PackActor pid=233066, ip=10.10.4.2) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ***** is used (357.77 / 376.36 GB). The top 10 memory consu
mers are:                  
(_PackActor pid=233066, ip=10.10.4.2)   
(_PackActor pid=233066, ip=10.10.4.2) PID       MEM     COMMAND
(_PackActor pid=233066, ip=10.10.4.2) 256207    127.58GiB       /usr/local/bin/python3 -u /usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-ad
(_PackActor pid=233066, ip=10.10.4.2) 256148    3.14GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/
(_PackActor pid=233066, ip=10.10.4.2) 256060    1.71GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20
(_PackActor pid=233066, ip=10.10.4.2) 256351    1.51GiB python3 -u scripts/train.py --logdir logs/hyperion/exp-set-04/exp-01 --exp-name exp-gnn/exp-set-04/e
(_PackActor pid=233066, ip=10.10.4.2) 260405    1.25GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 258394    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 260391    1.24GiB ray::RolloutWorker                                                                                                                                         
(_PackActor pid=233066, ip=10.10.4.2) 258389    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 260392    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 258382    1.24GiB ray::RolloutWorker  

That doesn’t look healthy.
I can start the worker with ray start <...> --include-dashboard false (according to the doc), but this might indicate a deeper issue.
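For a single-node run the same workaround can be applied at init time; a minimal sketch, assuming the job is launched through ray.init rather than ray start:

```python
import ray

# Hypothetical single-node launch with the dashboard disabled.
# Note: as discussed later in this thread, the dashboard agent process
# is still started even when the dashboard itself is turned off.
ray.init(include_dashboard=False)
```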

Versions / Dependencies

Ray: 9f1ea30
Python: 3.9.7
Ubuntu: 20.04 (Docker base: nvidia/cuda:11.6.1-cudnn8-devel-ubuntu20.04)

Reproduction script

I'm not sure how to reproduce this with a small script; I'm running quite large experiments on a Slurm cluster at the moment.
Currently the simplest setup that shows this issue is the following:

  1. 1 GPU node that runs the PPO trainers (ray::PPO.train())
  2. 1 CPU node that only runs environment samplers (ray::RolloutWorker) and the Ray head.

The OOM happens on the CPU node.
I think a simple Tune + RLlib setup should be able to reproduce this on a single node, since I don't have much custom code. But maybe it's the node-to-node communication that causes the leak.
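For reference, a minimal single-node Tune + RLlib setup along these lines might look like the sketch below (a hypothetical config against the Ray 2.x API of the time, not the reporter's actual experiment):

```python
from ray import tune

# Hypothetical minimal repro: many rollout workers reporting to a single
# node's dashboard agent, loosely mirroring the CPU-node setup above.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v1",
        "framework": "torch",
        "num_workers": 32,   # lots of ray::RolloutWorker processes
        "num_gpus": 0,
    },
    stop={"timesteps_total": 50_000_000},
)
```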

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@vakker vakker added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 8, 2022
@scottsun94 scottsun94 added the dashboard Issues specific to the Ray Dashboard label Oct 11, 2022
@scottsun94
Contributor

@architkulkarni Are you able to follow up and investigate based on the current info we have?

@architkulkarni
Contributor

@vakker Thanks for the details! Are you able to share dashboard.log and dashboard_agent.log from /tmp/ray/session_latest/logs on the node that's OOMing? That might help us investigate.

@alanwguo
Contributor

Is this related to #26568 ?

@architkulkarni
Contributor

I think it's possible but unlikely: here it's dashboard_agent.py that is running out of memory, whereas in the other issue it's dashboard.py.

Another piece of data here is that @vakker mentioned runtime_env isn't being used anywhere in the code (unless it's being used internally by Tune or RLlib somehow; the dashboard_agent.log logs will confirm this).

@rkooo567 rkooo567 self-assigned this Oct 14, 2022
@rkooo567 rkooo567 added P0 Issues that should be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 14, 2022
@rkooo567 rkooo567 added this to the General Observability milestone Oct 14, 2022
@rkooo567
Contributor

Hey @vakker, how often do you see this? We have another user who reported similar issues, but from his response it seems to not happen all the time.

We are trying to reproduce the issue, but no luck so far. If I merge the PR to allow memory profiling of an agent, would you be open to running some commands I ask you to?

@gjoliver
Member

@vakker can you share your Tune/RLlib config dictionary?
We are trying to reproduce this on our end.

@vakker
Contributor Author

vakker commented Oct 14, 2022

I'll try to reproduce it with a simple RLlib config, e.g. rllib train -f atari-ppo.yaml.
Let me get back to you in a bit on the findings.

@architkulkarni But I've only seen it once, when I ran an experiment with a lot of workers. Then I switched off the dashboard.
@gjoliver I can share my Slurm scripts; that should be enough to reproduce.

@architkulkarni
Contributor

@vakker Thanks for the additional info. In case you remember the date and time of the session that failed, you might still be able to find dashboard_agent.log in /tmp/ray/session_2022_<...>/logs.

@rkooo567
Contributor

Thank you, @vakker, and thanks for taking the time to create a repro. We are also trying hard to find a repro script ourselves (but no luck so far...).

A couple of additional questions:

  1. When you turn off the dashboard (include_dashboard=False), do you no longer see this? I am asking because Ray still starts the agent even when include_dashboard=False, so that would mean some dashboard-related operation is the root cause.
  2. How many rollout workers do you have?

@vakker
Contributor Author

vakker commented Oct 17, 2022

Okay, I managed to reproduce this. I can send the repro info on Slack instead of posting it all here.

@rkooo567
Contributor

Hey @vakker, can you try running Ray with the env var RAY_metrics_report_interval_ms=30000 and see if you can still repro this? In our repro this fixed the issue, and I am making a PR to fix it. It would be really great if you could double-check and verify this.
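For reference, one way to set that variable (a sketch assuming Ray is started from the same Python process; on a Slurm/ray start setup the variable would instead be exported in the shell before ray start on each node):

```python
import os

# The variable must be visible to the Ray processes, so set it before
# starting Ray (or export it in the shell before `ray start`).
os.environ["RAY_metrics_report_interval_ms"] = "30000"

import ray

ray.init()
```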

@vakker
Contributor Author

vakker commented Oct 20, 2022

Sure. I submitted a job yesterday; it's been running for 13.5 h now, and the memory usage of agent.py is ~10 GB, which I think is better than before.

Correct me if I'm wrong, but changing the interval only changes how fast the process leaks memory; it doesn't address the leak itself?

@rkooo567 rkooo567 reopened this Oct 20, 2022
@rkooo567 rkooo567 removed the release-blocker P0 Issue that blocks the release label Oct 20, 2022
@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks and removed P0 Issues that should be fixed in short order labels Oct 20, 2022
@rkooo567
Contributor

rkooo567 commented Oct 20, 2022

State: short-term mitigation PR merged.

Yeah @vakker, you are right. I think the PR I merged will drastically reduce the memory leak, but it is not the fundamental fix, and we have follow-up work that we will merge by the next release (2.2, around December). I'd also highly encourage you to try this commit, 218f9ba, to see the pace of the leak (it should be even slower).

The memory is leaked by gRPC in the agent, and this seems to happen when there are lots of workers (all workers periodically send RPCs to the agent at the given interval). The initial theory is that when there are too many requests, they stay pending (because the agent's CPU is at 100%) and are not GC'ed, since gRPC doesn't release a request until it has been replied to.

I will do additional investigation and have some fixes to reduce the CPU usage of the agent.
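For intuition only, a toy asyncio sketch (not Ray's actual agent code) of that theory: when requests arrive faster than an overloaded handler can reply, the unreplied payloads stay referenced and the backlog, and therefore memory, keeps growing.

```python
import asyncio

pending = []  # stand-in for unreplied requests held by an overloaded server

async def handle(payload):
    pending.append(payload)        # held until a reply is produced
    await asyncio.sleep(1.0)       # overloaded handler: ~1 s per reply
    pending.remove(payload)        # only now can the payload be collected

async def main():
    tasks = []
    for _ in range(200):
        tasks.append(asyncio.create_task(handle(b"x" * 100_000)))
        await asyncio.sleep(0.01)  # a new request every 10 ms
    print(f"unreplied requests still held in memory: {len(pending)}")
    await asyncio.gather(*tasks)

asyncio.run(main())
```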

@rkooo567
Contributor

Should be fixed in master!

@rkooo567
Contributor

The change will be included in Ray 2.2

7 participants