
OOM will cause the ray job to fail #622

Closed
Michaelvll opened this issue Mar 20, 2022 · 10 comments

Labels
bug (Something isn't working), Stale

Comments

@Michaelvll
Collaborator

Michaelvll commented Mar 20, 2022

When the system is about to run out of memory, `ray job submit` and `ray job stop` will fail (submit will report that the task was submitted successfully, but `ray job status` will say it failed; stop will wait forever for the job to be killed), so the user cannot even kill a process.

One way to fix this may be to drop the use of `ray job` entirely and track the PIDs of the tasks ourselves, for example as sketched below.
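
For illustration, a minimal sketch of what "tracking the PIDs ourselves" could look like, assuming we launch the job command directly instead of going through `ray job`. The `JOB_TABLE` path and the `start_job`/`stop_job` helpers are hypothetical names, not SkyPilot's actual job_lib API.

```python
# Hypothetical sketch: manage job processes by PID instead of `ray job`.
# JOB_TABLE, start_job and stop_job are illustrative names, not SkyPilot APIs.
import json
import os
import pathlib
import signal
import subprocess

JOB_TABLE = pathlib.Path('~/.sky/jobs.json').expanduser()


def _load() -> dict:
    return json.loads(JOB_TABLE.read_text()) if JOB_TABLE.exists() else {}


def start_job(job_id: str, command: str) -> int:
    """Launch the command in its own process group and record its PID."""
    proc = subprocess.Popen(command, shell=True, start_new_session=True)
    table = _load()
    table[job_id] = proc.pid
    JOB_TABLE.parent.mkdir(parents=True, exist_ok=True)
    JOB_TABLE.write_text(json.dumps(table))
    return proc.pid


def stop_job(job_id: str) -> None:
    """Kill the job's process group directly; no dependency on Ray being healthy."""
    pid = _load().get(job_id)
    if pid is None:
        return
    try:
        os.killpg(os.getpgid(pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # Process already exited.
```

This avoids relying on the Ray job server being responsive when the node is under memory pressure.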

@github-actions

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label May 13, 2023
@Michaelvll
Collaborator Author

#1218 should mitigate this issue.

@Michaelvll Michaelvll removed the Stale label May 13, 2023
@Michaelvll
Collaborator Author

Michaelvll commented Jun 5, 2023

This should be mitigated by the upgrade to Ray 2.4.0 in #1734, thanks to the OOM prevention feature: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html
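
For reference, a minimal sketch of tuning that behavior when starting Ray on a node: the memory monitor is controlled by the `RAY_memory_usage_threshold` and `RAY_memory_monitor_refresh_ms` environment variables described in the linked docs (the values below are purely illustrative).

```python
# Illustrative only: configure Ray's OOM monitor via env vars before `ray start`.
import os
import subprocess

env = dict(os.environ)
env['RAY_memory_usage_threshold'] = '0.98'    # Kill workers later than the 0.95 default.
# env['RAY_memory_monitor_refresh_ms'] = '0'  # Or disable worker killing entirely.

subprocess.run(['ray', 'start', '--head'], env=env, check=True)
```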

@concretevitamin
Member

It seems that an OOM-ed cluster also fails to respond to sky down.

A 2-node cluster running a DeepSpeed job


Traceback (most recent call last):
  File "/home/ubuntu/.sky/sky_app/sky_job_1", line 499, in <module>
    returncodes = ray.get(futures)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2537, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.31.11.97, ID: 466e58448fea9e5ce4f29e4d3d48b6b1aaa3cfc92e0a6ad808ec1a85) where the task (task ID: 1848be8e198993e734211cb7b1e7a4b5443a330002000000, name=head, rank=0,, pid=27300, memory used=0.08GB) was running was 14.76GB / 15.44GB (0.955438), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: cf8b07c1cb3a1c0115a57d015d5b79e53c2f14b899c26190767b540e) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.31.11.97`. To see the logs of the worker, use `ray logs worker-cf8b07c1cb3a1c0115a57d015d5b79e53c2f14b899c26190767b540e*out -ip 172.31.11.97. Top 10 memory users:
PID     MEM(GB) COMMAND
27671   12.90   /opt/conda/envs/deepspeed/bin/python3.8 -u main.py --local_rank=0 --data_path Dahoas/rm-static Dahoa...
27660   0.19    /opt/conda/envs/deepspeed/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyIxNzIuMzEuMTE...
27413   0.18    /opt/conda/envs/deepspeed/bin/python3.8 /opt/conda/envs/deepspeed/bin/deepspeed --num_gpus 1 main.py...
23104   0.11    /opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor...
27248   0.09    python3 -u /home/ubuntu/.sky/sky_app/sky_job_1
27300   0.08    ray::head, rank=0,
27259   0.08    python3 -u -c import os;from sky.skylet import job_lib, log_lib;job_id = 1 if 1 is not None else job...
27374   0.08    python3 /opt/conda/lib/python3.10/site-packages/sky/skylet/subprocess_daemon.py --parent-pid 27300 -...
25946   0.07    python3 -m sky.skylet.skylet
23264   0.06    /opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/dashboard/agent.py --node-i...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

On the local machine, `sky down` got stuck, and I had to terminate the cluster manually. I assume any autostop/down would not go through either.

@Michaelvll
Collaborator Author

For the `sky down` stuck issue, which cloud are you using? Ray will first try to run `ray stop` on the cluster before it actually starts terminating the instances. We tried to simulate that in our new termination codepath for AWS as well, but it would be nice to see whether that simulation is correct.

@concretevitamin
Member

Was using AWS.

@Michaelvll
Collaborator Author

> Was using AWS.

In that case, it might be worth adding a timeout to the following call:

self.run_on_head(handle, 'ray stop --force')
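
For illustration, a hedged sketch of such a timeout (the `best_effort_ray_stop` helper and the 60-second value are hypothetical, not existing SkyPilot code): run `ray stop --force` over SSH, but never let it block termination.

```python
# Hypothetical sketch: bound the time spent on `ray stop --force` so an
# OOM-stuck head node cannot block cluster termination indefinitely.
import subprocess

RAY_STOP_TIMEOUT_SECONDS = 60  # Illustrative value.


def best_effort_ray_stop(ssh_command_prefix: list) -> None:
    """Run `ray stop --force` over SSH; give up after a bounded wait."""
    try:
        subprocess.run(
            ssh_command_prefix + ['ray', 'stop', '--force'],
            timeout=RAY_STOP_TIMEOUT_SECONDS,
            check=False,  # Best-effort: termination proceeds regardless of the result.
        )
    except subprocess.TimeoutExpired:
        pass  # Head node unresponsive (e.g., OOM); terminate anyway.
```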

@concretevitamin
Member

Can we make the underlying provision lib (`def terminate_instances(region: str, ...`) handle this? It already has knowledge of TAG_RAY_CLUSTER_NAME. Maybe it can query the head first, down/stop it, then handle the workers.

Wdyt @Michaelvll @suquark?
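
For illustration, a rough sketch of that flow, assuming the instances carry the Ray autoscaler tags `ray-cluster-name` and `ray-node-type` (the exact keys behind TAG_RAY_CLUSTER_NAME may differ): query the cluster's instances, split head from workers, optionally run a best-effort `ray stop` on the head, then terminate.

```python
# Hypothetical sketch of terminate_instances() handling head/workers itself.
# Tag keys ('ray-cluster-name', 'ray-node-type') are assumptions based on the
# Ray autoscaler's conventions; pagination is omitted for brevity.
import boto3


def terminate_instances(region: str, cluster_name: str) -> None:
    ec2 = boto3.client('ec2', region_name=region)
    resp = ec2.describe_instances(Filters=[
        {'Name': 'tag:ray-cluster-name', 'Values': [cluster_name]},
        {'Name': 'instance-state-name', 'Values': ['pending', 'running']},
    ])
    head_ids, worker_ids = [], []
    for reservation in resp['Reservations']:
        for inst in reservation['Instances']:
            tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}
            if tags.get('ray-node-type') == 'head':
                head_ids.append(inst['InstanceId'])
            else:
                worker_ids.append(inst['InstanceId'])
    # A best-effort, time-bounded `ray stop` on the head could go here, but
    # termination must not depend on it succeeding.
    if worker_ids:
        ec2.terminate_instances(InstanceIds=worker_ids)
    if head_ids:
        ec2.terminate_instances(InstanceIds=head_ids)
```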


github-actions bot commented Nov 8, 2023

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Nov 8, 2023

This issue was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this as not planned Nov 18, 2023