
OOM will cause the ray job to fail #622

Closed
Michaelvll opened this issue Mar 20, 2022 · 10 comments

Labels
bug (Something isn't working), Stale

Comments

@Michaelvll
Collaborator

Michaelvll commented Mar 20, 2022

When the system is about to run out of memory, `ray job submit` and `ray job stop` will fail (submit will report that the task was submitted successfully, but `ray job status` will say it failed; stop will wait forever for the job to be killed), so the user cannot even kill a process.

One way to fix this may be to drop the use of `ray job` entirely and track the PIDs of the tasks ourselves, for example as sketched below.
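
For illustration, a minimal sketch of what "tracking the PIDs ourselves" could look like, assuming we launch the job command directly instead of going through `ray job`. The `JOB_TABLE` path and the `start_job`/`stop_job` helpers are hypothetical names, not SkyPilot's actual job_lib API.

```python
# Hypothetical sketch: manage job processes by PID instead of `ray job`.
# JOB_TABLE, start_job and stop_job are illustrative names, not SkyPilot APIs.
import json
import os
import pathlib
import signal
import subprocess

JOB_TABLE = pathlib.Path('~/.sky/jobs.json').expanduser()


def _load() -> dict:
    return json.loads(JOB_TABLE.read_text()) if JOB_TABLE.exists() else {}


def start_job(job_id: str, command: str) -> int:
    """Launch the command in its own process group and record its PID."""
    proc = subprocess.Popen(command, shell=True, start_new_session=True)
    table = _load()
    table[job_id] = proc.pid
    JOB_TABLE.parent.mkdir(parents=True, exist_ok=True)
    JOB_TABLE.write_text(json.dumps(table))
    return proc.pid


def stop_job(job_id: str) -> None:
    """Kill the job's process group directly; no dependency on Ray being healthy."""
    pid = _load().get(job_id)
    if pid is None:
        return
    try:
        os.killpg(os.getpgid(pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # Process already exited.
```

This avoids relying on the Ray job server being responsive when the node is under memory pressure.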

@github-actions

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label May 13, 2023
@Michaelvll
Collaborator Author

#1218 should mitigate this issue.

@Michaelvll Michaelvll removed the Stale label May 13, 2023
@Michaelvll
Collaborator Author

Michaelvll commented Jun 5, 2023

This should be mitigated by the upgrade to Ray 2.4.0 in #1734, thanks to the OOM prevention feature: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html
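
For reference, a minimal sketch of tuning that behavior when starting Ray on a node: the memory monitor is controlled by the `RAY_memory_usage_threshold` and `RAY_memory_monitor_refresh_ms` environment variables described in the linked docs (the values below are purely illustrative).

```python
# Illustrative only: configure Ray's OOM monitor via env vars before `ray start`.
import os
import subprocess

env = dict(os.environ)
env['RAY_memory_usage_threshold'] = '0.98'    # Kill workers later than the 0.95 default.
# env['RAY_memory_monitor_refresh_ms'] = '0'  # Or disable worker killing entirely.

subprocess.run(['ray', 'start', '--head'], env=env, check=True)
```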

@concretevitamin
Member

It seems that an OOM-ed cluster also fails to respond to sky down.

A 2-node cluster running a DeepSpeed job


Traceback (most recent call last):
  File "/home/ubuntu/.sky/sky_app/sky_job_1", line 499, in <module>
    returncodes = ray.get(futures)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2537, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.31.11.97, ID: 466e58448fea9e5ce4f29e4d3d48b6b1aaa3cfc92e0a6ad808ec1a85) where the task (task ID: 1848be8e198993e734211cb7b1e7a4b5443a330002000000, name=head, rank=0,, pid=27300, memory used=0.08GB) was running was 14.76GB / 15.44GB (0.955438), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: cf8b07c1cb3a1c0115a57d015d5b79e53c2f14b899c26190767b540e) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.31.11.97`. To see the logs of the worker, use `ray logs worker-cf8b07c1cb3a1c0115a57d015d5b79e53c2f14b899c26190767b540e*out -ip 172.31.11.97. Top 10 memory users:
PID     MEM(GB) COMMAND
27671   12.90   /opt/conda/envs/deepspeed/bin/python3.8 -u main.py --local_rank=0 --data_path Dahoas/rm-static Dahoa...
27660   0.19    /opt/conda/envs/deepspeed/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyIxNzIuMzEuMTE...
27413   0.18    /opt/conda/envs/deepspeed/bin/python3.8 /opt/conda/envs/deepspeed/bin/deepspeed --num_gpus 1 main.py...
23104   0.11    /opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/monitor...
27248   0.09    python3 -u /home/ubuntu/.sky/sky_app/sky_job_1
27300   0.08    ray::head, rank=0,
27259   0.08    python3 -u -c import os;from sky.skylet import job_lib, log_lib;job_id = 1 if 1 is not None else job...
27374   0.08    python3 /opt/conda/lib/python3.10/site-packages/sky/skylet/subprocess_daemon.py --parent-pid 27300 -...
25946   0.07    python3 -m sky.skylet.skylet
23264   0.06    /opt/conda/bin/python3.10 -u /opt/conda/lib/python3.10/site-packages/ray/dashboard/agent.py --node-i...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

On the local machine, `sky down` got stuck, and I had to terminate the cluster manually. I assume any autostop/down would not go through either.

@Michaelvll
Collaborator Author

For the `sky down` stuck issue, which cloud are you using? Ray will first try to run `ray stop` on the cluster before it actually starts terminating the instances. We tried to simulate that in our new termination codepath for AWS as well, but it would be nice to see whether that simulation is correct.

@concretevitamin
Member

Was using AWS.

@Michaelvll
Collaborator Author

> Was using AWS.

In that case, it might be worth adding a timeout to the following call:

self.run_on_head(handle, 'ray stop --force')
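
For illustration, a hedged sketch of such a timeout (the `best_effort_ray_stop` helper and the 60-second value are hypothetical, not existing SkyPilot code): run `ray stop --force` over SSH, but never let it block termination.

```python
# Hypothetical sketch: bound the time spent on `ray stop --force` so an
# OOM-stuck head node cannot block cluster termination indefinitely.
import subprocess

RAY_STOP_TIMEOUT_SECONDS = 60  # Illustrative value.


def best_effort_ray_stop(ssh_command_prefix: list) -> None:
    """Run `ray stop --force` over SSH; give up after a bounded wait."""
    try:
        subprocess.run(
            ssh_command_prefix + ['ray', 'stop', '--force'],
            timeout=RAY_STOP_TIMEOUT_SECONDS,
            check=False,  # Best-effort: termination proceeds regardless of the result.
        )
    except subprocess.TimeoutExpired:
        pass  # Head node unresponsive (e.g., OOM); terminate anyway.
```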

@concretevitamin
Member

Can we make the underlying provision lib (`def terminate_instances(region: str, ...`) handle this? It already has knowledge of TAG_RAY_CLUSTER_NAME. Maybe it can query the head first, down/stop it, then handle the workers.

Wdyt @Michaelvll @suquark?
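
For illustration, a rough sketch of that flow, assuming the instances carry the Ray autoscaler tags `ray-cluster-name` and `ray-node-type` (the exact keys behind TAG_RAY_CLUSTER_NAME may differ): query the cluster's instances, split head from workers, optionally run a best-effort `ray stop` on the head, then terminate.

```python
# Hypothetical sketch of terminate_instances() handling head/workers itself.
# Tag keys ('ray-cluster-name', 'ray-node-type') are assumptions based on the
# Ray autoscaler's conventions; pagination is omitted for brevity.
import boto3


def terminate_instances(region: str, cluster_name: str) -> None:
    ec2 = boto3.client('ec2', region_name=region)
    resp = ec2.describe_instances(Filters=[
        {'Name': 'tag:ray-cluster-name', 'Values': [cluster_name]},
        {'Name': 'instance-state-name', 'Values': ['pending', 'running']},
    ])
    head_ids, worker_ids = [], []
    for reservation in resp['Reservations']:
        for inst in reservation['Instances']:
            tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}
            if tags.get('ray-node-type') == 'head':
                head_ids.append(inst['InstanceId'])
            else:
                worker_ids.append(inst['InstanceId'])
    # A best-effort, time-bounded `ray stop` on the head could go here, but
    # termination must not depend on it succeeding.
    if worker_ids:
        ec2.terminate_instances(InstanceIds=worker_ids)
    if head_ids:
        ec2.terminate_instances(InstanceIds=head_ids)
```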


github-actions bot commented Nov 8, 2023

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Nov 8, 2023

This issue was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this as not planned Nov 18, 2023