OOM will cause the ray job to fail #622
Comments
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
#1218 should mitigate this issue.
This should be mitigated by the upgrade to Ray 2.4.0 (#1734), due to the OOM prevention feature: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html
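For reference, the memory monitor behind that feature is configured through environment variables read by the raylet, as described on the linked docs page. A minimal sketch of tuning it for a local cluster (the specific values below are illustrative, not SkyPilot defaults):

```python
# Sketch: tuning Ray's OOM-prevention memory monitor (Ray >= 2.4).
# These env vars are read by the raylet at startup, so for a local
# cluster they must be set before ray.init(); for a multi-node cluster
# they must be exported before `ray start` on each node.
import os

# Kill the most recently scheduled task when node memory usage exceeds
# 90% (Ray's default threshold is 0.95). Value chosen for illustration.
os.environ["RAY_memory_usage_threshold"] = "0.9"
# How often (in ms) the monitor checks memory; setting 0 disables it.
os.environ["RAY_memory_monitor_refresh_ms"] = "250"

import ray

ray.init()
```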
It seems that an OOM-ed cluster also fails to respond to [...]. A 2-node cluster running a DeepSpeed job [...]. On local machine, [...].
For the [...]
Was using AWS.
In that case, it might be worth adding a timeout to the following call: `sky/backends/cloud_vm_ray_backend.py`, line 3363 (at commit `7262e21`).
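A hedged sketch of what such a timeout could look like, wrapping the call in a worker thread rather than changing the call itself; `call_with_timeout` and the 30-second limit are illustrative, not existing SkyPilot code:

```python
# Sketch: guarding a potentially hanging remote call with a timeout.
import concurrent.futures


def call_with_timeout(fn, *args, timeout=30, **kwargs):
    """Run fn in a worker thread and give up after `timeout` seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # The cluster is likely unresponsive (e.g. OOM-ed); treat it as
        # unreachable instead of blocking forever.
        return None
    finally:
        # Do not wait for the (possibly stuck) worker thread to finish.
        pool.shutdown(wait=False)
```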
Can we make the underlying provision lib handle this (`sky/provision/aws/instance.py`, line 60, at commit `7262e21`)? It already has knowledge of [...]. Wdyt @Michaelvll @suquark?
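A rough sketch of what "handle it in the provision lib" could mean: query and stop the instances directly through EC2, bypassing the (possibly hung) Ray head node. The `ray-cluster-name` tag key is an assumption for illustration, not necessarily what `sky/provision/aws/instance.py` actually uses:

```python
# Sketch: stopping an unresponsive cluster's instances directly via EC2.
# The tag key 'ray-cluster-name' is assumed here for illustration.
import boto3


def stop_cluster_instances(cluster_name: str, region: str) -> None:
    ec2 = boto3.client('ec2', region_name=region)
    resp = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:ray-cluster-name', 'Values': [cluster_name]},
            {'Name': 'instance-state-name', 'Values': ['running']},
        ])
    instance_ids = [
        inst['InstanceId']
        for reservation in resp['Reservations']
        for inst in reservation['Instances']
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
```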
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stalled for 10 days with no activity.
When the system is about to run out of memory, `ray job submit` and `ray job stop` will fail (`submit` will say the task was successfully submitted, but `ray job status` will say it failed; `stop` will keep waiting forever for the job to be killed), so the user cannot even kill a process.

One way to fix this may be to abandon the use of `ray job` entirely and track the `pid` of tasks ourselves.
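A minimal sketch of that "track the `pid` ourselves" idea, assuming the job is launched as a plain subprocess; the pid-file path and function names are illustrative:

```python
# Sketch: tracking and killing a job by pid instead of `ray job stop`.
import os
import signal
import subprocess


def start_job(cmd: list[str], pid_file: str) -> int:
    # start_new_session=True puts the job in its own process group so
    # the whole process tree can be killed at once later.
    proc = subprocess.Popen(cmd, start_new_session=True)
    with open(pid_file, 'w') as f:
        f.write(str(proc.pid))
    return proc.pid


def kill_job(pid_file: str) -> None:
    with open(pid_file) as f:
        pid = int(f.read().strip())
    try:
        # SIGKILL the entire process group; unlike `ray job stop`, this
        # does not depend on Ray's job server being responsive.
        os.killpg(os.getpgid(pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # Already gone.
```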