-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Out of Disk prevention #25370
Conversation
ready for initial feedbacks. need add performance optimization and unit test. |
Also, is the PR description accurate? It looks like the PR only fails objects, not tasks, right? |
ah i have some tests covering task cases. (test_task_of_disk*). the missing one is the task arguments.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly nit!
|
||
|
||
@pytest.mark.skipif(platform.system() == "Windows", reason="Not targeting Windows") | ||
def test_put_fits_in_memory(shutdown_only): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this test? (isn't it just a normal situation?)
@@ -93,6 +101,12 @@ Status CreateRequestQueue::ProcessRequests() { | |||
bool spilling_required = false; | |||
auto status = | |||
ProcessRequest(/*fallback_allocator=*/false, *request_it, &spilling_required); | |||
|
|||
if (MayHandleOutOfDisk(*request_it)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personal NIT, but I feel like it might be easier to understand if we don't have MayHandleOutOfDisk
?
if (request_it->error == PlasmaError::OutOfMemory && fs_monitor_.OverCapacity()) {
request_it->error = PlasmaError::OutOfDisk;
FinishRequest(request_it)
}
huh the test fails on linux. looking... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kind of surprised there's no changes to the spill manager? I thought we would need to make changes there so that we only spill if there is enough disk space. Or is the intention that it's okay if we go over the threshold a bit if there is concurrent spilling?
python/ray/exceptions.py
Outdated
def __str__(self): | ||
return super(OutOfDiskError, self).__str__() + ( | ||
"\n" | ||
"The local object store is full and local disk is also full." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can make this error more descriptive.
We should say how much memory and disk are being used, if we can.
Ideally, we should also state how the user should deal with the problem (give the Ray config variable to set, add more disk or nodes, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can create a out of disk documentation and link there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out the plumbing the exact disk size is a bit challenging. Let's defer that to another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about at least listing the disk percentage set in the ray config?
Also, suggest: "The object cannot be created because the local object store is full and at least X% of the local disk is in use."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also add a line about the new config variable to https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html?
python/ray/exceptions.py
Outdated
def __str__(self): | ||
return super(OutOfDiskError, self).__str__() + ( | ||
"\n" | ||
"The local object store is full and local disk is also full." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about at least listing the disk percentage set in the ray config?
Also, suggest: "The object cannot be created because the local object store is full and at least X% of the local disk is in use."
time.sleep(1) | ||
return np.random.rand(20 * 1024 * 1024) # 160 MB data | ||
|
||
with pytest.raises(ray.exceptions.RayTaskError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we check that it's the correct error type? (e.as_instanceof_cause()
)
return false; | ||
} | ||
if (space_info->capacity <= 0) { | ||
RAY_LOG_EVERY_MS(ERROR, 60 * 1000) << path << " has no capacity."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RAY_LOG_EVERY_MS(ERROR, 60 * 1000) << path << " has no capacity."; | |
RAY_LOG_EVERY_MS(ERROR, 60 * 1000) << path << " has no capacity, object creation will fail if spilling is required."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I would still suggest mentioning something about object creation in the error message (since I don't think it's clear why disk space matters), but feel free to merge when ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blindly approve to unblock this one given it's been reviewed by other people.
serve test failure looks unrelated. |
} | ||
|
||
RAY_LOG_EVERY_MS(ERROR, 10 * 1000) | ||
<< path << " is over " << capacity_threshold_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scv119 should this be capacity_threshold_ * 100
? I'm seeing things like
(raylet, ip=172.31.58.175) [2022-06-28 03:48:42,324 E 702775 702805] (raylet) file_system_monitor.cc:105: /mnt/data0/ray is over 0.95% full, available space: 50637901824. Object creation will fail if spilling is required.
Why are these changes needed?
Problem
Ray (on K8s) fails silently when running out of disk space.
Today, when running a script that has a large amount of object spilling, if the disk runs out of space then Kubernetes will silently terminate the node. Autoscaling will kick in and replace the dead node. There is no indication that there was a failure due to disk space.
Instead, we should fail tasks with a good error message when the disk is full.
This solution is straightforward.
We monitor the disk usage, when node disk usage grows over the predefined capacity (like 90%), we fail new task/actor/object put that allocates new objects.
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.