
[Core] Out of Disk prevention #25370

Merged
merged 51 commits into ray-project:master on Jun 22, 2022

Conversation

@scv119 (Contributor) commented Jun 1, 2022

Why are these changes needed?

Problem

Ray (on K8s) fails silently when running out of disk space.
Today, when running a script that has a large amount of object spilling, if the disk runs out of space then Kubernetes will silently terminate the node. Autoscaling will kick in and replace the dead node. There is no indication that there was a failure due to disk space.
Instead, we should fail tasks with a good error message when the disk is full.

The solution is straightforward.

We monitor disk usage, and when a node's disk usage grows beyond a predefined capacity threshold (e.g., 90%), we fail any new task, actor, or object put that would allocate new objects.
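In Python terms, the capacity check amounts to something like the following. This is an illustrative sketch only: the actual monitor is implemented in C++ (src/ray/common/file_system_monitor.cc), and the function name and default threshold here are assumptions.

```python
import shutil

# Illustrative default: fail new allocations once the disk is ~90% full.
CAPACITY_THRESHOLD = 0.90


def over_capacity(path: str, threshold: float = CAPACITY_THRESHOLD) -> bool:
    """Return True if the filesystem containing `path` is over the threshold."""
    usage = shutil.disk_usage(path)
    if usage.total <= 0:
        # Unknown capacity; don't block allocations on bad data.
        return False
    used_fraction = (usage.total - usage.free) / usage.total
    return used_fraction >= threshold
```

A periodic loop in the raylet can poll this check and flip new object creations to an out-of-disk failure while it returns True.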

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119 scv119 marked this pull request as ready for review June 1, 2022 21:00
@scv119 (Contributor, Author) commented Jun 1, 2022

Ready for initial feedback; still need to add performance optimizations and unit tests.

@scv119 scv119 linked an issue Jun 1, 2022 that may be closed by this pull request
@stephanie-wang (Contributor) commented

Also, is the PR description accurate? It looks like the PR only fails objects, not tasks, right?

@scv119 (Contributor, Author) commented Jun 1, 2022

Ah, I have some tests covering task cases (test_task_of_disk*). The missing one is the task arguments.

@scv119 scv119 changed the title [Core][WIP] Out of Disk prevention [Core] Out of Disk prevention Jun 3, 2022
@scv119 scv119 added the do-not-merge Do not merge this PR! label Jun 3, 2022
@scv119 scv119 removed the do-not-merge Do not merge this PR! label Jun 5, 2022
@rkooo567 (Contributor) left a comment

Mostly nit!

python/ray/serialization.py (outdated; resolved)


@pytest.mark.skipif(platform.system() == "Windows", reason="Not targeting Windows")
def test_put_fits_in_memory(shutdown_only):

Do we need this test? (isn't it just a normal situation?)

python/ray/tests/test_out_of_disk_space.py (resolved)
src/ray/common/file_system_monitor.h (resolved)
src/ray/common/file_system_monitor.h (resolved)
src/ray/common/file_system_monitor.cc (resolved)
src/ray/common/file_system_monitor.cc (resolved)
src/ray/common/file_system_monitor.cc (outdated; resolved)
@@ -93,6 +101,12 @@ Status CreateRequestQueue::ProcessRequests() {
bool spilling_required = false;
auto status =
ProcessRequest(/*fallback_allocator=*/false, *request_it, &spilling_required);

if (MayHandleOutOfDisk(*request_it)) {

Personal NIT, but I feel like it might be easier to understand if we don't have MayHandleOutOfDisk ?

  if (request_it->error == PlasmaError::OutOfMemory && fs_monitor_.OverCapacity()) {
    request_it->error = PlasmaError::OutOfDisk;
    FinishRequest(request_it);
  }
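The reviewer's inlined suggestion boils down to the following logic, sketched here in Python with hypothetical stand-in types (the actual code is C++ in the plasma create request queue): if the object store is out of memory and the local disk is also over capacity, spilling cannot help, so the error surfaced to the caller should be OutOfDisk rather than OutOfMemory.

```python
from dataclasses import dataclass
from enum import Enum


class PlasmaError(Enum):
    OK = 0
    OutOfMemory = 1
    OutOfDisk = 2


@dataclass
class CreateRequest:
    # Stand-in for a plasma object-creation request carrying its error state.
    error: PlasmaError


def maybe_convert_to_out_of_disk(request: CreateRequest,
                                 disk_over_capacity: bool) -> None:
    # Out of object-store memory AND out of disk means spilling can't free
    # space, so report OutOfDisk instead of retrying/spilling.
    if request.error == PlasmaError.OutOfMemory and disk_over_capacity:
        request.error = PlasmaError.OutOfDisk
```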

src/ray/object_manager/pull_manager.cc (outdated; resolved)
@scv119 (Contributor, Author) commented Jun 6, 2022

Huh, the test fails on Linux. Looking into it...

@stephanie-wang (Contributor) left a comment

I'm kind of surprised there are no changes to the spill manager. I thought we would need changes there so that we only spill if there is enough disk space. Or is the intention that it's okay to go slightly over the threshold when there is concurrent spilling?

python/ray/tests/test_out_of_disk_space.py (outdated; resolved)
def __str__(self):
    return super(OutOfDiskError, self).__str__() + (
        "\n"
        "The local object store is full and local disk is also full."

I think we can make this error more descriptive.

We should say how much memory and disk are being used, if we can.

Ideally, we should also state how the user should deal with the problem (give the Ray config variable to set, add more disk or nodes, etc).


Maybe we can create an out-of-disk documentation page and link to it here?

@scv119 (Author) replied:

Turns out plumbing through the exact disk size is a bit challenging. Let's defer that to another PR.


How about at least listing the disk percentage set in the ray config?

Also, suggest: "The object cannot be created because the local object store is full and at least X% of the local disk is in use."
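The suggested wording could be assembled as follows. This is a sketch only: the helper name and the idea of formatting the configured threshold into the message are assumptions, not the merged implementation (which deferred reporting exact disk sizes to a later PR).

```python
def out_of_disk_message(threshold: float) -> str:
    """Build the suggested error text from the configured capacity fraction.

    `threshold` is the disk-capacity fraction from the Ray config
    (e.g. 0.95); it is scaled to a percentage for display.
    """
    return (
        "The object cannot be created because the local object store is full "
        f"and at least {threshold * 100:.0f}% of the local disk is in use."
    )
```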

src/ray/protobuf/common.proto (outdated; resolved)
src/ray/object_manager/pull_manager.cc (outdated; resolved)
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 9, 2022
@scv119 scv119 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 13, 2022
@stephanie-wang (Contributor) left a comment

Should we also add a line about the new config variable to https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html?

python/ray/exceptions.py (outdated; resolved)
    time.sleep(1)
    return np.random.rand(20 * 1024 * 1024)  # 160 MB data

with pytest.raises(ray.exceptions.RayTaskError):

Can we check that it's the correct error type? (e.as_instanceof_cause())
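The suggested assertion pattern can be illustrated with minimal stand-ins for Ray's exception types. `RayTaskError` below is a hypothetical simplification of `ray.exceptions.RayTaskError`, whose real `as_instanceof_cause()` returns an exception that is an instance of the original cause's class, so tests can assert on the true error type rather than just the generic wrapper.

```python
class OutOfDiskError(Exception):
    pass


class RayTaskError(Exception):
    """Minimal stand-in for ray.exceptions.RayTaskError (illustrative only)."""

    def __init__(self, cause: Exception):
        super().__init__(str(cause))
        self.cause = cause

    def as_instanceof_cause(self) -> Exception:
        # The real API returns an exception that isinstance-matches the
        # underlying cause's class; here we return the cause directly.
        return self.cause


def raises_with_cause(fn, cause_type) -> bool:
    # Assert both that a RayTaskError was raised and that its underlying
    # cause has the expected type, per the review suggestion.
    try:
        fn()
    except RayTaskError as e:
        return isinstance(e.as_instanceof_cause(), cause_type)
    return False
```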

return false;
}
if (space_info->capacity <= 0) {
RAY_LOG_EVERY_MS(ERROR, 60 * 1000) << path << " has no capacity.";

Suggested change
RAY_LOG_EVERY_MS(ERROR, 60 * 1000) << path << " has no capacity.";
RAY_LOG_EVERY_MS(ERROR, 60 * 1000) << path << " has no capacity, object creation will fail if spilling is required.";

src/ray/common/file_system_monitor.cc (outdated; resolved)
src/ray/common/file_system_monitor.h (resolved)
src/ray/common/status.h (resolved)
src/ray/object_manager/plasma/plasma.fbs (outdated; resolved)
src/ray/common/file_system_monitor.cc (resolved)
@stephanie-wang (Contributor) left a comment

Looks good! I would still suggest mentioning something about object creation in the error message (since I don't think it's clear why disk space matters), but feel free to merge when ready.

@fishbone (Contributor) left a comment

Blindly approving to unblock this one, given it's been reviewed by other people.

@scv119 (Contributor, Author) commented Jun 22, 2022

The Serve test failure looks unrelated.

@scv119 scv119 merged commit afb092a into ray-project:master Jun 22, 2022
}

RAY_LOG_EVERY_MS(ERROR, 10 * 1000)
<< path << " is over " << capacity_threshold_

@scv119 should this be capacity_threshold_ * 100? I'm seeing things like

(raylet, ip=172.31.58.175) [2022-06-28 03:48:42,324 E 702775 702805] (raylet) file_system_monitor.cc:105: /mnt/data0/ray is over 0.95% full, available space: 50637901824. Object creation will fail if spilling is required.
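The fix implied here is to scale the stored fraction by 100 before printing it with a percent sign. A minimal sketch (the helper name is hypothetical; the real log line is built in C++ in file_system_monitor.cc):

```python
def format_over_capacity(path: str, threshold: float, available: int) -> str:
    # `threshold` is stored as a fraction (e.g. 0.95), so it must be
    # multiplied by 100 before being rendered as a percentage.
    return (
        f"{path} is over {threshold * 100:.0f}% full, "
        f"available space: {available}. "
        "Object creation will fail if spilling is required."
    )
```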

Development

Successfully merging this pull request may close these issues.

Implement the Out of Disk prevention mechanism
5 participants