
[Core] Out of Disk prevention #25370

Merged
merged 51 commits into ray-project:master on Jun 22, 2022

Conversation

@scv119 (Contributor) commented Jun 1, 2022

Why are these changes needed?

Problem

Ray (on K8s) fails silently when running out of disk space.
Today, when running a script that has a large amount of object spilling, if the disk runs out of space then Kubernetes will silently terminate the node. Autoscaling will kick in and replace the dead node. There is no indication that there was a failure due to disk space.
Instead, we should fail tasks with a good error message when the disk is full.

The solution is straightforward.

We monitor disk usage, and when a node's disk usage grows beyond a predefined capacity threshold (e.g., 90%), we fail any new task, actor, or object put that would allocate new objects.
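In Python terms, the capacity check amounts to something like the following. This is an illustrative sketch only: the actual monitor is implemented in C++ (src/ray/common/file_system_monitor.cc), and the function name and default threshold here are assumptions.

```python
import shutil

# Illustrative default: fail new allocations once the disk is ~90% full.
CAPACITY_THRESHOLD = 0.90


def over_capacity(path: str, threshold: float = CAPACITY_THRESHOLD) -> bool:
    """Return True if the filesystem containing `path` is over the threshold."""
    usage = shutil.disk_usage(path)
    if usage.total <= 0:
        # Unknown capacity; don't block allocations on bad data.
        return False
    used_fraction = (usage.total - usage.free) / usage.total
    return used_fraction >= threshold
```

A periodic loop in the raylet can poll this check and flip new object creations to an out-of-disk failure while it returns True.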

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119 scv119 marked this pull request as ready for review June 1, 2022 21:00
@scv119 (Contributor, Author) commented Jun 1, 2022

Ready for initial feedback; still need to add performance optimizations and unit tests.

@scv119 scv119 linked an issue Jun 1, 2022 that may be closed by this pull request
@stephanie-wang (Contributor) commented

Also, is the PR description accurate? It looks like the PR only fails objects, not tasks, right?

@scv119 (Contributor, Author) commented Jun 1, 2022

Ah, I have some tests covering task cases (test_task_of_disk*). The missing one is the task arguments.

@scv119 scv119 changed the title [Core][WIP] Out of Disk prevention [Core] Out of Disk prevention Jun 3, 2022
@scv119 scv119 added the do-not-merge Do not merge this PR! label Jun 3, 2022
@scv119 scv119 removed the do-not-merge Do not merge this PR! label Jun 5, 2022
@rkooo567 (Contributor) left a comment

Mostly nit!

python/ray/serialization.py (outdated; resolved)


@pytest.mark.skipif(platform.system() == "Windows", reason="Not targeting Windows")
def test_put_fits_in_memory(shutdown_only):

Do we need this test? (isn't it just a normal situation?)

python/ray/tests/test_out_of_disk_space.py (resolved)
src/ray/common/file_system_monitor.h (resolved)
src/ray/common/file_system_monitor.h (resolved)
src/ray/common/file_system_monitor.cc (resolved)
src/ray/common/file_system_monitor.cc (resolved)
src/ray/common/file_system_monitor.cc (outdated; resolved)
@@ -93,6 +101,12 @@ Status CreateRequestQueue::ProcessRequests() {
bool spilling_required = false;
auto status =
ProcessRequest(/*fallback_allocator=*/false, *request_it, &spilling_required);

if (MayHandleOutOfDisk(*request_it)) {

Personal NIT, but I feel like it might be easier to understand if we don't have MayHandleOutOfDisk ?

  if (request_it->error == PlasmaError::OutOfMemory && fs_monitor_.OverCapacity()) {
    request_it->error = PlasmaError::OutOfDisk;
    FinishRequest(request_it);
  }
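The reviewer's inlined suggestion boils down to the following logic, sketched here in Python with hypothetical stand-in types (the actual code is C++ in the plasma create request queue): if the object store is out of memory and the local disk is also over capacity, spilling cannot help, so the error surfaced to the caller should be OutOfDisk rather than OutOfMemory.

```python
from dataclasses import dataclass
from enum import Enum


class PlasmaError(Enum):
    OK = 0
    OutOfMemory = 1
    OutOfDisk = 2


@dataclass
class CreateRequest:
    # Stand-in for a plasma object-creation request carrying its error state.
    error: PlasmaError


def maybe_convert_to_out_of_disk(request: CreateRequest,
                                 disk_over_capacity: bool) -> None:
    # Out of object-store memory AND out of disk means spilling can't free
    # space, so report OutOfDisk instead of retrying/spilling.
    if request.error == PlasmaError.OutOfMemory and disk_over_capacity:
        request.error = PlasmaError.OutOfDisk
```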

src/ray/object_manager/pull_manager.cc (outdated; resolved)
@scv119 (Contributor, Author) commented Jun 6, 2022

Huh, the test fails on Linux. Looking into it...

@stephanie-wang (Contributor) left a comment

I'm kind of surprised there are no changes to the spill manager. I thought we would need changes there so that we only spill if there is enough disk space. Or is the intention that it's okay to go slightly over the threshold when there is concurrent spilling?

python/ray/tests/test_out_of_disk_space.py (outdated; resolved)
def __str__(self):
    return super(OutOfDiskError, self).__str__() + (
        "\n"
        "The local object store is full and local disk is also full."

I think we can make this error more descriptive.

We should say how much memory and disk are being used, if we can.

Ideally, we should also state how the user should deal with the problem (give the Ray config variable to set, add more disk or nodes, etc).


Maybe we can create an out-of-disk documentation page and link to it here?

@scv119 (Author) replied:

Turns out plumbing through the exact disk size is a bit challenging. Let's defer that to another PR.


How about at least listing the disk percentage set in the ray config?

Also, suggest: "The object cannot be created because the local object store is full and at least X% of the local disk is in use."
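The suggested wording could be assembled as follows. This is a sketch only: the helper name and the idea of formatting the configured threshold into the message are assumptions, not the merged implementation (which deferred reporting exact disk sizes to a later PR).

```python
def out_of_disk_message(threshold: float) -> str:
    """Build the suggested error text from the configured capacity fraction.

    `threshold` is the disk-capacity fraction from the Ray config
    (e.g. 0.95); it is scaled to a percentage for display.
    """
    return (
        "The object cannot be created because the local object store is full "
        f"and at least {threshold * 100:.0f}% of the local disk is in use."
    )
```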

src/ray/protobuf/common.proto (outdated; resolved)
src/ray/object_manager/pull_manager.cc (outdated; resolved)
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 9, 2022
@scv119 scv119 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 13, 2022
@stephanie-wang (Contributor) left a comment

Should we also add a line about the new config variable to https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html?

python/ray/exceptions.py (outdated; resolved)
    time.sleep(1)
    return np.random.rand(20 * 1024 * 1024)  # 160 MB data

with pytest.raises(ray.exceptions.RayTaskError):

Can we check that it's the correct error type? (e.as_instanceof_cause())
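The suggested assertion pattern can be illustrated with minimal stand-ins for Ray's exception types. `RayTaskError` below is a hypothetical simplification of `ray.exceptions.RayTaskError`, whose real `as_instanceof_cause()` returns an exception that is an instance of the original cause's class, so tests can assert on the true error type rather than just the generic wrapper.

```python
class OutOfDiskError(Exception):
    pass


class RayTaskError(Exception):
    """Minimal stand-in for ray.exceptions.RayTaskError (illustrative only)."""

    def __init__(self, cause: Exception):
        super().__init__(str(cause))
        self.cause = cause

    def as_instanceof_cause(self) -> Exception:
        # The real API returns an exception that isinstance-matches the
        # underlying cause's class; here we return the cause directly.
        return self.cause


def raises_with_cause(fn, cause_type) -> bool:
    # Assert both that a RayTaskError was raised and that its underlying
    # cause has the expected type, per the review suggestion.
    try:
        fn()
    except RayTaskError as e:
        return isinstance(e.as_instanceof_cause(), cause_type)
    return False
```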

return false;
}
if (space_info->capacity <= 0) {
RAY_LOG_EVERY_MS(ERROR, 60 * 1000) << path << " has no capacity.";

Suggested change
RAY_LOG_EVERY_MS(ERROR, 60 * 1000) << path << " has no capacity.";
RAY_LOG_EVERY_MS(ERROR, 60 * 1000) << path << " has no capacity, object creation will fail if spilling is required.";

src/ray/common/file_system_monitor.cc (outdated; resolved)
src/ray/common/file_system_monitor.h (resolved)
src/ray/common/status.h (resolved)
src/ray/object_manager/plasma/plasma.fbs (outdated; resolved)
src/ray/common/file_system_monitor.cc (resolved)
@stephanie-wang (Contributor) left a comment

Looks good! I would still suggest mentioning something about object creation in the error message (since I don't think it's clear why disk space matters), but feel free to merge when ready.

@fishbone (Contributor) left a comment

Blindly approving to unblock this one, given it's been reviewed by other people.

@scv119 (Contributor, Author) commented Jun 22, 2022

The Serve test failure looks unrelated.

@scv119 scv119 merged commit afb092a into ray-project:master Jun 22, 2022
}

RAY_LOG_EVERY_MS(ERROR, 10 * 1000)
<< path << " is over " << capacity_threshold_

@scv119 should this be capacity_threshold_ * 100? I'm seeing things like

(raylet, ip=172.31.58.175) [2022-06-28 03:48:42,324 E 702775 702805] (raylet) file_system_monitor.cc:105: /mnt/data0/ray is over 0.95% full, available space: 50637901824. Object creation will fail if spilling is required.
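The fix implied here is to scale the stored fraction by 100 before printing it with a percent sign. A minimal sketch (the helper name is hypothetical; the real log line is built in C++ in file_system_monitor.cc):

```python
def format_over_capacity(path: str, threshold: float, available: int) -> str:
    # `threshold` is stored as a fraction (e.g. 0.95), so it must be
    # multiplied by 100 before being rendered as a percentage.
    return (
        f"{path} is over {threshold * 100:.0f}% full, "
        f"available space: {available}. "
        "Object creation will fail if spilling is required."
    )
```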

Development

Successfully merging this pull request may close these issues.

Implement the Out of Disk prevention mechanism
5 participants