
[Core][Spilled Object Leakage] More robust spilled object deletion #29014

Merged (15 commits) on Oct 7, 2022

Conversation

Contributor

@scv119 scv119 commented Oct 3, 2022

Why are these changes needed?

We have noticed spilled objects not being deleted even after the job that created them finished execution. After reading the code, my theory is that the object deletion worker failed in the middle of deleting spilled files, a case that today's spilled object deletion logic doesn't handle well.

Though I can no longer reproduce the issue (which I suspect is due to the fix #26395), we can still harden the failure handling logic for when object deletion fails.
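The hardened behavior described above can be sketched as follows (a minimal Python sketch of the idea only, not Ray's actual implementation; the function name and treating spill URLs as local paths are illustrative assumptions):

```python
import os
from typing import List


def delete_spilled_objects_with_retry(urls: List[str], num_retries: int = 3) -> None:
    """Delete spilled files, retrying the batch if deletion fails midway.

    Illustrative sketch: zero or negative num_retries means no retry,
    mirroring the semantics of the C++ API added in this PR.
    """
    try:
        for url in urls:
            path = url.split("?")[0]  # assume URL is a local path plus optional query
            if os.path.exists(path):
                os.remove(path)
    except OSError:
        if num_retries > 0:
            # Retry the whole batch; files deleted on an earlier
            # attempt are skipped by the exists() check above.
            delete_spilled_objects_with_retry(urls, num_retries - 1)
        # else: give up and let a failure counter/metric record it
```

Retrying the whole batch is safe here because deletion is idempotent: files removed on an earlier attempt are simply skipped on the next pass.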

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119 scv119 linked an issue Oct 3, 2022 that may be closed by this pull request
@scv119
Copy link
Contributor Author

scv119 commented Oct 3, 2022

TBD, add tests.

@rkooo567 rkooo567 self-assigned this Oct 4, 2022
@scv119 scv119 marked this pull request as ready for review October 6, 2022 20:55
@rkooo567
Copy link
Contributor

rkooo567 commented Oct 6, 2022

my theory is that the object deletion worker failed in the middle of deleting spilled files, a case that today's spilled object deletion logic doesn't handle well.

Is this something we can reproduce from cpp tests, or by killing the spill worker in the middle of the test?

Contributor

@rkooo567 rkooo567 left a comment


Mostly LGTM!

/// \param num_retries Number of retries allowed in case of failure; zero or negative
/// means don't retry.
void DeleteSpilledObjects(std::vector<std::string> urls_to_delete,
                          int64_t num_retries = kDefaultSpilledObjectDeleteRetries);
Contributor

num_retries_left?

@@ -365,6 +377,9 @@ class LocalObjectManager {
/// The last time a restore log finished.
int64_t last_restore_log_ns_ = 0;

/// The number of failed deletion requests.
std::atomic<int64_t> num_failed_deletion_requests_ = 0;
Contributor

Btw I think this doesn't have to be atomic (it is single threaded)

Contributor Author

Hmm, I thought the io_worker->rpc_client()->DeleteSpilledObjects callback happens on a different thread.

Contributor

maybe we should start documenting explicitly where callbacks are run w.r.t. the threading model lol
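One way to pin down the threading model being discussed here (a hypothetical Python sketch, unrelated to Ray's actual C++ event loop): marshal all callbacks onto a single dispatcher thread, so that state touched only from callbacks, like a failure counter, needs no atomics:

```python
import queue
import threading


class SingleThreadDispatcher:
    """Runs all posted callbacks on one dedicated thread, so shared
    state touched only inside callbacks needs no locks or atomics."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            cb = self._queue.get()
            if cb is None:  # sentinel: shut down the dispatcher
                break
            cb()

    def post(self, cb):
        """Schedule a callback; safe to call from any thread."""
        self._queue.put(cb)

    def stop(self):
        """Drain already-posted callbacks, then stop the thread."""
        self._queue.put(None)
        self._thread.join()
```

With this design the counter increments all happen on one thread in FIFO order, which is the property that would let `num_failed_deletion_requests_` be a plain integer.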

@@ -613,6 +624,8 @@ void LocalObjectManager::RecordMetrics() const {
ray::stats::STATS_spill_manager_request_total.Record(spilled_objects_total_, "Spilled");
ray::stats::STATS_spill_manager_request_total.Record(restored_objects_total_,
"Restored");
ray::stats::STATS_spill_manager_request_total.Record(num_failed_deletion_requests_,
Contributor

Hmm, wonder if we need more of a breakdown:

type: spill/restore/delete
status: success/failed
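The suggested breakdown could be modeled as a counter keyed by two tags (a hypothetical sketch, not Ray's stats API):

```python
from collections import defaultdict


class SpillManagerMetrics:
    """Counter with a (type, status) breakdown, e.g.
    type: spill/restore/delete, status: success/failed."""

    def __init__(self):
        self._counts = defaultdict(int)

    def record(self, op_type: str, status: str, value: int = 1):
        assert op_type in ("spill", "restore", "delete")
        assert status in ("success", "failed")
        self._counts[(op_type, status)] += value

    def get(self, op_type: str, status: str) -> int:
        return self._counts[(op_type, status)]
```

Keeping both dimensions as tags (rather than separate counters) makes it easy to aggregate either way, e.g. all failures across operation types.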

Contributor

I could work on this as part of 2.2.

python/ray/_private/external_storage.py
Contributor

@rickyyx rickyyx left a comment


Are there any release tests we need to trigger for this? Or was it originally from some other failure?

@@ -418,7 +425,11 @@ def restore_spilled_objects(
def delete_spilled_objects(self, urls: List[str]):
Contributor

Feels like the API could be better if it returned the URLs that failed (e.g. failure in parsing, failure in deleting the file, etc.).
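Such an API might look like this (a hypothetical Python sketch of the suggestion, not the actual external_storage.py code; the failure-reason strings are illustrative):

```python
import os
from typing import List, Tuple


def delete_spilled_objects(urls: List[str]) -> List[Tuple[str, str]]:
    """Attempt to delete each spilled file; instead of raising on the
    first error, return (url, reason) pairs for the ones that failed."""
    failed = []
    for url in urls:
        try:
            path = url.split("?")[0]  # assume URL is a local path plus optional query
            os.remove(path)
        except OSError as e:
            failed.append((url, str(e)))
    return failed
```

Returning the failed subset lets the caller retry only those URLs, or record per-failure reasons in a metric, rather than treating the whole batch as all-or-nothing.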

@@ -143,6 +148,39 @@ def wait_until_actor_dead():
assert_no_thrashing(address["address"])


@pytest.mark.skipif(platform.system() in ["Windows"], reason="Failing on Windows.")
Contributor

<< status.ToString();
<< status.ToString() << ", retry count: " << num_retries;

if (num_retries > 0) {
Contributor

do we need back-off for retry?
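If back-off is added, a standard capped exponential schedule would be a natural choice (illustrative sketch; the base and cap values are made up, and jitter is omitted for brevity):

```python
def backoff_delay_ms(attempt: int,
                     base_ms: int = 100,
                     max_ms: int = 10_000) -> int:
    """Capped exponential back-off delay for the given retry attempt:
    base * 2^attempt, never exceeding max_ms."""
    return min(base_ms * (2 ** attempt), max_ms)
```

The cap matters for a long retry budget: without it, a few failed attempts would push the delay into minutes and the spilled files would linger far longer than necessary.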


@scv119 scv119 merged commit c40632e into ray-project:master Oct 7, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…ay-project#29014)
Signed-off-by: Weichen Xu <[email protected]>
Development

Successfully merging this pull request may close these issues.

[Core] spilled object leaked.
[Core] raylet crashed due to unhandled std::length_error
4 participants