-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core][Spilled Object Leakage] More robust spilled object deletion #29014
Conversation
TBD, add tests. |
Is it something we can reproducing from cpp tests or by killing the spill worker in the middle of the test? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM!
/// \param num_retries Num of retries allowed in case of failure, zero or negative | ||
/// means don't retry. | ||
void DeleteSpilledObjects(std::vector<std::string> urls_to_delete, | ||
int64_t num_retries = kDefaultSpilledObjectDeleteRetries); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_retries_left?
@@ -365,6 +377,9 @@ class LocalObjectManager { | |||
/// The last time a restore log finished. | |||
int64_t last_restore_log_ns_ = 0; | |||
|
|||
/// The number of failed deletion requests. | |||
std::atomic<int64_t> num_failed_deletion_requests_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw I think this doesn't have to be atomic (it is single threaded)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm i thought the io_worker->rpc_client()->DeleteSpilledObjects
callback happens in a different thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should start document explicitly where callbacks are run w.r.t threading model lol
@@ -613,6 +624,8 @@ void LocalObjectManager::RecordMetrics() const { | |||
ray::stats::STATS_spill_manager_request_total.Record(spilled_objects_total_, "Spilled"); | |||
ray::stats::STATS_spill_manager_request_total.Record(restored_objects_total_, | |||
"Restored"); | |||
ray::stats::STATS_spill_manager_request_total.Record(num_failed_deletion_requests_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm Wonder if we need more breakdown;;
type: spill/restore/delete
status: success/failed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i could work on this as part of 2.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any release tests we need to trigger for this? Or it was originally from some other failures.
@@ -418,7 +425,11 @@ def restore_spilled_objects( | |||
def delete_spilled_objects(self, urls: List[str]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel the API could be better by returning failure urls (e.g. failure in parsing, failure in delete file etc...)
@@ -143,6 +148,39 @@ def wait_until_actor_dead(): | |||
assert_no_thrashing(address["address"]) | |||
|
|||
|
|||
@pytest.mark.skipif(platform.system() in ["Windows"], reason="Failing on Windows.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<< status.ToString(); | ||
<< status.ToString() << ", retry count: " << num_retries; | ||
|
||
if (num_retries > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need back-off for retry?
@@ -613,6 +624,8 @@ void LocalObjectManager::RecordMetrics() const { | |||
ray::stats::STATS_spill_manager_request_total.Record(spilled_objects_total_, "Spilled"); | |||
ray::stats::STATS_spill_manager_request_total.Record(restored_objects_total_, | |||
"Restored"); | |||
ray::stats::STATS_spill_manager_request_total.Record(num_failed_deletion_requests_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i could work on this as part of 2.2.
@@ -365,6 +377,9 @@ class LocalObjectManager { | |||
/// The last time a restore log finished. | |||
int64_t last_restore_log_ns_ = 0; | |||
|
|||
/// The number of failed deletion requests. | |||
std::atomic<int64_t> num_failed_deletion_requests_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should start document explicitly where callbacks are run w.r.t threading model lol
…ay-project#29014) We have noticed spilled objects not deleted even if the job creating those objects finished execution. After reading the code my theory is that the object delegation worker failed in the middle of deleting spilled files, which doesn't handle well in today's spilled object deletion logic. Though I no longer get a reproduction, (which I suspect due to the fix ray-project#26395), we can enhance the failure handle logic when object deletion failed. Signed-off-by: Weichen Xu <[email protected]>
Why are these changes needed?
We have noticed spilled objects not deleted even if the job creating those objects finished execution. After reading the code my theory is that the object delegation worker failed in the middle of deleting spilled files, which doesn't handle well in today's spilled object deletion logic.
Though I no longer get a reproduction, (which I suspect due to the fix #26395), we can enhance the failure handle logic when object deletion failed.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.