Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify AddSpilledUrl into UpdateObjectLocationBatch RPC #23872

Merged
merged 13 commits into from
Apr 18, 2022

Conversation

jjyao
Copy link
Collaborator

@jjyao jjyao commented Apr 12, 2022

Why are these changes needed?

  • Logically these two rpcs are about notifying the owner about the object location changes, so we should just have one rpc for that purpose. This prevents out-of-order updates seen by the owner (i.e. receiving object removed from object store before spill update). Also by using UpdateObjectLocationBatch, we get batch update for free.
  • Maintain a FIFO order for object location updates so we won't have starvation.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

src/ray/protobuf/core_worker.proto Outdated Show resolved Hide resolved
<< " and object directory has been informed";
}
});
object_directory_->ReportObjectSpilled(object_id,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be thread safe since it's running in the io_service thread. @rkooo567 Could you double check here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. If this is used within the node manager io service (main_service_ in object manager), it should be okay. (so I think it should be thread-safe now)

@jjyao jjyao changed the title [WIP] Unify AddSpilledUrl into UpdateObjectLocationBatch RPC Unify AddSpilledUrl into UpdateObjectLocationBatch RPC Apr 13, 2022
Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, good to see code being removed!

src/ray/core_worker/core_worker.cc Outdated Show resolved Hide resolved
src/ray/core_worker/core_worker.cc Outdated Show resolved Hide resolved
@@ -877,6 +877,63 @@ TEST_F(SingleNodeTest, TestObjectInterface) {
ASSERT_TRUE(results[1]->IsException());
}

TEST_F(SingleNodeTest, TestHandleUpdateObjectLocationBatch) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for adding this test, but I think the existing Python tests are probably enough for this PR... I've actually been meaning to deprecate this test suite since the maintainability is not really worth the test coverage compared to e2e integration tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also iirc, this test doesn't even run in the master CI!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I also realized that. Also the current core_worker_test is like semi integration test, it needs to launch gcs, raylet, etc. We should try to make it more unit-testable via techniques like dependency ingestion.

src/ray/object_manager/ownership_based_object_directory.h Outdated Show resolved Hide resolved
src/ray/protobuf/core_worker.proto Outdated Show resolved Hide resolved
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! A couple small comments

@@ -877,6 +877,63 @@ TEST_F(SingleNodeTest, TestObjectInterface) {
ASSERT_TRUE(results[1]->IsException());
}

TEST_F(SingleNodeTest, TestHandleUpdateObjectLocationBatch) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also iirc, this test doesn't even run in the master CI!

src/ray/core_worker/core_worker.cc Outdated Show resolved Hide resolved
src/ray/core_worker/reference_count.cc Show resolved Hide resolved
src/ray/object_manager/ownership_based_object_directory.cc Outdated Show resolved Hide resolved
src/ray/object_manager/ownership_based_object_directory.h Outdated Show resolved Hide resolved
src/ray/core_worker/core_worker.cc Show resolved Hide resolved
Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this looks good! Left some minor comments.

I noticed the PR also includes FIFO location updates now; could you update the PR description to include that and also add a C++ test?

src/ray/protobuf/core_worker.proto Outdated Show resolved Hide resolved
@jjyao
Copy link
Collaborator Author

jjyao commented Apr 18, 2022

dataset_shuffle_random_shuffle_1tb passed:

time = 824.6457576751709
  success = 1
  num_partitions = 1000
  partition_size = 1000000000
  perf_metrics = [{'perf_metric_name': 'peak_driver_memory', 'perf_metric_value': 6128968000, 'perf_metric_type': 'MEMORY'}, {'perf_metric_name': 'runtime', 'perf_metric_value': 824.6457576751709, 'perf_metric_type': 'LATENCY'}]

@jjyao jjyao merged commit 5d7f45f into ray-project:master Apr 18, 2022
@jjyao jjyao deleted the jjyao/location branch April 18, 2022 04:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants