
Add Cython wrapper for GcsClient #33769

Merged: 73 commits merged into ray-project:master on Apr 12, 2023

Conversation

@pcmoritz (Contributor) commented Mar 27, 2023

Why are these changes needed?

This is with the eventual goal of removing Python gRPC calls from Ray Core / Python workers. As a first cut, I'm removing the Python `GcsClient`.

This PR introduces a Cython `GcsClient` that wraps a simple C++ synchronous GCS client. As a result, the code for the `GcsClient` moves from `ray._private.gcs_utils` to `ray._raylet`. The existing Python-level reconnection logic `_auto_reconnect` is reused almost without changes.

This new Cython client supports the full set of use cases of the old pure-Python `GcsClient` and is (almost) a drop-in replacement. To make sure this is indeed the case, this PR also switches over all uses of the old client and removes the old code.

We also introduce a new exception type `ray.exceptions.RpcError`, which replaces `grpc.RpcError` and allows the Python-level code that does exception handling to keep working.
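
As a concrete illustration of the reconnection pattern the description refers to, here is a minimal sketch of an `_auto_reconnect`-style retry decorator. The decorator name, retry policy, and parameters below are illustrative assumptions, not Ray's actual implementation; only `ray.exceptions.RpcError` comes from this PR.

```python
# Minimal sketch of an _auto_reconnect-style retry decorator.
# The retry policy and parameters here are illustrative assumptions,
# not Ray's actual implementation.
import functools
import time

from ray.exceptions import RpcError  # exception type introduced by this PR


def auto_reconnect(retries=5, backoff_s=1.0):
    """Retry a synchronous client method while the GCS is temporarily unreachable."""

    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            remaining = retries
            while True:
                try:
                    return method(self, *args, **kwargs)
                except RpcError:
                    if remaining <= 0:
                        raise
                    remaining -= 1
                    time.sleep(backoff_s)  # fixed backoff keeps the sketch simple

        return wrapper

    return decorator
```

In the PR itself, the existing `_auto_reconnect` decorator is reused almost unchanged, so the new Cython-backed client methods get the same retry behavior without new Python-side retry code.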

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@pcmoritz pcmoritz requested a review from a team as a code owner March 27, 2023 23:16
@pcmoritz pcmoritz marked this pull request as draft March 27, 2023 23:17
@pcmoritz pcmoritz removed the @author-action-required label Apr 8, 2023
@rickyyx (Contributor) commented Apr 10, 2023

This is nice! It looks like there's quite a bit of plumbing involved; could you update the PR description to describe the changes that are required as a result of this?

@@ -113,6 +113,7 @@ enum class StatusCode : char {
// out of disk.
OutOfDisk = 28,
ObjectUnknownOwner = 29,
RpcError = 30,
Contributor:

What's the difference between this and `GrpcUnknown`?

@pcmoritz (Contributor, Author) Apr 10, 2023:

The RpcError can pass through other error codes from gRPC as well, not only the UNKNOWN error. The ones I have seen being used downstream as part of this PR:

  • UNAVAILABLE
  • UNKNOWN (both of the above will lead to the client retrying)
  • DEADLINE_EXCEEDED (I think this is basically a timeout)
  • RESOURCE_EXHAUSTED ("message too big")

I don't have strong opinions, but ideally I think we would have a well-defined set of RPC error messages that we expose to clients, interpretable and clear to handle, so we don't lock ourselves into gRPC implementation details. That is out of scope for this PR, though; I was just trying to make the minimal set of changes needed to keep the downstream code working here :) cc @scv119

Let me know if you think there is a better strategy to handle things for this PR that is still incremental!

I do have a follow-up PR that exposes the error codes used today in the Python code through Cython as well, so we can get rid of more of the grpcio package.
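
For illustration, here is a hedged sketch of how downstream code that used to catch `grpc.RpcError` might handle `ray.exceptions.RpcError` instead. The `rpc_code` attribute, the `internal_kv_get` signature, and the specific handling below are assumptions, not the API this PR defines; only the gRPC status code numbers are standard.

```python
# Sketch only: handling ray.exceptions.RpcError where grpc.RpcError was
# handled before. The rpc_code attribute and the internal_kv_get signature
# are assumptions for illustration, not the API defined in this PR.
from ray.exceptions import RpcError

GRPC_DEADLINE_EXCEEDED = 4   # standard gRPC status code numbers
GRPC_UNAVAILABLE = 14


def fetch_value(gcs_client, key: bytes):
    try:
        # Hypothetical call shape; timeout in seconds.
        return gcs_client.internal_kv_get(key, namespace=None, timeout=5)
    except RpcError as e:
        code = getattr(e, "rpc_code", None)
        if code == GRPC_UNAVAILABLE:
            # Transient failure; a retry layer would normally handle this.
            raise
        if code == GRPC_DEADLINE_EXCEEDED:
            raise TimeoutError("GCS request timed out") from e
        raise
```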

@pcmoritz (Contributor, Author):

In case you are interested, this is what the follow-up PR looks like so far: ec439b6

Contributor:

Gotcha - makes sense to me. So the long-term plan will be to have a generic RpcError which takes in a more specific error code?

@pcmoritz (Contributor, Author) Apr 11, 2023:

This is really for the Ray Core team to decide; if you prefer to do it differently for this PR, let me know. On the Python level, at least for this PR, I think having the RpcError makes sense given what we had before, since it minimizes the changes. On the C++ side, either option can work (i.e., we can either have a combined RpcError or unwind the errors into different top-level ray::Status errors); I personally don't really have a preference and am happy to change it, and it is an easy change. (While thinking about this, I found I was misusing Status::GrpcUnknown and Status::GrpcUnavailable and fixed that. It doesn't make a difference in behavior, since they all get converted into a RaySystemError on the Python side, which for example happens if your namespace name is not valid.)

@@ -184,6 +184,31 @@ class RAY_EXPORT GcsClient : public std::enable_shared_from_this<GcsClient> {
std::function<void()> resubscribe_func_;
};

class RAY_EXPORT GcsSyncClient {
Contributor:

Should we then name it more explicitly?

explicit GcsSyncClient(const GcsClientOptions &options);
Status Connect();

Status InternalKVGet(const std::string &ns,
@rickyyx (Contributor) Apr 10, 2023:

Is there any reason we could not reuse the existing GCS client's implementation and have this Cython client simply be a wrapper? I.e., why can we no longer use the InternalKVAccessor in this class; is this a Cython limitation?

@pcmoritz (Contributor, Author) Apr 10, 2023:

I actually considered this: the current interface of InternalKVAccessor seemed geared toward asynchronous use, and the methods needed here were really simple to implement, so I discarded the idea. The methods that do exist in the InternalKVAccessor and are synchronous basically all have the wrong signature (missing timeouts, missing information about what the updates actually did, or the wrong types, like bool vs. int). If you strongly feel the InternalKVAccessor should be reused, I could probably make that happen, but I tried to not modify Ray internals as part of this PR to keep it more incremental and risk-free.

I also first tried to reuse the whole existing C++ GCS client but that is basically impossible since it is deeply coupled with the async implementation.
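
To make the signature gap concrete (explicit timeouts and informative return values rather than bare bools), here is a hypothetical sketch of the interface shape a synchronous internal KV client could expose. All names, parameters, and return conventions below are illustrative assumptions, not the methods this PR or `InternalKVAccessor` actually define.

```python
# Hypothetical interface shape for a synchronous internal KV client:
# explicit timeouts on every call and return values that say what happened.
# Names, parameters, and return conventions are illustrative only.
from typing import List, Optional


class SyncInternalKV:
    def get(self, ns: bytes, key: bytes, timeout_s: float) -> Optional[bytes]:
        """Return the stored value, or None if the key does not exist."""
        raise NotImplementedError

    def put(self, ns: bytes, key: bytes, value: bytes, overwrite: bool,
            timeout_s: float) -> int:
        """Return 1 if a new key was added, 0 if an existing key was updated."""
        raise NotImplementedError

    def delete(self, ns: bytes, keys: List[bytes], timeout_s: float) -> int:
        """Return the number of keys that were actually deleted."""
        raise NotImplementedError
```

Something along these lines is presumably what a future unification of the Cython client and InternalKVAccessor would converge on, as discussed below.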

@rickyyx (Contributor) Apr 10, 2023:
> If you strongly feel the InternalKVAccessor should be reused, I could probably make that happen, but I tried to not modify Ray internals as part of this PR to keep it more incremental and risk-free.

Gotcha - no, I don't have a strong preference for reusing that one. I guess my intuition was just that these two could be unified.

So it sounds like maybe in a future PR we could refactor the internal KV accessor to have the proper signatures, and then unify the implementation?

@pcmoritz (Contributor, Author):

Yeah, that sounds great. Whether this is worth doing also depends on whether this client is going to be used from C++ in the future; if it is, the value of doing the unification is definitely higher :)

@pcmoritz (Contributor, Author) commented:

@rickyyx I updated the PR description, and I'm also happy to rename the sync GCS client, e.g. to PythonGcsClient :)

@rkooo567 (Contributor) left a review:

LGTM!

@pcmoritz pcmoritz merged commit 7c9da5c into ray-project:master Apr 12, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
pcmoritz added a commit that referenced this pull request Apr 26, 2023
More progress along the lines of #33769 to remove Python gRPC from Ray Core.
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
pcmoritz added a commit that referenced this pull request May 5, 2023
pcmoritz added a commit to pcmoritz/ray-1 that referenced this pull request May 9, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023