Add Cython wrapper for GcsClient #33769
Conversation
Signed-off-by: Philipp Moritz <[email protected]>
This is nice! Looks like there's quite a bit of plumbing - could you update the PR description to explain which changes are required as a result of this?
@@ -113,6 +113,7 @@ enum class StatusCode : char {
  // out of disk.
  OutOfDisk = 28,
  ObjectUnknownOwner = 29,
  RpcError = 30,
What's the difference between this and `GrpcUnknown`?
The `RpcError` can pass through other error codes from gRPC as well, not only the `UNKNOWN` error. The ones I have seen used downstream as part of this PR:
- `UNAVAILABLE`
- `UNKNOWN` (both of the above will lead to the client retrying)
- `DEADLINE_EXCEEDED` (I think this is basically a timeout)
- `RESOURCE_EXHAUSTED` ("message too big")

I don't have strong opinions, but ideally I think we would have a well-defined set of RPC errors that we expose to clients, interpretable and clear to handle, so we don't lock ourselves into gRPC implementation details. That is out of scope for this PR, though; I was just trying to make the minimal set of changes to keep the downstream code working here :) cc @scv119
Let me know if you think there is a better strategy to handle things for this PR that is still incremental!
I do have a follow-up PR that exposes the error codes used today in the Python code through Cython as well, so we can get rid of more of the grpcio package.
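For concreteness, here is a minimal, hypothetical sketch of the retry behavior described in the list above: `UNAVAILABLE` and `UNKNOWN` are retried, while other codes such as `DEADLINE_EXCEEDED` and `RESOURCE_EXHAUSTED` propagate. The names `RpcError`, `rpc_code`, and `call_with_retry` are illustrative assumptions, not this PR's exact code.

```python
import time

# Numeric gRPC status codes (per the gRPC spec).
GRPC_UNKNOWN = 2
GRPC_DEADLINE_EXCEEDED = 4
GRPC_RESOURCE_EXHAUSTED = 8
GRPC_UNAVAILABLE = 14

RETRYABLE_CODES = {GRPC_UNAVAILABLE, GRPC_UNKNOWN}


class RpcError(Exception):
    """Illustrative stand-in for ray.exceptions.RpcError."""

    def __init__(self, message, rpc_code):
        super().__init__(message)
        self.rpc_code = rpc_code


def call_with_retry(fn, max_attempts=5, backoff_s=1.0):
    """Retry only on UNAVAILABLE/UNKNOWN; surface other codes at once."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RpcError as e:
            last_attempt = attempt == max_attempts - 1
            if e.rpc_code not in RETRYABLE_CODES or last_attempt:
                raise  # e.g. a timeout (DEADLINE_EXCEEDED) surfaces directly
            time.sleep(backoff_s)
```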
In case you are interested, this is what the follow-up PR looks like so far: ec439b6
Gotcha - makes sense to me. So the long-term plan is to have a generic `RpcError` that carries a more specific error code?
This is really for the Ray Core team to decide; if you prefer to do it differently for this PR, let me know. At the Python level, at least for this PR, I think having the `RpcError` makes sense given what we had before, since it minimizes the changes. On the C++ side, either approach can work (i.e. we can have a combined `RpcError` or unwind the errors into different top-level `ray::Status` errors), and I personally don't have a strong preference; I'm happy to change it, and it would be an easy change. (While thinking about this, I found I was misusing `Status::GrpcUnknown` and `Status::GrpcUnavailable` -- I fixed that. It doesn't change the behavior, since they all get converted into a `RaySystemError` on the Python side, which for example happens if your namespace name is not valid.)
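A self-contained sketch of that conversion, with the helper and its signature as assumptions rather than Ray internals: whether the C++ status was `GrpcUnknown` or `GrpcUnavailable`, the Python caller observes the same `RaySystemError`.

```python
class RaySystemError(Exception):
    """Stand-in for ray.exceptions.RaySystemError."""


def raise_for_status(ok: bool, message: str) -> None:
    # All RPC-level failures collapse into one Python-facing exception,
    # e.g. when the namespace name passed to the GCS is not valid.
    if not ok:
        raise RaySystemError(f"System error: {message}")


try:
    raise_for_status(False, "namespace name is not valid")
except RaySystemError as e:
    print(e)  # -> System error: namespace name is not valid
```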
src/ray/gcs/gcs_client/gcs_client.h
Outdated
@@ -184,6 +184,31 @@ class RAY_EXPORT GcsClient : public std::enable_shared_from_this<GcsClient> {
  std::function<void()> resubscribe_func_;
};

class RAY_EXPORT GcsSyncClient {
Should we then name it more explicitly?
explicit GcsSyncClient(const GcsClientOptions &options);
Status Connect();

Status InternalKVGet(const std::string &ns,
Any reason we could not reuse the existing GCS client's implementation and make this Cython client simply a wrapper? I.e., what are the reasons we can no longer use the `InternalKVAccessor` in this class - is this a Cython limitation?
So I actually considered this -- the current interface of `InternalKVAccessor` seemed more appropriate for an asynchronous interface, and the methods needed here were really simple to implement, so I discarded the idea. The synchronous methods that do exist in the `InternalKVAccessor` basically all have the wrong signature (missing timeouts, missing information about the actions taken for updates, or the wrong types, like bool vs. int). If you strongly feel the `InternalKVAccessor` should be reused, I could probably make that happen, but I tried not to modify Ray internals as part of this PR to keep it more incremental and risk-free.
I also first tried to reuse the whole existing C++ GCS client but that is basically impossible since it is deeply coupled with the async implementation.
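To make the signature mismatch concrete, here is a hedged in-memory sketch of the synchronous shape being described: every call carries an explicit timeout, and updates return an int describing what they did instead of a bare bool. Names and defaults are illustrative, not the PR's exact API.

```python
from typing import Dict, Optional


class SyncKVSketch:
    """Illustrative in-memory stand-in for a synchronous KV client."""

    def __init__(self) -> None:
        self._store: Dict[bytes, bytes] = {}

    def internal_kv_get(
        self, key: bytes, timeout: Optional[float] = None
    ) -> Optional[bytes]:
        # A real client would propagate `timeout` as the RPC deadline.
        return self._store.get(key)

    def internal_kv_put(
        self,
        key: bytes,
        value: bytes,
        overwrite: bool = True,
        timeout: Optional[float] = None,
    ) -> int:
        # Return the number of keys newly added (0 if the key already
        # existed), so callers learn what the update actually did.
        added = 0 if key in self._store else 1
        if overwrite or key not in self._store:
            self._store[key] = value
        return added
```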
> If you strongly feel the `InternalKVAccessor` should be reused, I could probably make that happen, but I tried not to modify Ray internals as part of this PR to keep it more incremental and risk-free.
Gotcha - no, I don't have a strong preference for reusing that one. I guess my intuition was just that these two could be unified.
So it sounds like in a future PR we could refactor the internal KV accessor to have the proper signatures, and unify the implementation?
Yeah, that sounds great -- whether this is worth doing also depends on whether this client is going to be used from C++ in the future. If it is, the value of doing the unification is definitely higher :)
@rickyyx I updated the PR description, and I'm also happy to rename the sync GCS client, e.g. to PythonGcsClient :)
LGTM!
Why are these changes needed?
This is with the eventual goal of removing Python gRPC calls from Ray Core / Python workers. As a first cut, I'm removing the Python GcsClient.
This PR introduces a Cython GcsClient that wraps a simple C++ synchronous GCS client. As a result, the code for the GcsClient moves from `ray._private.gcs_utils` to `ray._raylet`. The existing Python level reconnection logic `_auto_reconnect` is reused almost without changes.
This new Cython client can support the full use cases of the old pure Python `GcsClient` and is (almost) a drop-in replacement. To make sure this is indeed the case, this PR also switches over all the uses of the old client and removes the old code.
We also introduce a new exception type `ray.exceptions.RpcError`, which is a replacement for `grpc.RpcError` and allows the Python level code that does exception handling to keep working.
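As a sketch of that migration (the `internal_kv_get` call shown is an assumption about the client surface, not verified against the final API), callers only need to swap the exception type:

```python
from ray.exceptions import RpcError  # introduced by this PR


def get_config(gcs_client, key: bytes):
    try:
        return gcs_client.internal_kv_get(key, namespace=None)
    except RpcError as e:  # previously: except grpc.RpcError
        # Existing handling logic keeps working with the new type.
        print(f"GCS request failed: {e}")
        return None
```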
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.