[Client] chunked get requests #22100

ckw017 · 2022-02-04T00:08:44Z

Why are these changes needed?

Switches GetObject from unary-unary to unary-streaming so that large objects can be streamed across multiple messages (currently hardcoded to 64MiB chunks). This will allow users to retrieve objects larger than 2GiB from a remote cluster. If the transfer is interrupted by a recoverable gRPC error (e.g. a temporary disconnect), the request is retried starting from the first chunk that hasn't been received yet.
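To make the chunking arithmetic concrete, here is a minimal sketch (not the code in this PR). `CHUNK_SIZE` mirrors the 64MiB figure above; `num_chunks` and `split_into_chunks` are hypothetical helper names, and emitting one empty chunk for an empty payload is an assumption made for illustration.

```python
import math

CHUNK_SIZE = 64 * 2**20  # 64MiB, the hardcoded chunk size mentioned above


def num_chunks(total_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunks needed to carry total_size bytes.

    Assumption: an empty payload still produces one (empty) chunk so the
    receiver always gets at least one message.
    """
    return max(1, math.ceil(total_size / chunk_size))


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (chunk_id, chunk_bytes) pairs that cover data in order."""
    for chunk_id in range(num_chunks(len(data), chunk_size)):
        start = chunk_id * chunk_size
        yield chunk_id, data[start:start + chunk_size]
```

With these definitions, a 3GiB object would be split into 48 chunks of 64MiB each, which keeps every individual message well under the 2GiB limit.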

Proto changes

GetRequests now have a start_chunk_id field to indicate which chunk to start from (useful if we have to retry a request after already receiving some chunks). GetResponses now have a chunk_id (0-indexed chunk of the serialized object), total_chunks (total number of chunks, used in async transfers to determine when all chunks have been received), and total_size (the total size of the object in bytes, used to raise user warnings if the object being retrieved is very large).

Server changes

Mainly just updating the GetObject logic to yield chunks instead of returning a single response.
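A rough sketch of what yielding chunks could look like on the server side. This is an illustration only: the real messages are protobufs defined in client.proto, so the `GetResponse` dataclass here is a stand-in, the `data` field and the `stream_object` name are assumptions, and the actual servicer logic may differ.

```python
from dataclasses import dataclass
from typing import Iterator

CHUNK_SIZE = 64 * 2**20  # 64MiB


@dataclass
class GetResponse:
    """Stand-in for the protobuf message (`data` is an assumed field name)."""
    chunk_id: int
    total_chunks: int
    total_size: int
    data: bytes


def stream_object(serialized: bytes, start_chunk_id: int = 0,
                  chunk_size: int = CHUNK_SIZE) -> Iterator[GetResponse]:
    """Yield one GetResponse per chunk, beginning at start_chunk_id."""
    total_size = len(serialized)
    # Ceiling division; an empty object still yields a single empty chunk.
    total_chunks = max(1, -(-total_size // chunk_size))
    for chunk_id in range(start_chunk_id, total_chunks):
        start = chunk_id * chunk_size
        yield GetResponse(
            chunk_id=chunk_id,
            total_chunks=total_chunks,
            total_size=total_size,
            data=serialized[start:start + chunk_size],
        )
```

Because the generator takes `start_chunk_id`, a retried request can skip chunks the client already holds instead of resending the whole object.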

Client changes

At the moment, objects can be retrieved directly from the raylet servicer (ray.get) or asynchronously over the datapath (await some_remote_func.remote()). In both cases, the request will error if the chunk isn't valid (server side error) or if a chunk is received out of order (shouldn't happen in practice, since gRPC guarantees that messages in a stream either arrive in order or not at all).

ray.get is fairly straightforward, and changes are mainly to accommodate yielding from the stub instead of taking the value directly.
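The retry behavior for the synchronous path can be sketched roughly as follows. This is not the PR's implementation: `Chunk`, `TransientError` (standing in for a recoverable `grpc.RpcError`), `get_with_retry`, and `max_retries` are all hypothetical names, and `open_stream` stands in for calling the stub with a given `start_chunk_id`.

```python
from typing import Callable, Iterator, NamedTuple


class Chunk(NamedTuple):
    """Stand-in for a streamed GetResponse."""
    chunk_id: int
    total_chunks: int
    data: bytes


class TransientError(Exception):
    """Stand-in for a recoverable grpc.RpcError (e.g. a temporary disconnect)."""


def get_with_retry(open_stream: Callable[[int], Iterator[Chunk]],
                   max_retries: int = 3) -> bytes:
    """Collect all chunks, resuming after the last received chunk on failure."""
    data = bytearray()
    last_seen_chunk = -1
    for _attempt in range(max_retries + 1):
        try:
            # Ask the server to start from the first chunk we don't have yet.
            for chunk in open_stream(last_seen_chunk + 1):
                if chunk.chunk_id != last_seen_chunk + 1:
                    raise RuntimeError(
                        f"Received chunk {chunk.chunk_id} when we expected "
                        f"{last_seen_chunk + 1}"
                    )
                data.extend(chunk.data)
                last_seen_chunk = chunk.chunk_id
            return bytes(data)
        except TransientError:
            continue  # reconnect and resume from last_seen_chunk + 1
    raise ConnectionError("gave up after repeated transient failures")
```

Tracking `last_seen_chunk` outside the retry loop is what lets a retried request avoid re-downloading chunks that already arrived.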

await some_remote_func.remote() is similar, but to keep things consistent with other async handling, collecting the chunks is handled by a ChunkCollector, which wraps around the original callback.
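A simplified sketch of a collector that wraps a callback is shown below. It is loosely modelled on the behavior described in this PR, not copied from it: the `Response` tuple is a stand-in for the protobuf message, the `data` field name is an assumption, and the real ChunkCollector may handle more cases.

```python
import logging
from typing import Any, Callable, NamedTuple

logger = logging.getLogger(__name__)


class Response(NamedTuple):
    """Stand-in for a GetResponse received over the datapath."""
    chunk_id: int
    total_chunks: int
    data: bytes


class ChunkCollector:
    """Accumulates chunks and invokes the wrapped callback exactly once."""

    def __init__(self, callback: Callable[[Any], None]):
        self.callback = callback
        self.data = bytearray()
        self.last_seen_chunk = -1

    def __call__(self, response: Response) -> bool:
        """Consume one response; return True once the request is finished."""
        chunk_id = response.chunk_id
        if chunk_id == self.last_seen_chunk + 1:
            # The expected next chunk: append it and check for completion.
            self.data.extend(response.data)
            self.last_seen_chunk = chunk_id
            if chunk_id + 1 >= response.total_chunks:
                self.callback(bytes(self.data))
                return True
        elif chunk_id > self.last_seen_chunk + 1:
            # A chunk was skipped: report the error through the callback.
            self.callback(RuntimeError(
                f"Received chunk {chunk_id} when we expected "
                f"{self.last_seen_chunk + 1}"
            ))
            return True
        else:
            # Already-seen chunk (e.g. after a retry): safe to ignore.
            logger.warning("Ignoring repeated chunk %d", chunk_id)
        return False
```

The boolean return value lets the caller know when it can stop routing responses for this request to the collector.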

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

ckw017 · 2022-02-04T22:56:44Z

python/ray/util/client/worker.py

+                    last_seen_chunk = chunk.chunk_id
+                    yield chunk
+                return
+            except grpc.RpcError as e:


Section from here and below is the same as _call_stub

AmeerHajAli · 2022-02-08T14:54:57Z

@mwtian / @iycheng / @ijrsvt , can you please take a look here when you have a chance?

mwtian · 2022-02-08T14:57:52Z

@AmeerHajAli still taking a look.

python/ray/util/client/dataclient.py

ijrsvt

Looks pretty good. If we can find a way to unify the code between ChunkCollector & _call_get_object, that would be amazing (but nothing came to mind when I looked at it).

ijrsvt · 2022-02-08T17:18:19Z

python/ray/util/client/dataclient.py

+        elif chunk_id > self.last_seen_chunk + 1:
+            # A chunk was skipped. This shouldn't happen in practice since
+            # grpc guarantees that chunks will arrive in order.
+            self.callback(
+                RuntimeError(
+                    f"Received chunk {chunk_id} when we expected "
+                    f"{self.last_seen_chunk + 1} for request {response.req_id}"
+                )
+            )
+            return True


Is the else case here fine to ignore?

Should be fine, the else case would be receiving a chunk which we've already seen which is fine to ignore (should already be appended to data)

Maybe log a warning in this case?

python/ray/util/client/server/server.py

python/ray/util/client/worker.py

AmeerHajAli · 2022-02-14T01:52:02Z

@iycheng , @mwtian , can you please take a look?

mwtian

Mostly looks good. I hope we can follow up with simplification of the async and sync get(). Otherwise it will be a maintenance burden later on.

mwtian · 2022-02-14T16:44:44Z

python/ray/util/client/dataclient.py

+        elif chunk_id > self.last_seen_chunk + 1:
+            # A chunk was skipped. This shouldn't happen in practice since
+            # grpc guarantees that chunks will arrive in order.
+            self.callback(
+                RuntimeError(
+                    f"Received chunk {chunk_id} when we expected "
+                    f"{self.last_seen_chunk + 1} for request {response.req_id}"
+                )
+            )
+            return True


Maybe log a warning in this case?

python/ray/util/client/dataclient.py

python/ray/util/client/server/server.py

mwtian · 2022-02-14T17:00:48Z

python/ray/util/client/server/server.py

@@ -377,20 +379,37 @@ def _async_get_object(
            with disable_client_hook():

                def send_get_response(result: Any) -> None:
-                    """Pushes a GetResponse to the main DataPath loop to send
+                    """Pushes GetResponses to the main DataPath loop to send


How much change would it be to consolidate sync and async get()? Maintaining two implementations of get on client and server with chunking and reconnection seems like a burden later on.

I suspect it shouldn't be too bad, opened #22357 to track this

python/ray/util/client/worker.py

python/ray/util/client/dataclient.py

ijrsvt

Overall LGTM, two small comments before merging!

python/ray/util/client/__init__.py

ijrsvt · 2022-02-14T18:50:42Z

python/ray/util/client/dataclient.py

+                    # calls ReleaseObject(). So self.asyncio_waiting_data
+                    # is accessed without holding self.lock. Holding the


Just to clarify, should ReleaseObject not be called while holding a lock?

Yes, I think it's to avoid a problem with object refs being cleaned up while holding locks causing deadlock.
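The deadlock concern can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `Ref` loosely models an object ref whose cleanup needs the same lock (as a ReleaseObject() call might), and `waiting` stands in for a structure like `asyncio_waiting_data`. The example relies on CPython's reference counting running `__del__` as soon as the last reference is dropped.

```python
import threading

lock = threading.Lock()  # non-reentrant, so re-acquiring it would block forever
released = []


class Ref:
    """Hypothetical object ref whose cleanup must take the lock."""

    def __init__(self, name: str):
        self.name = name

    def __del__(self):
        # Stand-in for a release call that needs the lock.
        with lock:
            released.append(self.name)


waiting = {1: Ref("obj-1")}


def finish(req_id: int) -> None:
    with lock:
        # Only dictionary bookkeeping happens while the lock is held.
        ref = waiting.pop(req_id)
    # The last reference is dropped after the lock is released, so __del__
    # can acquire the lock without deadlocking.
    del ref


finish(1)
```

Had the entry been deleted inside the `with lock:` block instead (for example `del waiting[req_id]`), `__del__` would try to take a lock the same thread already holds and hang.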

ckw017 · 2022-02-15T22:05:01Z

@AmeerHajAli ready to merge

This reverts commit 9a7979d.

Reverts #22100 linux://python/ray/tests:test_runtime_env_working_dir_remote_uri becomes very flaky after this PR.

ckw017 added 15 commits January 24, 2022 11:09

update client.proto

576b285

add chunking logic to rayletdriverservicer

f1db5b4

basic example working

bc35ffc

impl for async gets

281da9c

error on out-of-order chunks

16c2b5a

merge master

e4ac773

drop print

fd8ce0a

reconnect mid stream, allow requesting starting chunk for partially r…

f960d29

…eceived objects

cleanup

d1b324e

lint?

290802b

more cleanup

e2214ac

Merge branch 'master' of github.com:ray-project/ray into chunkedgets

5f70d7f

more cleanup

60479f6

add large object warnings, start_chunk logic for async

a91b664

add test for large async gets

4adae0d

ckw017 commented Feb 4, 2022

ckw017 added 2 commits February 4, 2022 14:59

remove print

c4b52e5

more cleanup

5374c40

ckw017 changed the title ~~[wip] chunked get requests~~ [Client] chunked get requests Feb 4, 2022

ckw017 marked this pull request as ready for review February 4, 2022 23:40

ckw017 requested review from AmeerHajAli, ijrsvt and mwtian as code owners February 4, 2022 23:40

ckw017 assigned mwtian Feb 4, 2022

ijrsvt reviewed Feb 8, 2022

merge master

ecf5df8

ijrsvt reviewed Feb 8, 2022

move comment to del

a1ea6aa

ckw017 added 4 commits February 8, 2022 09:44

rename _call_get_object, simpler total_chunks

b800ba6

merge master

75d6952

format

f320731

Merge branch 'master' of github.com:ray-project/ray into chunkedgets

82dec86

ckw017 assigned fishbone Feb 14, 2022

ckw017 requested a review from fishbone February 14, 2022 17:17

mwtian approved these changes Feb 14, 2022

merge master

e02d9c3

ijrsvt approved these changes Feb 14, 2022

ckw017 added 4 commits February 14, 2022 10:52

add total_size asserts

256ed12

log warning on out-of-order chunk

b53e779

chunkcollector comment

5095868

bump protocol version

2d13c34

ckw017 mentioned this pull request Feb 14, 2022

[Enhancement][client] Move synchronous GetObject calls to datapath #22357

Open

2 tasks

fix import

182c506

ijrsvt mentioned this pull request Feb 14, 2022

[client] Chunk PutRequests #22327

Merged

6 tasks

Merge branch 'master' of github.com:ray-project/ray into chunkedgets

6f9cc96

AmeerHajAli merged commit 9a7979d into ray-project:master Feb 15, 2022

fishbone added a commit that referenced this pull request Feb 17, 2022

Revert "[Client] chunked get requests (#22100)"

76142d7

This reverts commit 9a7979d.

fishbone mentioned this pull request Feb 17, 2022

Revert "[Client] chunked get requests" #22455

Merged

fishbone added a commit that referenced this pull request Feb 17, 2022

Revert "[Client] chunked get requests" (#22455)

83257a4

Reverts #22100 linux://python/ray/tests:test_runtime_env_working_dir_remote_uri becomes very flaky after this PR.

simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022

Revert "[Client] chunked get requests" (ray-project#22455)

5232a36

Reverts ray-project#22100 linux://python/ray/tests:test_runtime_env_working_dir_remote_uri becomes very flaky after this PR.

ckw017 mentioned this pull request Apr 25, 2022

ValueError: Message ray.rpc.DataRequest exceeds maximum protobuf size of 2GB #18378

Closed

2 tasks
