[tune] Chunk file transfers in cross-node checkpoint syncing #23804
Conversation
```python
def _pack_dir(
    source_dir: str, files_stats: Optional[Dict[str, Tuple[float, int]]] = None
) -> io.BytesIO:
    """Pack whole directory contents into a uncompressed tarfile.
```
Any particular reason for not using compression?
nit:
"""Pack whole directory contents into a uncompressed tarfile. | |
"""Pack whole directory contents into an uncompressed tarfile. |
Reopening this question! Would using compression help as an optimization for transferring large directories?
Sorry, I didn't reply to this earlier. In my benchmarks, gzip compression actually added both wallclock time and memory overhead. This will likely depend on the kind of data we're transferring, but my assumption is that we have either large binary data (hard to compress) or small text data (easy to compress), whereas we would get the most benefit from compressing large text data.
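For context, a minimal sketch of the in-memory packing being discussed (the function name and the `compress` flag are illustrative, not the PR's exact code); the compression trade-off is a one-argument change to `tarfile.open`:

```python
import io
import tarfile

def _pack_dir_sketch(source_dir: str, compress: bool = False) -> io.BytesIO:
    # mode="w" writes an uncompressed tar stream; "w:gz" would enable gzip,
    # which in the benchmarks above added wallclock time and memory overhead.
    stream = io.BytesIO()
    with tarfile.open(fileobj=stream, mode="w:gz" if compress else "w") as tar:
        tar.add(source_dir, arcname="")
    stream.seek(0)
    return stream
```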
```python
# Only export once
_remote_get_recursive_files_and_stats = ray.remote(_get_recursive_files_and_stats)
```
I saw this was from the existing code, but for my learning: do you know what the purpose of doing this is?
The reason to do this is to only export the remote function once. If you do this in the public API, you'll export the same method multiple times. Think of this as "caching" the remote wrapper.
(Just FYI, if the same method gets exported too many times, Ray will raise a warning.)
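A sketch of the pattern under discussion (function names are placeholders): wrap once at module level instead of inside the public entry point.

```python
import ray

def _get_stats(path: str) -> dict:
    # Plain function: still directly callable for local (non-remote) use.
    return {"path": path}

# Module level: ray.remote() runs once at import time, so the function is
# exported to the cluster a single time ("caching" the remote wrapper).
_remote_get_stats = ray.remote(_get_stats)

def public_api(path: str) -> dict:
    # Calling ray.remote(_get_stats) here instead would re-export the
    # function on every call, eventually triggering Ray's warning.
    return ray.get(_remote_get_stats.remote(path))
```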
Ah I see! Why not use the decorator pattern?
In a previous iteration we also directly called the non-remote version, but this was removed. I'll update the PR.
Actually, I'd like to keep both methods, as I might be accessing the non-remote method in a follow-up PR. Also, since this is private scope, it should be OK to have both here.
Logic looks good to me!
```python
@ray.remote
class _PackActor:
```
Thoughts on exposing a configurable `max_size` which can be used to warn or raise an exception if surpassed? (Even though this isn't actually exposed to the user today.)
Anecdotally I've appreciated runtime envs immediately erroring out when I accidentally include a 50GB dataset 😄 .
I've added a max size bytes argument defaulting to 1 GB, but it is disabled in the sync client. I agree that if users call this method they should be protected from these transfers (even if this is not a public API right now). In the sync client we mostly sync checkpoints and actually have to support large file sizes, so a default limit doesn't really make sense there.
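Roughly the kind of guard described, as a sketch (the name `_assert_max_size` and the exact error type are assumptions, not the PR's API):

```python
import os

def _assert_max_size(source_dir: str, max_size_bytes: int = 1024 ** 3) -> None:
    # Sum file sizes up front and fail fast, so an accidentally included
    # 50GB dataset is rejected before anything is packed or transferred.
    total = 0
    for root, _, files in os.walk(source_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
            if total > max_size_bytes:
                raise RuntimeError(
                    f"{source_dir} exceeds the maximum transfer size "
                    f"of {max_size_bytes} bytes."
                )
```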
```python
    chunk_size: int = _DEFAULT_CHUNK_SIZE_BYTES,
    _return_all_remotes: bool = False,
) -> Union[ray.ObjectRef, Tuple[ray.ObjectRef, ray.ActorID, ray.ObjectRef]]:
    """Synchronize directory on source node to directory on target node.
```
This is probably from the last PR... Is it assumed that `source_path` and `target_path` must both be directories or both files? They can't be mixed, right?
The function name is `sync_dir_between_nodes`, so it is assumed that both paths are directories.
```python
    target_path: str,
    force_all: bool = False,
    chunk_size: int = _DEFAULT_CHUNK_SIZE_BYTES,
    _return_all_remotes: bool = False,
```
Is it ever used with False?
I've updated the API, and yes, if we end up promoting this to a public API, most use cases will call it with False. I have a follow-up script for this ready.
```python
    chunk_size: int = _DEFAULT_CHUNK_SIZE_BYTES,
    _return_all_remotes: bool = False,
) -> Union[ray.ObjectRef, Tuple[ray.ObjectRef, ray.ActorID, ray.ObjectRef]]:
    """Synchronize directory on source node to directory on target node.
```
Can we make it clearer in the docstring that this is only about kicking off/scheduling the sync, rather than the directory really being synced by the time the function returns?
I've just updated the public functions: they default to blocking calls and only return futures when `return_futures=True`.
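A usage sketch of the updated calling convention (the import path and the positional argument order are assumptions based on this thread, not the final API):

```python
# Assumed location of the PR's utility.
from ray.tune.utils.file_transfer import sync_dir_between_nodes

source_ip, target_ip = "10.0.0.1", "10.0.0.2"  # placeholder node IPs

# Default: blocks until the target directory is fully synced.
sync_dir_between_nodes(source_ip, "/tmp/src", target_ip, "/tmp/dst")

# Opt-in: schedule the sync and receive futures to wait on later.
futures = sync_dir_between_nodes(
    source_ip, "/tmp/src", target_ip, "/tmp/dst", return_futures=True
)
```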
```python
_DEFAULT_CHUNK_SIZE_BYTES = 500 * 1024 * 1024


def sync_dir_between_nodes(
```
Not to be part of this PR, but it would be good to have some benchmark as a reference, like: for this file size, it takes this amount of time to sync between nodes, assuming there is not much variance (so that the benchmark is meaningful).
Also, this opens up the possibility of syncing arbitrarily large files by chunking. It would be nice to have some safety measure to warn if syncing is slow, since everything is pretty much running on one thread.
The calling method has a warning if this is slow. IMO this should also always stay part of the enclosing function.
```python
        except Exception as e:
            logger.error(
                f"Could not delete path {target} on remote node {node_ip}: {e}"
            )

    def wait(self):
```
Since we can sync arbitrarily large files now, we may run into this method more often than before and end up in a blocking situation. We need to think about how to provide visibility if/when this happens.
FWIW, we synced arbitrarily large files before as well, just with rsync and not with remote tasks. And we do warn already, with `warn_if_slow("callbacks.on_trial_save")`, though we may want to think about making this message a bit more insightful.
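For illustration, a generic sketch of the warn-if-slow pattern referenced here (not Ray's actual `warn_if_slow` implementation; the threshold is arbitrary):

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def warn_if_slow_sketch(name: str, threshold_s: float = 0.5):
    # Time the wrapped block and emit a warning when it runs long,
    # e.g. when a cross-node checkpoint sync blocks inside wait().
    start = time.monotonic()
    yield
    duration = time.monotonic() - start
    if duration > threshold_s:
        logger.warning(f"{name} took {duration:.2f} seconds to finish.")
```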
@krfricke CI fails on
Looks pretty good to me. Please do give the other folks a chance to look at the updated version.
```diff
@@ -523,16 +453,18 @@ class RemoteTaskClient(SyncClient):
     will not kill the previous sync command, so it may still be executed.
     """

-    def __init__(self, store_pack_future: bool = False):
+    def __init__(self, _store_remotes: bool = False):
```
Any reason to make this a private argument? `store_remotes` should be good?
It is only needed for testing (so we can access and inspect the futures), so it's nothing a user would usually do or use.
Oh, I see. Can you please make it super clear that nobody should flip this parameter except for testing? Like a docstring, please?
I can add this in the next update (coming soon :-) ). I think it's not urgent, as users never instantiate SyncClients themselves. It's an internal concept and it's instantiated by Ray Tune automatically. So nobody ever calls `SomeSyncClient(..)`.
Co-authored-by: matthewdeng <[email protected]>
Why are these changes needed?
What: This introduces a general utility to synchronize directories between two nodes, derived from the RemoteTaskClient. This implementation uses chunked transfers for more efficient communication.
Why: Transferring files over 2GB in size leads to superlinear time complexity in some setups (e.g. local MacBooks). This could be due to memory limits, swapping, or gRPC limits, and is being explored in a separate thread. To overcome this limitation, we use chunked data transfers, which show quasi-linear scalability for larger files.
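A minimal sketch of the chunking idea (the helper and its interface are illustrative; the `_PackActor` quoted above suggests that in the real implementation chunks are served by an actor rather than a bare generator):

```python
import io
from typing import Iterator

_DEFAULT_CHUNK_SIZE_BYTES = 500 * 1024 * 1024  # 500 MiB, as in the PR

def iter_chunks(
    stream: io.BytesIO, chunk_size: int = _DEFAULT_CHUNK_SIZE_BYTES
) -> Iterator[bytes]:
    # Yield fixed-size chunks so no single transfer approaches the >2GB
    # sizes that showed superlinear scaling in the benchmarks above.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk
```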
Related issue number
Checks
I've run `scripts/format.sh` to lint the changes in this PR.