[air] Use filesystem wrapper to exclude files from upload #34102
Signed-off-by: Kai Fricke <[email protected]>
I'll run the restore example with these wheels to check if it improves performance.
It does not seem to significantly impact the upload time. I still think this is generally a cleaner version to go for, but I'll look more into this.
I ran a few benchmarks to compare different versions of our code.
Data: 5 files of 2 MB each in one folder
Compared methods:
Results:
The TL;DR of these results is that threading gives an (obvious) advantage. However, due to apache/arrow#32372, we currently can't enable threading by default. But, assuming apache/arrow#32372 is fixed at some point, our current [...] With the patch from this PR, we can regain the threading benefit once the pyarrow issue is resolved, as shown in the good performance of [...] Thus, we should aim to land this PR. I'll work on a workaround for the pyarrow issue separately.
Looks good! Could we also add a unit test that tests `upload_to_uri` with and without `fsspec`? Maybe you can monkeypatch `fsspec` to `None` to switch to default.
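A minimal sketch of the monkeypatching idea from the comment above, using only the standard library. The `syncer` namespace and `pick_upload_path` are hypothetical stand-ins for the module under test and its code-path selection; they are not Ray's actual names or logic.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for a module that holds an optional `fsspec` import.
# In a real module this would be the imported package, or None on ImportError.
syncer = SimpleNamespace(fsspec=object())


def pick_upload_path(module) -> str:
    """Return which code path would run, based on whether fsspec is available."""
    return "fsspec" if module.fsspec is not None else "default"


def run_both_paths(module) -> list:
    """Exercise the fsspec path, then the default path with fsspec patched out."""
    results = [pick_upload_path(module)]
    # Temporarily replace the attribute with None; it is restored on exit.
    with mock.patch.object(module, "fsspec", None):
        results.append(pick_upload_path(module))
    return results
```

With pytest, the `monkeypatch.setattr` fixture would play the same role as `mock.patch.object` here.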
fsspec_msg = (
    "If your data is small, try installing fsspec "
    "(`pip install fsspec`) for more efficient local file parsing. "
)
Should `fsspec` always be recommended, or is it only beneficial for many, small files? Does it perform worse for few, large files?
self._slow_sync_threshold = float(
    os.environ.get(
        "TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S", "30"
I am wondering if we should always warn users if they don't have `fsspec`, rather than only when their experiment syncing is taking a while? The threshold might mask the problem, and the user would never know to do something.
Will there be a separate warning that checks for cloud storage being used without `fsspec`?
Yes! I have added the warnings in the follow-up PR: #34663
Signed-off-by: Kai Fricke <[email protected]>
Added a test, ptal!
Thanks for clarifying! One nit on another test case. Also, what's the main difference between the 2 unit tests that were added?
tmp_source, tmp_target = temp_data_dirs

upload_to_uri(tmp_source, tmp_target, exclude=["*_exclude.txt", "*_exclude/*"])
Nit: Should we include a test case for an exclude pattern like `"exclude/"`? I remember there is some special logic for the trailing `/`.
The tests were copied from existing tests that test the same behavior with an in-memory filesystem. The main difference is that in the second case, multiple patterns are tested (not just one).
The `exclude/` pattern won't work as it will not match the files contained within the directory - it always has to be `exclude/*`.
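The difference between `exclude/` and `exclude/*` can be illustrated with a small sketch. It assumes the patterns are matched fnmatch-style against file paths, which is an assumption made for illustration; `is_excluded` is a hypothetical helper, not a function from the PR.

```python
import fnmatch
from typing import List


def is_excluded(rel_path: str, patterns: List[str]) -> bool:
    """Return True if rel_path matches any of the fnmatch-style patterns."""
    return any(fnmatch.fnmatch(rel_path, pattern) for pattern in patterns)


# A file inside the directory does not match the bare directory pattern ...
file_in_dir = "exclude/file.txt"
matches_bare_dir = is_excluded(file_in_dir, ["exclude/"])
# ... but it does match once the pattern ends in a wildcard.
matches_wildcard = is_excluded(file_in_dir, ["exclude/*"])
```

Note that in `fnmatch`, `*` also matches path separators, so `exclude/*` covers files in nested subdirectories as well.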
…34663) Following up from #34102, our benchmarks showed how inferior unthreaded syncing is to threaded syncing. However, due to pyarrow, we currently can't use threading. Since the bug affects all pyarrow versions <=11, which will likely be used by some users in the future, we have to look into workarounds. One such workaround is to use the fsspec-provided filesystem and prefer it over the native pyarrow fs. Signed-off-by: Kai Fricke <[email protected]>
…y-project#34102)" This reverts commit 8728c77.
…t#34102) Ray Tune uploads experiment state using pyarrow. When cloud checkpointing is configured, the driver will exclude any trial-level checkpoints. Pyarrow does not natively support file exclusion, though - instead, we repeatedly call `pyarrow.fs.copy_files` on single non-excluded files. This seems to be inefficient as the connection to the remote filesystem is opened and closed repeatedly. It also means we can never leverage multi-threaded upload. This PR implements a custom fsspec-based local filesystem that excludes files on the selector level. Thus, we can call pyarrow.fs.copy_files exactly once, with a selector that does not see the excluded files. Edit: [See here for benchmark results](ray-project#34102 (comment)) Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: Jack He <[email protected]>
…ay-project#34663) Following up from ray-project#34102, our benchmarks showed how inferior unthreaded syncing is to threaded syncing. However, due to pyarrow, we currently can't use threading. Since the bug affects all pyarrow versions <=11, which will likely be used by some users in the future, we have to look into workarounds. One such workaround is to use the fsspec-provided filesystem and prefer it over the native pyarrow fs. Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: Jack He <[email protected]>
Why are these changes needed?
Ray Tune uploads experiment state using pyarrow. When cloud checkpointing is configured, the driver will exclude any trial-level checkpoints. Pyarrow does not natively support file exclusion, though - instead, we repeatedly call `pyarrow.fs.copy_files` on single non-excluded files. This seems to be inefficient as the connection to the remote filesystem is opened and closed repeatedly. It also means we can never leverage multi-threaded upload. This PR implements a custom fsspec-based local filesystem that excludes files on the selector level. Thus, we can call `pyarrow.fs.copy_files` exactly once, with a selector that does not see the excluded files.
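The idea of excluding files at the listing level can be sketched with the standard library alone. This is not the PR's implementation: `find_visible_files` is a hypothetical helper that stands in for a filesystem's file listing, and fnmatch-style matching on relative paths is an assumption.

```python
import fnmatch
import os
from typing import List


def find_visible_files(base_dir: str, exclude: List[str]) -> List[str]:
    """List files under base_dir, hiding those that match an exclude pattern.

    Paths are returned relative to base_dir with forward slashes.
    """
    visible = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), base_dir)
            rel = rel.replace(os.sep, "/")
            # Excluded files never appear in the listing at all.
            if not any(fnmatch.fnmatch(rel, pattern) for pattern in exclude):
                visible.append(rel)
    return sorted(visible)
```

Because the filtering happens once in the listing, a single copy operation over the returned set never touches the excluded files, instead of issuing one copy call per non-excluded file.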
Edit: See here for benchmark results
Related issue number