Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air] Use filesystem wrapper to exclude files from upload #34102

Merged
merged 5 commits into from
Apr 25, 2023

Conversation

krfricke
Copy link
Contributor

@krfricke krfricke commented Apr 5, 2023

Why are these changes needed?

Ray Tune uploads experiment state using pyarrow. When cloud checkpointing is configured, the driver will exclude any trial-level checkpoints. Pyarrow does not natively support file exclusion, though - instead, we repeatedly call pyarrow.fs.copy_files on single non-excluded files.

This seems to be inefficient as the connection to the remote filesystem is opened and closed repeatedly. It also means we can never leverage multi-threaded upload. This PR implements a custom fsspec-based local filesystem that excludes files on the selector level. Thus, we can call pyarrow.fs.copy_files exactly once, with a selector that does not see the excluded files.

Edit: See here for benchmark results

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@krfricke
Copy link
Contributor Author

krfricke commented Apr 5, 2023

I'll run the restore example with these wheels to check if it improves performance

@krfricke
Copy link
Contributor Author

krfricke commented Apr 5, 2023

It does not seem to significantly impact the upload time. I still think this is generally a cleaner version to go for, but I'll look more into this.
Converting to draft for now pending further investigation.

@krfricke krfricke marked this pull request as draft April 5, 2023 17:46
@krfricke
Copy link
Contributor Author

I ran a few benchmarks to compare different versions of our code.

Data: 5 files a 2MB in one folder

Compared methods:

  • awscli_sync: Direct usage of aws s3 sync
  • pyarrow_threads: Direct usage of pyarrow.fs.copy_files(source, target, use_threads=False)
  • pyarrow_no_threads: Direct usage of pyarrow.fs.copy_files(source, target, use_threads=True)
  • air_default: Usage of AIR's upload_to_uri(source, target) utility. Defaults to no-threads for S3
  • air_exclude_threads: upload_to_uri(source, target, exclude=something) and with threading enabled
  • air_exclude_no_threads: upload_to_uri(source, target, exclude=something) and with threading disabled (default)
  • air_patched_threads: Version of this PR with threading enabled
  • air_patched_no_threads: Version of this PR with threading disabled.

Results:

Results:
    awscli_sync:         0.76 (0.01) seconds
    pyarrow_threads:     0.69 (0.04) seconds
    pyarrow_no_threads:  1.19 (0.07) seconds
    air_default:         1.19 (0.08) seconds
    air_exclude_threads:         1.43 (0.02) seconds
    air_exclude_no_threads:      1.32 (0.03) seconds
    air_patched_threads:         0.53 (0.01) seconds
    air_patched_no_threads:      1.23 (0.08) seconds

The TLDR of these results is that threading gives an (obvious) advantage.

However, due to apache/arrow#32372, we currently can't enable threading per default.

But, assuming apache/arrow#32372 is fixed at some point, our current exclude logic can't leverage threading, as we call pyarrow.fs.copy_files multiple times, sequentially. Thus we see that air_exclude_threads has the same bad performance as pyarrow_no_threads or air_exclude_no_threads

With the patch from this PR, we can regain the threading benefit, once the pyarrow issue is resolved, as shown in the good performance of air_patched_threads.

Thus, we should aim to land this PR.

I'll work on a workaround for the pyarrow issue separately.

@krfricke krfricke marked this pull request as ready for review April 21, 2023 13:35
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Could we also add a unit test that tests upload_to_uri with and without fsspec? Maybe you can monkeypatch fsspec to None to switch to default.

Comment on lines +290 to +293
fsspec_msg = (
"If your data is small, try installing fsspec "
"(`pip install fsspec`) for more efficient local file parsing. "
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should ffspec always be recommended, or is it only beneficial for many, small files? Does it perform worse for few, large files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial benchmarks suggested that it always performed better. After #34663 any implementation efficiencies will be hugely shadowed by being able to leverage multi threaded uploads, so we should always recommend fsspec (we'll be printing warnings in #34663 if it's not installed).


self._slow_sync_threshold = float(
os.environ.get(
"TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S", "30"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we should always warn users if they don't have fsspec, rather than just if their exp syncing is taking a while? The threshold might mask the problems and the user would never know to do something.

Will there be a separate warning checking if using any cloud storage + no fsspec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I have added the warnings in the follow-up PR: #34663

@krfricke
Copy link
Contributor Author

Added a test, ptal!

@krfricke krfricke requested a review from justinvyu April 24, 2023 10:56
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying! One nit on another test case. Also, what's the main difference between the 2 unit tests that were added?


tmp_source, tmp_target = temp_data_dirs

upload_to_uri(tmp_source, tmp_target, exclude=["*_exclude.txt", "*_exclude/*"])
Copy link
Contributor

@justinvyu justinvyu Apr 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Should we include a test case for an exclude pattern like "exclude/"? I remember there is some special logic for the trailing /.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests were copied from existing tests that test the same behavior with an in-memory filesystem. The main difference is that in the second case, multiple patterns are tested (not just one).

The exclude/ pattern won't work as it will not match the files contained within the directory - it always has to be exclude/*

@krfricke krfricke merged commit 8728c77 into ray-project:master Apr 25, 2023
@krfricke krfricke deleted the air/remote-storage-exclude branch April 25, 2023 12:08
krfricke added a commit that referenced this pull request Apr 27, 2023
…34663)

Following up from #34102, our benchmarks showed how inferior unthreaded syncing is over threaded syncing.

However, due to pyarrow, we currently can't use threading.

Since the bug affects all pyarrow versions <=11, which will likely be used by some users in the future, we have to look into workarounds. One such workaround is to use the fsspec-provided filesystem and prefer it over the native pyarrow fs.

Signed-off-by: Kai Fricke <[email protected]>
krfricke pushed a commit to krfricke/ray that referenced this pull request Apr 28, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…t#34102)

Ray Tune uploads experiment state using pyarrow. When cloud checkpointing is configured, the driver will exclude any trial-level checkpoints. Pyarrow does not natively support file exclusion, though - instead, we repeatedly call `pyarrow.fs.copy_files` on single non-excluded files.

This seems to be inefficient as the connection to the remote filesystem is opened and closed repeatedly. It also means we can never leverage multi-threaded upload. This PR implements a custom fsspec-based local filesystem that excludes files on the selector level. Thus, we can call pyarrow.fs.copy_files exactly once, with a selector that does not see the excluded files.

Edit: [See here for benchmark results](ray-project#34102 (comment))

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Jack He <[email protected]>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…ay-project#34663)

Following up from ray-project#34102, our benchmarks showed how inferior unthreaded syncing is over threaded syncing.

However, due to pyarrow, we currently can't use threading.

Since the bug affects all pyarrow versions <=11, which will likely be used by some users in the future, we have to look into workarounds. One such workaround is to use the fsspec-provided filesystem and prefer it over the native pyarrow fs.

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Jack He <[email protected]>
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
…t#34102)

Ray Tune uploads experiment state using pyarrow. When cloud checkpointing is configured, the driver will exclude any trial-level checkpoints. Pyarrow does not natively support file exclusion, though - instead, we repeatedly call `pyarrow.fs.copy_files` on single non-excluded files.

This seems to be inefficient as the connection to the remote filesystem is opened and closed repeatedly. It also means we can never leverage multi-threaded upload. This PR implements a custom fsspec-based local filesystem that excludes files on the selector level. Thus, we can call pyarrow.fs.copy_files exactly once, with a selector that does not see the excluded files. 

Edit: [See here for benchmark results](ray-project#34102 (comment))

Signed-off-by: Kai Fricke <[email protected]>
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
…ay-project#34663)

Following up from ray-project#34102, our benchmarks showed how inferior unthreaded syncing is over threaded syncing.

However, due to pyarrow, we currently can't use threading.

Since the bug affects all pyarrow versions <=11, which will likely be used by some users in the future, we have to look into workarounds. One such workaround is to use the fsspec-provided filesystem and prefer it over the native pyarrow fs.

Signed-off-by: Kai Fricke <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants