[Python] Python hangs when use pyarrow.fs.copy_files with "used_threads=True" #32372

asfimport · 2022-07-13T08:21:18Z

When I try to copy a local path to a remote S3 filesystem using pyarrow.fs.copy_files with the default parameter use_threads=True, the system hangs. With use_threads=False the operation completes OK (but more slowly).

My code is:

>>> import pyarrow as pa
>>> import pyarrow.fs  # ensure the pa.fs submodule is loaded
>>> s3fs = pa.fs.S3FileSystem(endpoint_override="http://xxxxxx")
>>> pa.fs.copy_files("tests/data/payments", "bucket/payments", destination_filesystem=s3fs)
... (never returns)

If I check the remote S3 bucket, all files appear, but the function never returns.

Platform: Windows

Reporter: Alejandro Marco Ramos

Note: This issue was originally created as ARROW-17064. Please see the migration documentation for further details.


krfricke · 2023-02-07T16:53:13Z

We experience the same issue in Ray, and it's easily reproducible. The issue comes up when requesting a recursive upload of at least as many files as there are CPUs available.

For instance, on my MacBook with 8 cores, I can upload a folder with 7 files, but not with 8 files:

mkdir -p /tmp/pa-s3
cd /tmp/pa-s3 
for i in {1..7}; do touch $i.txt; done
# This works
python -c "import pyarrow.fs; pyarrow.fs.copy_files('/tmp/pa-s3', 's3://bucket/folder')"
for i in {1..8}; do touch $i.txt; done  
# This hangs forever
python -c "import pyarrow.fs; pyarrow.fs.copy_files('/tmp/pa-s3', 's3://bucket/folder')"

The problem comes up at least with pyarrow 6-11 and can be avoided with use_threads=False, but this obviously harms performance.

EpsilonPrime · 2023-02-07T18:57:19Z

Thanks for the information, that's very useful in trying to address the problem.

EpsilonPrime · 2023-03-22T01:14:59Z

I've created GH-34671, which may solve this issue.

This was referenced Feb 7, 2023

[Tune] Add use_threads=False in pyarrow syncing ray-project/ray#32256

Merged

[Python] pyarrow.fs.copy_files hangs indefinitely #15233

Open

This was referenced Apr 21, 2023

[air] Use filesystem wrapper to exclude files from upload ray-project/ray#34102

Merged

[air] remote_storage: Prefer fsspec filesystems over native pyarrow ray-project/ray#34663

Merged

ericl mentioned this issue Aug 21, 2023

[train] Speed up downloading of large checkpoints ray-project/ray#38695

Merged
