[Python] Python hangs when using pyarrow.fs.copy_files with "use_threads=True" #32372

Open
asfimport opened this issue Jul 13, 2022 · 3 comments


@asfimport
Collaborator

When I try to copy a local path to a remote S3 filesystem using pyarrow.fs.copy_files with the default parameter use_threads=True, the call hangs. With use_threads=False the operation completes successfully, but more slowly.

My code is:

>>> import pyarrow as pa
>>> import pyarrow.fs  # make sure the pa.fs submodule is loaded
>>> s3fs = pa.fs.S3FileSystem(endpoint_override="http://xxxxxx")
>>> pa.fs.copy_files("tests/data/payments", "bucket/payments", destination_filesystem=s3fs)
... (never returns)

If I check the remote S3 bucket, all the files are there, but the function never returns.

Platform: Windows

Reporter: Alejandro Marco Ramos

Note: This issue was originally created as ARROW-17064. Please see the migration documentation for further details.

@krfricke
Contributor

krfricke commented Feb 7, 2023

We experience the same issue in Ray, and it's easily reproducible. The issue comes up when requesting a recursive upload of at least as many files as there are CPU cores available.

For instance, on my MacBook with 8 cores, I can upload a folder with 7 files, but not with 8 files:

mkdir -p /tmp/pa-s3
cd /tmp/pa-s3 
for i in {1..7}; do touch $i.txt; done
# This works
python -c "import pyarrow.fs; pyarrow.fs.copy_files('/tmp/pa-s3', 's3://bucket/folder')"
for i in {1..8}; do touch $i.txt; done  
# This hangs forever
python -c "import pyarrow.fs; pyarrow.fs.copy_files('/tmp/pa-s3', 's3://bucket/folder')"
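
When reproducing the hang above, it can help to run the copy in a child process with a timeout so that the shell is not blocked indefinitely. The following is only a sketch: the helper name `completes_within` is hypothetical and not part of pyarrow, and it relies solely on the standard library's `subprocess.run` timeout handling.

```python
import subprocess
import sys


def completes_within(code: str, timeout_s: float) -> bool:
    """Run `code` in a fresh Python process.

    Returns True if the process exits (with any exit status) before
    `timeout_s` seconds have passed, and False if it had to be killed.
    """
    try:
        subprocess.run([sys.executable, "-c", code], timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so no hung
        # process is left behind.
        return False
    return True
```

For example, `completes_within("import pyarrow.fs; pyarrow.fs.copy_files('/tmp/pa-s3', 's3://bucket/folder')", 120)` would be expected to return False when the hang described here occurs. The bucket name and the 120-second limit are placeholders. Note that a True result only means the process exited, not that the copy succeeded.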

The problem comes up with at least pyarrow versions 6 through 11. It can be avoided with use_threads=False, but this obviously harms performance.
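
Until a fix lands, one possible stopgap is to disable threading only when the file count reaches the CPU count. This is a rough sketch built on the single observation above (7 files work, 8 hang on an 8-core machine); the threshold is an assumption, not confirmed pyarrow behaviour, and the helper names `should_use_threads`, `count_files`, and `copy_dir` are hypothetical.

```python
import os
from typing import Optional


def should_use_threads(n_files: int, n_cpus: Optional[int] = None) -> bool:
    """Heuristic: keep threads only when there are fewer files than CPUs."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    return n_files < n_cpus


def count_files(root: str) -> int:
    """Count regular files under `root`, recursively."""
    return sum(len(files) for _, _, files in os.walk(root))


def copy_dir(source: str, destination: str, destination_filesystem=None) -> None:
    """Copy a local directory, falling back to use_threads=False when needed."""
    import pyarrow.fs  # imported lazily so the helpers above work without pyarrow

    pyarrow.fs.copy_files(
        source,
        destination,
        destination_filesystem=destination_filesystem,
        use_threads=should_use_threads(count_files(source)),
    )
```

Passing use_threads=False unconditionally is the safer choice if the extra copy time is acceptable, since the exact trigger condition has not been confirmed in this thread.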

@EpsilonPrime
Contributor

Thanks for the information, that's very useful in trying to address the problem.

@EpsilonPrime
Contributor

I've created GH-34671, which may resolve this issue.
