Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add size property to enable _cp_file #742

Closed
wants to merge 1 commit into from
Closed

Conversation

mukhery
Copy link

@mukhery mukhery commented May 29, 2023

fsspec's generic filesystem _cp_file functionality checks the size of the file before reading it, causing the currently s3 async implementation to fail. This PR calls _info to retrieve and cache the file size when needed. Once this bug in fsspec is fixed (fsspec/filesystem_spec#1281) it should be possible to use the generic filesystem cp functionality with s3fs.

@mukhery
Copy link
Author

mukhery commented Jun 1, 2023

@martindurant in order to get the size here, needed by _cp_file, we have to call the async _info function, which then makes size async

@martindurant
Copy link
Member

How about #745 as an alternative without the async attribute? In that case, the size is only available after the first read, but requires no extra call.

@mukhery
Copy link
Author

mukhery commented Jun 5, 2023

How about #745 as an alternative without the async attribute? In that case, the size is only available after the first read, but requires no extra call.

Wouldn't the workflow of checking size and then transferring data not work? https://github.com/fsspec/filesystem_spec/blob/386a084ffb7f8194265056e19f53ffd252a89e20/fsspec/generic.py#L283
I imagine some like this also enables rsync-like capabilities, where you check size and then transfer only if the size doesn't match.

@martindurant
Copy link
Member

You are right about that ...

However, the rsync idea you mention is already implemented in https://github.com/fsspec/filesystem_spec/blob/386a084ffb7f8194265056e19f53ffd252a89e20/fsspec/generic.py#L36 and includes getting all the info for all the files ahead of time. For s3, calling find once will be much faster that calling info on each file, even if done asynchronously.

@mukhery
Copy link
Author

mukhery commented Jun 5, 2023

You are right about that ...

However, the rsync idea you mention is already implemented in https://github.com/fsspec/filesystem_spec/blob/386a084ffb7f8194265056e19f53ffd252a89e20/fsspec/generic.py#L36 and includes getting all the info for all the files ahead of time. For s3, calling find once will be much faster that calling info on each file, even if done asynchronously.

That makes sense. Looking at the _cp_file code again, maybe just adding self.size = None in the __init__ for #745 will allow it to work as well, because it will call read once and then subsequent loop iterations will have size defined.

@martindurant
Copy link
Member

OK, I did that.

@mukhery
Copy link
Author

mukhery commented Jun 5, 2023

closing in favor of #745

@mukhery mukhery closed this Jun 5, 2023
@mukhery mukhery deleted the add_size branch June 11, 2023 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants