add size property to enable _cp_file #742

mukhery · 2023-05-29T17:06:25Z

fsspec's generic filesystem _cp_file functionality checks the size of the file before reading it, causing the current s3 async implementation to fail. This PR calls _info to retrieve and cache the file size when needed. Once this bug in fsspec is fixed (fsspec/filesystem_spec#1281), it should be possible to use the generic filesystem cp functionality with s3fs.
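For illustration, here is a minimal sketch of the "look up once, then cache" idea described above. It is not the code from this PR: FakeAsyncFS and StreamedFile are hypothetical stand-ins, and the only thing assumed about the real filesystem is that it exposes an async _info returning a dict with a "size" key.

```python
import asyncio


class FakeAsyncFS:
    """Hypothetical stand-in for an async filesystem; not s3fs itself."""

    def __init__(self, sizes):
        self._sizes = sizes
        self.info_calls = 0  # counts how often _info is hit

    async def _info(self, path):
        self.info_calls += 1
        return {"name": path, "size": self._sizes[path], "type": "file"}


class StreamedFile:
    """Hypothetical streamed-file wrapper that looks up its size lazily."""

    def __init__(self, fs, path):
        self.fs = fs
        self.path = path
        self._size = None  # unknown until first requested

    async def size(self):
        # Call _info only on the first request, then reuse the cached value.
        if self._size is None:
            info = await self.fs._info(self.path)
            self._size = info["size"]
        return self._size


async def demo():
    fs = FakeAsyncFS({"bucket/key": 1024})
    f = StreamedFile(fs, "bucket/key")
    first = await f.size()
    second = await f.size()
    return first, second, fs.info_calls


result = asyncio.run(demo())
```

Because size has to be awaited in this sketch, any caller that reads it as a plain attribute would need to change as well.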

mukhery · 2023-06-01T13:24:09Z

@martindurant, to get the size here, which _cp_file needs, we have to call the async _info function, which in turn makes size async.

martindurant · 2023-06-05T13:42:53Z

How about #745 as an alternative without the async attribute? In that case, the size is only available after the first read, but requires no extra call.

mukhery · 2023-06-05T14:01:22Z

> How about #745 as an alternative without the async attribute? In that case, the size is only available after the first read, but requires no extra call.

Wouldn't that break the workflow of checking the size and then transferring data? https://github.com/fsspec/filesystem_spec/blob/386a084ffb7f8194265056e19f53ffd252a89e20/fsspec/generic.py#L283
I imagine something like this also enables rsync-like capabilities, where you check the size and then transfer only if the size doesn't match.

martindurant · 2023-06-05T14:06:47Z

You are right about that ...

However, the rsync idea you mention is already implemented in https://github.com/fsspec/filesystem_spec/blob/386a084ffb7f8194265056e19f53ffd252a89e20/fsspec/generic.py#L36 and includes getting all the info for all the files ahead of time. For s3, calling find once will be much faster than calling info on each file, even if done asynchronously.
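A rough sketch of the size-comparison step, assuming both sides have already been listed up front (for example with a single find call that returns details for every file). The helper files_to_copy and the sample listings are hypothetical and are not taken from fsspec's rsync implementation.

```python
def files_to_copy(src_listing, dst_listing):
    """Return the paths that are missing on the destination or differ in size.

    Both arguments map a relative path to an info dict with a "size" key,
    the shape assumed here for a listing gathered in one bulk call.
    """
    to_copy = []
    for path, info in src_listing.items():
        other = dst_listing.get(path)
        if other is None or other["size"] != info["size"]:
            to_copy.append(path)
    return sorted(to_copy)


# Hypothetical listings: one file matches, one differs, one is missing.
src = {"a.txt": {"size": 10}, "b.txt": {"size": 20}, "c.txt": {"size": 30}}
dst = {"a.txt": {"size": 10}, "b.txt": {"size": 25}}
pending = files_to_copy(src, dst)
```

With the listings in hand, deciding what to transfer needs no further per-file requests, which is the point being made about find versus info.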

mukhery · 2023-06-05T14:21:47Z

> You are right about that ...
>
> However, the rsync idea you mention is already implemented in https://github.com/fsspec/filesystem_spec/blob/386a084ffb7f8194265056e19f53ffd252a89e20/fsspec/generic.py#L36 and includes getting all the info for all the files ahead of time. For s3, calling find once will be much faster than calling info on each file, even if done asynchronously.

That makes sense. Looking at the _cp_file code again, maybe just adding self.size = None in the __init__ for #745 will allow it to work as well, because it will call read once and then subsequent loop iterations will have size defined.
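To make that reasoning concrete, here is a small sketch under stated assumptions: LazySizeStream is a hypothetical reader whose size starts as None and is filled in by the first read, and the loop condition in copy is an assumed shape for a size-aware copy loop, not a copy of fsspec's _cp_file.

```python
import asyncio
import io


class LazySizeStream:
    """Hypothetical async reader whose size is unknown until the first read."""

    def __init__(self, payload):
        self._buf = io.BytesIO(payload)
        self._total = len(payload)
        self.size = None  # defined up front so attribute access never fails

    async def read(self, n):
        # Pretend the total size arrives together with the first chunk.
        self.size = self._total
        return self._buf.read(n)


async def copy(src, dst, blocksize=4):
    # Assumed loop shape: keep reading while the size is unknown
    # or the destination has not yet reached it.
    while src.size is None or dst.tell() < src.size:
        data = await src.read(blocksize)
        if not data:
            break
        dst.write(data)
    return dst.tell()


out = io.BytesIO()
copied = asyncio.run(copy(LazySizeStream(b"hello world"), out))
```

The first iteration runs because size is None; every later iteration compares against the size that the first read filled in.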

martindurant · 2023-06-05T15:08:00Z

OK, I did that.

mukhery · 2023-06-05T15:25:48Z

closing in favor of #745

add size property to enable _cp_file

4161cf5

mukhery mentioned this pull request Jun 1, 2023

bugfix: await on size, assuming it can be an async function fsspec/filesystem_spec#1281

Open

mukhery closed this Jun 5, 2023

mukhery deleted the add_size branch June 11, 2023 21:25
