Add caching to `download_to` and `upload_from` #292
Comments
Thanks for the comments, @mjkanji. Adding some sort of

For this one, I think we can keep it about changing the

On that point, I don't think I'm yet convinced it's worth it to implement it on the library side. The intent behind
Just curious, why not? A core reason for this library is so you can call
Hi @pjbull! Thank you for your reply!

As for why I didn't want to use

I'm also working with large sets of partitioned Parquet/Arrow datasets. Arrow allows for many clever optimizations (such as only reading the parts of a file that are relevant, in the case of a filter), and I wasn't really sure about how a

(I've since confirmed that I can use a

I also see that there has been some discussion around dealing with large cloud datasets in #264 and #96, and I'm just now starting to learn about the dizzying amount of options for, and minutiae about, file systems! Coming back to caching for
(Please feel free to correct my definition or conception of what a

Then, to my mind, caching, as implemented by this library, is just a specific form of

With all that said, why implement it natively (via an argument that lets you control this behavior)? Because, for one, it's more intuitive. For example, when I read the caching page in your docs, I immediately thought this is the

The same thing for why not explicitly check
Syncing is a separate issue (and there is already an issue for it: #134). For example, we never delete files on sync, and we don't check for and propagate metadata changes.

We'd definitely accept an update to the docs to note that

If we did consider an implementation here, I think, as you point out, it would be a

I'm not particularly inclined to add that complexity, because I think keeping the caching as isolated as possible is important (see #10). Additionally, I think that nearly everything a user wants to do from an upload/download perspective can be done efficiently with the currently supported APIs, so pitfalls there may be documentation issues.
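To illustrate the "we never delete files on sync" point above: a one-way copy only ever adds or overwrites files at the destination, so files removed at the source linger at the destination. The sketch below is a minimal local illustration using only `pathlib` and `shutil`; it is not cloudpathlib code, and `one_way_copy` is a hypothetical name:

```python
import shutil
from pathlib import Path

def one_way_copy(src_dir: Path, dest_dir: Path) -> list[Path]:
    """Copy every file under src_dir into dest_dir, preserving relative paths.

    Files that exist only in dest_dir are left untouched -- nothing is ever
    deleted, which is the 'sync without delete' semantics described above.
    """
    copied = []
    for src in src_dir.rglob("*"):
        if src.is_file():
            dest = dest_dir / src.relative_to(src_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # copy2 also preserves timestamps
            copied.append(dest)
    return copied
```

A full `aws s3 sync`-style tool would additionally need a deletion policy (like the CLI's opt-in `--delete` flag) and a metadata comparison step, which is the extra complexity the comment refers to.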
Is there any reason why `download_to` and `upload_from` do not support the caching mechanism and, instead, download from the source/cloud directly again and again?

The only disadvantage of the `download_to` method using caching is that you have twice the storage use (the same file is saved in the cache and in `local_path`). But, on the other hand, repeatedly downloading the same data if it hasn't changed also seems rather wasteful to me, especially if it's a large folder. Copying to/from the cache seems much better.

Maybe we can add an argument such as `force_download` or `use_caching` to allow users to enable caching, while keeping the current behaviour the default (to avoid unexpectedly using twice the storage space for those who don't want it).

Alternatively, a cleverer approach that sidesteps the 2x storage use is to make `download_to` and `upload_from` obey the same mechanisms that files in the cache obey when deciding whether to re-download a file: check whether something already exists at the given path, compare it with the cloud version to see if it's outdated, and if it isn't, don't download again.

Finally, what's the use case for this? I don't want to use `open` on a `CloudPath` object directly. For one, it's easier for the user to understand that the file is available locally, and they can see it somewhere they define.

Basically, I'm proposing enabling something like the AWS CLI's `s3 sync` command within `cloudpathlib`, since AWS for whatever reason won't add it natively to boto3 itself (someone asked more than 7 years ago now). `cloudpathlib` already has the nuts and bolts for this built in; it's just a matter of enabling the feature.
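The freshness check proposed above could be sketched as follows. This is an illustrative stand-in, not cloudpathlib's actual caching logic: `needs_download` is a hypothetical helper, and it uses file size and modification time as the staleness signals (a real cloud-aware check might compare etags or checksums instead):

```python
from pathlib import Path

def needs_download(local_path: Path, remote_mtime: float, remote_size: int) -> bool:
    """Decide whether a local copy must be (re)downloaded.

    The local copy is considered fresh when it exists, matches the remote
    object's size, and is at least as new as the remote modification time.
    Only then can the transfer be skipped.
    """
    if not local_path.exists():
        return True
    stat = local_path.stat()
    if stat.st_size != remote_size:
        return True  # sizes differ, local copy is stale
    return stat.st_mtime < remote_mtime  # remote is newer than local
```

A caching-aware `download_to` could run a check like this per file and skip the transfer whenever the local copy already matches the remote object's metadata, avoiding both the repeated downloads and the 2x storage cost.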