Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to detect if cache is dirty and should be uploaded even if file was not opened by cloudpathlib #73

Open
pjbull opened this issue Sep 5, 2020 · 4 comments
Labels
caching enhancement New feature or request

Comments

@pjbull
Copy link
Member

pjbull commented Sep 5, 2020

We all want to live in a world where every Python library hands PathLike objects. This is not that world.

Many libraries need a path to a local file—especially as a string—in order to read that file. We should expose a supported and documented way to get a path to the version in the cache.

This came up for me in working with pyvips where I had to do something like this:

s3p = S3Path("s3://...")

# actually do the download
s3p._refresh_cache()

pyvips.read_image(str(s3p._local))

This is not ideal. One option (as noted here: #72 ) is to override __fspath__ to do the caching and return the local path. Then something like this would work:

pyvips.read_image(str(Path(S3Path("s3://...")))

Another option (or in addition) is to add a property like read_only_local_path_string (🤣 at name)

pyvips.read_image(S3Path("s3://...").read_only_local_path_string)

The big caveat is that we replace the .close method on the buffer if you open for write through CloudPath so that we know you intend to change the file:
https://github.com/drivendataorg/cloudpathlib/blob/master/cloudpathlib/cloudpath.py#L322-L339

Anything you do to the local path is pretty much read-only since we won't automatically upload. (At least, not without a big change—for example, overriding __del__ on the CloudPath to do the upload if local is newer, or tracking all of the files DL'd by a Client on the Client and having it check modified times on those and upload. In general I am a little worried about automatically assuming a user wants changed files to be uploaded to overwrite things on the cloud.... )

@jayqi
Copy link
Member

jayqi commented Sep 5, 2020

So I think

pyvips.read_image(str(Path(S3Path("s3://...")))

could be slightly more concisely written as

pyvips.read_image(os.fspath(S3Path("s3://..."))

We could also add a method/property like CloudPath.fspath as a shortcut.

@pjbull
Copy link
Member Author

pjbull commented Sep 12, 2020

I like fspath as the name of the property which returns the local cache path as a str. Let's implement that as part of #72. I think this issue can become just about if we want to support writes to the cache being uploaded to the cloud for users that don't use CloudPath.open

I.e., do we want to implement and support something like:

Anything you do to the local path is pretty much read-only since we won't automatically upload. (At least, not without a big change—for example, overriding del on the CloudPath to do the upload if local is newer, or tracking all of the files DL'd by a Client on the Client and having it check modified times on those and upload. In general I am a little worried about automatically assuming a user wants changed files to be uploaded to overwrite things on the cloud.... )

@pjbull pjbull changed the title Public API for local cache file (e.g., read_only_local_path) Add ability to detect if cache is dirty and should be uploaded even if file was not opened by cloudpathlib Sep 12, 2020
@pjbull pjbull added the enhancement New feature or request label Oct 2, 2020
@chapmanjacobd
Copy link

Is this already supported well? I think this issue can be closed

https://cloudpathlib.drivendata.org/caching/#Handling-conflicts

If people don't want to use the force_overwrite_to_cloud kwarg in .Open then they could do custom programming using the errors:

s3p = S3Path("s3://...")

try:
    pyvips.read_image(s3p.fspath)

except OverwriteNewerCloudError:
    s3p._refresh_cache()
    pyvips.read_image(s3p.fspath)

@pjbull
Copy link
Member Author

pjbull commented Jun 5, 2021

This issue is somewhat mitigated by the work that has gone into #140.

This issue tracks the generic case of:

  • Use CloudPath so that file is downloaded to local cache
  • Local cache gets edited in some way we don't know about--e.g., shell out to a separate process that edits the file.
  • User may expect changes to be re-uploaded to the cloud, but they are not.

We could potentially handle this by tracking modified times and uploading the local file in the CloudPath.__del__ method.

This may be a won't fix since I'm not sure there's a reliable and efficient way to do this with general logic. In that case, we could close this with an update to the docs.

That said, I don't think we actually have this handled ATM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
caching enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants