Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zipfile.Path is not Path-like #99818

Closed
buhtz opened this issue Nov 27, 2022 · 16 comments
Closed

zipfile.Path is not Path-like #99818

buhtz opened this issue Nov 27, 2022 · 16 comments

Comments

@buhtz
Copy link

buhtz commented Nov 27, 2022

This is about zipfile.Path. The doc says it is compatible to pathlib.Path. But it seems that is not for 100% because it doesn't derive from pathlib.PurePath and can treated as Path-like in all situations.

I was redirected from pandas where I opened an issue about the fact that pandas.read_excel() does accept path-like objects but not zipfile.Path. It was explained to me that zipfile.Path doesn't implement __fspath__ and that is the problem.

Linked PRs

@ericvsmith
Copy link
Member

As explained in pandas-dev/pandas#49906 (comment), I don't think implementing __fspath__ would help your use case. Basically __fspath__ converts the object to a string. There isn't any string that would make sense to be used by any code (for example, the builtin open()) to operate on a file within a zip file.

@twoertwein
Copy link

It might help to hedge the documentation slightly. Maybe something like:

zipfile.Path implements many methods of pathlib.Path but does not inherit from it nor does zipfile.Path implement os.PathLike.

@buhtz
Copy link
Author

buhtz commented Nov 27, 2022

The question is, from your viewpoint as Python core developers, how could the situation be improved.

As an example csv.reader() can take pathlib.Path and zipfile.Path objects. Works fine without a workaround.

Maybe zipfile.Path is not intended to act exactly like a pathlib.Path object? If this is the case then you should be more clear and direct about that limitations in the documentation.

Or do you see a way that 3rd-party-libs like pandas can improve theire code to also accept zipfile.Path objects without explicit checking for that edge case via isinstance(obj, zipfile.Path)? I don't know why csv.reader() works in that case.

@AlexWaygood
Copy link
Member

The doc says it is compatible to pathlib.Path.

There's some subtlety (probably a little too much subtlety) in the docstring you link to there. The docstring says that zipfile.Path is "pathlib-compatible" (I.e., has a similar interface to pathlib.Path). It doesn't say that ZipFile is "path-like" (and nor do the docs), which is a term that's usually used to mean "an object that conforms to the os.PathLike interface by implementing the __fspath__ method". Somewhat confusingly, a class called zipfile.Path can be "similar to pathlib.Path" without being "path-like".

Cc. @barneygale, for interest :)

@barneygale
Copy link
Contributor

FWIW, I'm working towards making zipfile.Path a subclass of PurePath (actually, a subclass of a new AbstractPath class that sits between PurePath and Path). It would allow libraries like pandas to work with zip paths like this (pseudocode):

def read_excel(path):
    if not isinstance(path, pathlib.AbstractPath):
        path = pathlib.Path(path)
    with path.open('rb') as f:
        ...

Discussion here: https://discuss.python.org/t/make-pathlib-extensible/3428.

@barneygale
Copy link
Contributor

The "pathlib-compatible" bit of the docstring should be changed to "similar to pathlib.Path" IMHO. Saying it's "compatible" is overselling it, particularly as zip paths aren't path-like!

@barneygale
Copy link
Contributor

Apologies for triple-posting, I've been thinking about this :)

There isn't any string that would make sense to be used by any code (for example, the builtin open()) to operate on a file within a zip file.

Why not the string representation of the archive member path? The os.PathLike interface definition doesn't say what you can do with the resulting string! Indeed, some functions like os.path.join() and os.path.dirname() call os.fspath() and then apply purely lexical operations on the result. pathlib.PurePath too follows this logic - you can happily open(PureWindowsPath('C:/blah')) from a non-Windows system.

IMHO this points to an overloading of what "path-like" means, that can be remedied by making it more granular. We could introduce "pure" analogues of __fspath__(), os.fspath() and os.PathLike - something like __purepath__(), os.purepath(), os.PurePathLike.

@ericvsmith
Copy link
Member

My point is just that built-in open() is going to call os.fspath(zip_path). There's nothing that zip_path.__fspath__ could return that will make open() work. So maybe __fspath__ isn't what's needed to solve this problem.

@fancidev
Copy link
Contributor

It is probably more accurate to remove the phrase “pathlib-compatible” from the docs or replace it with something like “pathlib.Path-style” or “pathlib.Path-like”.

@buhtz
Copy link
Author

buhtz commented Nov 29, 2022

I really appreciate that you folks invest so much energy in that topic. Please let me throw another question into the pit.

What was the intention about zipfile.Path. When it is not 100% Path-like than this couldn't be the intention to use a zipfile-path as each other path. But this was how I understood it in the first place.

I describe me use-case here; reading data-files (csv, xlsx) from a path object no matter if it is a regular file (pathlib.Path) or an entry in a zip-archive (zipfile.Path). Are there other use cases you had in mind while designing that zipfile.Path?

@jaraco
Copy link
Member

jaraco commented Feb 5, 2023

When designing zipfile.Path, the main use case was for use in abstracting the results from importlib.metadata and importlib.resources when the packages were present in a zipfile. That is, provide access to paths and files of various resources of a package or a distribution's metadata as files in directories. In particular, for importlib.resources, the Traversable protocol was needed. See #88366 for some discussion around Traversable being PathLike and #96870 exploring why Traversable was created. See also python/importlib_resources#232, where @FFY00 and I are discussing the merits and challenges with honoring pathlike objects in importlib.resources.as_file.

Since zipfile.Path is a Traversable, perhaps all you need to do is use as_file?

from importlib.resources import as_file

def read_excel(traversable):
    with as_file(traversable) as path:
        with path.open('rb') as f:
           ...

Improvements I'd like to see:

  • Document the guarantees and limitations of zipfile.Path, perhaps by introducing Traversable.
  • Use type system to declare that zipfile.Path implements Traversable.
  • Possibly highlight as_file as a means to readily get a resource on disk.

@jaraco
Copy link
Member

jaraco commented Feb 5, 2023

FWIW, I'm working towards making zipfile.Path a subclass of PurePath (actually, a subclass of a new AbstractPath class that sits between PurePath and Path). It would allow libraries like pandas to work with zip paths like this (pseudocode):

def read_excel(path):
    if not isinstance(path, pathlib.AbstractPath):
        path = pathlib.Path(path)
    with path.open('rb') as f:
        ...

Wait - if you're using path.open(), why does it need to be cast to a pathlib.Path? zipfile.Path supports .open without any manipulation. In other words, Pandas should be adapted to accept a traversable, and then it will accept zipfile.Path objects and pathlib.Path objects.

@barneygale
Copy link
Contributor

barneygale commented Feb 5, 2023

Bad example on my part. If you're only interested in the interface Traversable provides then sure, you can do:

def read_excel(path):
    if not isinstance(path, Traversable):
        path = pathlib.Path(path)
    with path.open('rb') as f:
        ...

If you're looking for an interface more like pathlib's (so including stuff like parents, with_suffix(), exists(), write_text(), glob() and walk()) then pathlib.AbstractPath might be a good option if/when it arrives.

@FFY00
Copy link
Member

FFY00 commented Feb 5, 2023

I opened GH-101589 to fix the documentation.

If the Traversable interface is enough, you can just use it, otherwise, as Jason mentioned, you should use as_file to get a pathlib.Path instance.

And again, please note that path-like objects are a completely different thing 😅

miss-islington pushed a commit that referenced this issue Feb 20, 2023
carljm added a commit to carljm/cpython that referenced this issue Feb 20, 2023
* main: (60 commits)
  pythongh-102056: Fix a few bugs in error handling of exception printing code (python#102078)
  pythongh-102011: use sys.exception() instead of sys.exc_info() in docs where possible (python#102012)
  pythongh-101566: Sync with zipp 3.14. (pythonGH-102018)
  pythonGH-99818: improve the documentation for zipfile.Path and Traversable (pythonGH-101589)
  pythongh-88233: zipfile: handle extras after a zip64 extra (pythonGH-96161)
  pythongh-101981: Apply HOMEBREW related environment variables (pythongh-102074)
  pythongh-101907: Stop using `_Py_OPCODE` and `_Py_OPARG` macros (pythonGH-101912)
  pythongh-101819: Adapt _io types to heap types, batch 1 (pythonGH-101949)
  pythongh-101981: Build macOS as recommended by the devguide (pythonGH-102070)
  pythongh-97786: Fix compiler warnings in pytime.c (python#101826)
  pythongh-101578: Amend PyErr_{Set,Get}RaisedException docs (python#101962)
  Misc improvements to the float tutorial (pythonGH-102052)
  pythongh-85417: Clarify behaviour on branch cuts in cmath module (python#102046)
  pythongh-100425: Update tutorial docs related to sum() accuracy (FH-101854)
  Add missing 'is' to `cmath.log()` docstring (python#102049)
  pythongh-100210: Correct the comment link for unescaping HTML (python#100212)
  pythongh-97930: Also include subdirectory in makefile. (python#102030)
  pythongh-99735: Use required=True in argparse subparsers example (python#100927)
  Fix incorrectly documented attribute in csv docs (python#101250)
  pythonGH-84783: Make the slice object hashable (pythonGH-101264)
  ...
carljm added a commit to carljm/cpython that referenced this issue Feb 22, 2023
* main: (225 commits)
  pythongh-102056: Fix a few bugs in error handling of exception printing code (python#102078)
  pythongh-102011: use sys.exception() instead of sys.exc_info() in docs where possible (python#102012)
  pythongh-101566: Sync with zipp 3.14. (pythonGH-102018)
  pythonGH-99818: improve the documentation for zipfile.Path and Traversable (pythonGH-101589)
  pythongh-88233: zipfile: handle extras after a zip64 extra (pythonGH-96161)
  pythongh-101981: Apply HOMEBREW related environment variables (pythongh-102074)
  pythongh-101907: Stop using `_Py_OPCODE` and `_Py_OPARG` macros (pythonGH-101912)
  pythongh-101819: Adapt _io types to heap types, batch 1 (pythonGH-101949)
  pythongh-101981: Build macOS as recommended by the devguide (pythonGH-102070)
  pythongh-97786: Fix compiler warnings in pytime.c (python#101826)
  pythongh-101578: Amend PyErr_{Set,Get}RaisedException docs (python#101962)
  Misc improvements to the float tutorial (pythonGH-102052)
  pythongh-85417: Clarify behaviour on branch cuts in cmath module (python#102046)
  pythongh-100425: Update tutorial docs related to sum() accuracy (FH-101854)
  Add missing 'is' to `cmath.log()` docstring (python#102049)
  pythongh-100210: Correct the comment link for unescaping HTML (python#100212)
  pythongh-97930: Also include subdirectory in makefile. (python#102030)
  pythongh-99735: Use required=True in argparse subparsers example (python#100927)
  Fix incorrectly documented attribute in csv docs (python#101250)
  pythonGH-84783: Make the slice object hashable (pythonGH-101264)
  ...
carljm added a commit to carljm/cpython that referenced this issue Feb 23, 2023
* main: (76 commits)
  Fix syntax error in struct doc example (python#102160)
  pythongh-99108: Import MD5 and SHA1 from HACL* (python#102089)
  pythonGH-101777: `queue.rst`: use 2 spaces after a period to be consistent. (python#102143)
  Few coverage nitpicks for the cmath module (python#102067)
  pythonGH-100982: Restrict `FOR_ITER_RANGE` to a single instruction to allow instrumentation. (pythonGH-101985)
  pythongh-102135: Update turtle docs to rename wikipedia demo to rosette (python#102137)
  pythongh-99942: python.pc on android/cygwin should link to libpython per configure.ac (pythonGH-100356)
  pythongh-95672 fix typo SkitTest to SkipTest (pythongh-102119)
  pythongh-101936: Update the default value of fp from io.StringIO to io.BytesIO (pythongh-102100)
  pythongh-102008: simplify test_except_star by using sys.exception() instead of sys.exc_info() (python#102009)
  pythongh-101903: Remove obsolete undefs for previously removed macros Py_EnterRecursiveCall and Py_LeaveRecursiveCall (python#101923)
  pythongh-100556: Improve clarity of `or` docs (python#100589)
  pythongh-101777: Make `PriorityQueue` docs slightly clearer (python#102026)
  pythongh-101965: Fix usage of Py_EnterRecursiveCall return value in _bisectmodule.c (pythonGH-101966)
  pythongh-101578: Amend exception docs (python#102057)
  pythongh-101961 fileinput.hookcompressed should not set the encoding value for the binary mode (pythongh-102068)
  pythongh-102056: Fix a few bugs in error handling of exception printing code (python#102078)
  pythongh-102011: use sys.exception() instead of sys.exc_info() in docs where possible (python#102012)
  pythongh-101566: Sync with zipp 3.14. (pythonGH-102018)
  pythonGH-99818: improve the documentation for zipfile.Path and Traversable (pythonGH-101589)
  ...
hauntsaninja pushed a commit to hauntsaninja/cpython that referenced this issue Feb 25, 2023
… Traversable (pythonGH-101589)

Automerge-Triggered-By: GH:FFY00
(cherry picked from commit 84181c1)

Co-authored-by: Filipe Laíns <[email protected]>
@hauntsaninja
Copy link
Contributor

I opened backport PRs #102266 and #102267 ; once merged I think we can close this

hauntsaninja added a commit to hauntsaninja/cpython that referenced this issue Feb 25, 2023
FFY00 added a commit that referenced this issue Feb 25, 2023
…rsable (GH-101589) (#102266)

Automerge-Triggered-By: GH:FFY00
(cherry picked from commit 84181c1)

Co-authored-by: Filipe Laíns <[email protected]>
@Stannislav
Copy link

Stumbled upon this discussion while dealing with a similar issue. So, maybe I can contribute a different angle on this.

file = importlib.resources.files("my.module").joinpath("some-file") is the recommended way of loading package resources.

Depending on the way the package is distributed, file can be of type pathlib.Path or zipfile.Path, which have non-compatible interfaces, as discussed in this thread. This inconsistency maybe be a stumbling block for those wishing to use importlib.

python-sidebar pushed a commit to python-sidebar/Python-Documentation-Fork-With-TOC that referenced this issue Sep 1, 2024
python-sidebar pushed a commit to python-sidebar/Python-Documentation-Fork-With-TOC that referenced this issue Sep 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants