Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add C implementation of os.path.splitroot() #102511

Closed
barneygale opened this issue Mar 7, 2023 · 14 comments
Closed

Add C implementation of os.path.splitroot() #102511

barneygale opened this issue Mar 7, 2023 · 14 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage type-feature A feature request or enhancement

Comments

@barneygale
Copy link
Contributor

barneygale commented Mar 7, 2023

Feature or enhancement

Speed up os.path.splitroot() by implementing it in C.

Pitch

Per @eryksun:

I think splitroot() warrants a C implementation since it's a required step in our basic path handling on Windows -- join(), split(), relpath(), and commonpath(). Speeding it up gives a little boost across the board. Also, it would be less confusing if nt._path_splitroot() actually implemented ntpath.splitroot().

If implemented, using _Py_splitroot() in _Py_normpath() would help to ensure consistency. Currently, for example, ntpath.normpath('//?/UNC/server/share/../..') is correct on POSIX but wrong on Windows because _Py_normpath() incorrectly handles "//?/UNC/" as the root instead of "//?/UNC/server/share/".

Previous discussion

Linked PRs

@barneygale barneygale added the type-feature A feature request or enhancement label Mar 7, 2023
@barneygale barneygale added the performance Performance or resource usage label Mar 7, 2023
@zooba
Copy link
Member

zooba commented Mar 10, 2023

I'd recommend making this a skiproot API rather than split, and just have it return the first index (or a pointer) after the root. This is very easy to convert into a split, but it also composes better with other APIs (e.g. _Py_normpath needs the index and not the text).

@eryksun
Copy link
Contributor

eryksun commented Mar 10, 2023

Sounds good. We can use _Py_skiproot() in os__path_splitroot_impl() in "Modules/posixmodule.c".

Note that existing use of nt._path_splitroot() in importlib will have to be updated to handle the 3-way split (drive, root, rest) instead of (anchor, rest).

@barneygale
Copy link
Contributor Author

I'm going to attempt to implement this, though my C is rather basic so it will take me some time.

Rough plan:

  1. Make nt._path_splitroot() work like ntpath.splitroot(), i.e. return a 3-tuple. Adjust importlib. Use it in ntpath where available.
  2. Add posix._path_splitroot(). Use it in posixpath where available.
  3. Add _Py_skiproot(). Call it from _Py_normpath() and _path_splitroot()

Let me know if that doesn't sound right!

@eryksun
Copy link
Contributor

eryksun commented Mar 13, 2023

Here's an overview of what I did to experiment with this idea. I started by adding a new _Py_skiproot() internal API function in "Python/fileutils.c":

PyAPI_FUNC(wchar_t *) _Py_skiproot(wchar_t *path, Py_ssize_t size,
                                   wchar_t **root);

I moved the existing root-skipping implementation from _Py_normpath() into _Py_skiproot() and modified it to implement returning rest and root. I also added support for extended "UNC" paths. I updated _Py_normpath() to use _Py_skiproot(). Once test_ntpath and test_posixpath were successful, I moved on to implement os._path_splitroot_ex in "Modules/posixmodule.c", which I used to implement ntpath.splitroot() and posixpath.splitroot(). I again verified that test_ntpath and test_posixpath were successful.

I retained the old os._path_splitroot implementation that's used by importlib. I plan to modify importlib to use the new implementation, but only if benchmarking demonstrates that making splitroot() a builtin function is worthwhile in general. If so, I'll remove the old implementation and rename the new one to os._path_splitroot.

My initial comparison shows that, when splitting "//server/share/spam/eggs", the new builtin _path_splitroot_ex() is about 4 times faster than the fallback implementation on Windows, and it's about 1.7 times faster than the fallback implementation on Linux.

@barneygale
Copy link
Contributor Author

My C isn't good enough for this, so someone else please feel free to take a stab at it!

@zooba
Copy link
Member

zooba commented May 22, 2024

Reopening and assigning to myself because I'm holding up the backport until we have confidence in the change. My note from the PR:

I want to hold this for a bit until we're confident in the change in later versions. Virtually every time we touch this kind of code, we introduce new failures or vulnerabilities, so let's keep it restricted to prerelease versions for now.

I'd say after 3.13.0b3 is released we'll reconsider, but that's because I'm hopeful we'll get a decent amount of people trying out those betas. If it seems like nobody is using them, we'll push this further.

@zooba zooba reopened this May 22, 2024
zooba pushed a commit that referenced this issue Jun 3, 2024
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 3, 2024
(cherry picked from commit 4765e1f)

Co-authored-by: Nice Zombies <[email protected]>
zooba pushed a commit that referenced this issue Jun 3, 2024
(cherry picked from commit 4765e1f)

Co-authored-by: Nice Zombies <[email protected]>
mliezun pushed a commit to mliezun/cpython that referenced this issue Jun 3, 2024
barneygale pushed a commit to barneygale/cpython that referenced this issue Jun 5, 2024
@pradyunsg
Copy link
Member

pradyunsg commented Jun 12, 2024

FWIW, we're seeing this affect pip's test suite starting 3.13.0b2 -- based on the changelog, #118263 was the thing that switched stuff over to using the C implementation, but I'm making a comment here since it seems relevant to the C implementation being adopted overall (IIUC).

The specific behaviour change we're seeing is that UNC paths are normalised differently.

def path_to_url(path: str) -> str:
    """
    Convert a path to a file: URL.  The path will be made absolute and have
    quoted path parts.
    """
    path = os.path.normpath(os.path.abspath(path))
    url = urllib.parse.urljoin("file:", urllib.request.pathname2url(path))
    return url

On Windows with 3.13.0b2, we're seeing failures like the following in CI:

 >       assert path_to_url(r"\\unc\as\path") == "file://unc/as/path"
E       AssertionError: assert 'file:////unc/as/path' == 'file://unc/as/path'
E         
E         - file://unc/as/path
E         + file:////unc/as/path
E         ?      ++

This passes on 3.13.0b1, at least based on the last run yesterday (I don't have a windows dev machine handy). This might be a difference in behaviour between the Python implementation and C implementation.

@nineteendo
Copy link
Contributor

This pull request wasn't back ported to 3.12, but the bug also occurs there:

Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from os.path import abspath, normpath
>>> from urllib.parse import urljoin
>>> from urllib.request import pathname2url
>>> path_to_url = lambda path: urljoin("file:", pathname2url(normpath(abspath(path))))
>>> path_to_url(r"\\unc\as\path")
'file:////unc/as/path'

@nineteendo
Copy link
Contributor

@pradyunsg, it's caused by #113563:

import urllib.parse
from urllib.parse import _coerce_args, urljoin, uses_netloc

def urlunsplit(components):
    scheme, netloc, url, query, fragment, _coerce_result = (_coerce_args(*components))
    if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
        if url and url[:1] != '/': url = '/' + url
        url = '//' + (netloc or '') + url
    if scheme:
        url = scheme + ':' + url
    if query:
        url = url + '?' + query
    if fragment:
        url = url + '#' + fragment
    return _coerce_result(url)

print(urljoin("file:", "////unc/as/path")) # file:////unc/as/path
urllib.parse.urlunsplit = urlunsplit
print(urljoin("file:", "////unc/as/path")) # file://unc/as/path

@eryksun
Copy link
Contributor

eryksun commented Jun 12, 2024

Yes, in #113563, urllib.parse was switched from using 2-slash UNC paths to 4-slash UNC paths. Both forms are valid. The implementation now supports a roundtrip split/unsplit correctly for both forms. For example:

>>> r = urllib.parse.urlsplit('file:////unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file:////unc/as/path'
>>> r = urllib.parse.urlsplit('file://unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file://unc/as/path'

Previously, this didn't work correctly for 4-slash paths:

>>> r = urllib.parse.urlsplit('file:////unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file://unc/as/path'
>>> r = urllib.parse.urlsplit('file://unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file://unc/as/path'

@nineteendo
Copy link
Contributor

Eryk, is nturl2path.pathname2url() supposed to handle forward slashes?

>>> from nturl2path import pathname2url
>>> pathname2url('//unc/as/path')
'//unc/as/path'
>>> pathname2url(r'\\unc\as\path')
'////unc/as/path'

@eryksun
Copy link
Contributor

eryksun commented Jun 12, 2024

@nineteendo, I couldn't find an issue related to this. The standard library doesn't use urllib.request.pathname2url(), so it isn't currently an internal issue. However, the documentation claims to support "the local syntax for a path", which would include using forward slash as the path separator. The Windows implementation of pathname2url() is tested in "Lib/test/test_urllib.py", but AFAICT only with backslash as the path separator. I think the input path should be normalized via normpath() and then split via splitroot() to process it sanely.

Also, currently nothing else in the standard library uses the nturl2path module, and its existence isn't documented. The two functions could be defined directly in urllib.request. The way some of the tests are implemented would have to be updated as well.

noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
estyxx pushed a commit to estyxx/cpython that referenced this issue Jul 17, 2024
sobolevn added a commit to sobolevn/cpython that referenced this issue Sep 15, 2024
zooba pushed a commit that referenced this issue Sep 18, 2024
savannahostrowski pushed a commit to savannahostrowski/cpython that referenced this issue Sep 22, 2024
JelleZijlstra pushed a commit to JelleZijlstra/cpython that referenced this issue Oct 3, 2024
…m `path` back to `p` (pythonGH-124097)

(cherry picked from commit 3b6bfa7)

Co-authored-by: sobolevn <[email protected]>
JelleZijlstra added a commit that referenced this issue Oct 8, 2024
…h` back to `p` (GH-124097) (#124919)

(cherry picked from commit 3b6bfa7)

Co-authored-by: sobolevn <[email protected]>
@erlend-aasland
Copy link
Contributor

Closing the issue as completed, as the OP feature request seems to have been implemented.

@nineteendo
Copy link
Contributor

Technically there's still #119394.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

7 participants