Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize pathlib.Path.glob() by avoiding repeated calls to os.path.normcase() #104104

Closed
barneygale opened this issue May 2, 2023 · 4 comments
Closed
Labels
3.12 bugs and security fixes performance Performance or resource usage topic-pathlib type-feature A feature request or enhancement

Comments

@barneygale
Copy link
Contributor

barneygale commented May 2, 2023

As part of removing "flavour" classes in #31691, I changed pathlib's glob() implementation: previously it used re.IGNORECASE to implement case-insensitive matches, whereas after it called os.path.normcase() on the pattern and the paths. The new behaviour is a little slower, and I think we should restore the previous implementation.

Linked PRs

@barneygale barneygale added the type-feature A feature request or enhancement label May 2, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 2, 2023
…calls to `os.path.normcase()`

Use `re.IGNORECASE` to implement case-insensitive matching. This
restores behaviour from before python#31691.
@barneygale barneygale added performance Performance or resource usage 3.12 bugs and security fixes topic-pathlib labels May 2, 2023
barneygale added a commit that referenced this issue May 2, 2023
…to `os.path.normcase()` (GH-104105)

Use `re.IGNORECASE` to implement case-insensitive matching. This
restores behaviour from before GH-31691.
carljm added a commit to carljm/cpython that referenced this issue May 2, 2023
* main: (760 commits)
  pythonGH-104102: Optimize `pathlib.Path.glob()` handling of `../` pattern segments (pythonGH-104103)
  pythonGH-104104: Optimize `pathlib.Path.glob()` by avoiding repeated calls to `os.path.normcase()` (pythonGH-104105)
  pythongh-103822: [Calendar] change return value to enum for day and month APIs (pythonGH-103827)
  pythongh-65022: Fix description of tuple return value in copyreg (python#103892)
  pythonGH-103525: Improve exception message from `pathlib.PurePath()` (pythonGH-103526)
  pythongh-84436: Add integration C API tests for immortal objects (pythongh-103962)
  pythongh-103743: Add PyUnstable_Object_GC_NewWithExtraData (pythonGH-103744)
  pythongh-102997: Update Windows installer to SQLite 3.41.2. (python#102999)
  pythonGH-103484: Fix redirected permanently URLs (python#104001)
  Improve assert_type phrasing (python#104081)
  pythongh-102997: Update macOS installer to SQLite 3.41.2. (pythonGH-102998)
  pythonGH-103472: close response in HTTPConnection._tunnel (python#103473)
  pythongh-88496: IDLE - fix another test on macOS (python#104075)
  pythongh-94673: Hide Objects in PyTypeObject Behind Accessors (pythongh-104074)
  pythongh-94673: Properly Initialize and Finalize Static Builtin Types for Each Interpreter (pythongh-104072)
  pythongh-104016: Skip test for deeply neste f-strings on wasi (python#104071)
  pythongh-104057: Fix direct invocation of test_super (python#104064)
  pythongh-87092: Expose assembler to unit tests (python#103988)
  pythongh-97696: asyncio eager tasks factory (python#102853)
  pythongh-84436: Immortalize in _PyStructSequence_InitBuiltinWithFlags() (pythongh-104054)
  ...
@eryksun
Copy link
Contributor

eryksun commented May 3, 2023

Using re.IGNORECASE is wrong on Windows. Correct case-insensitive filename comparisons on Windows have to use os.path.normcase(). The implementation calls WinAPI LCMapStringEx() with the invariant locale to get a simple lowercase conversion of characters in the basic multilingual plane (BMP). This case mapping is always a one-to-one character mapping, and no mapping is implemented for characters beyond the BMP. This basically matches how filesystems on Windows implement case-insensitive name comparisons, except they use an uppercase conversion, which doesn't matter for checking equality. (It matters a bit for the sort order of filenames.)

For example, Python's lowercase mapping of "İ" (U+0130) is a two-character string:

>>> print(ascii('İ'.lower()))
'i\u0307'

To a Windows filesystem, "İ" and "i\u0307" are different filenames.

>>> open('i\u0307', 'w').close()
>>> open('\u0130', 'w').close()
>>> names = os.listdir()
>>> names
['i̇', 'İ']
>>> os.path.normcase(names[0])
'i̇'
>>> os.path.normcase(names[1])
'İ'
>>> os.path.normcase(names[0]) == os.path.normcase(names[1])
False

Using re.IGNORECASE yields a false positive match:

>>> re.match(names[1], names[0], re.IGNORECASE)
<re.Match object; span=(0, 1), match='i'>

@eryksun eryksun reopened this May 3, 2023
@barneygale
Copy link
Contributor Author

That bug existed in all previous versions of pathlib; I didn't intentionally fix it IIRC. Personally I consider it pretty minor. I'm hoping to add a case_sensitive argument to glob() in #102710, and so I'd rather keep the case normalization rules reasonably generic.

@eryksun
Copy link
Contributor

eryksun commented May 3, 2023

Okay, I didn't review the code in detail. I just wanted you to be aware that Windows paths should not be case-insensitively compared for equality using Python's lowercase mapping. The system's locale-invariant, non-linguistic case mapping has to be used when comparing paths.

@eryksun eryksun closed this as completed May 3, 2023
@eryksun
Copy link
Contributor

eryksun commented May 3, 2023

This case mapping is always a one-to-one character mapping, and no mapping is implemented for characters beyond the BMP.

Of course when I actually checked this just now I discovered a bug. When os.path.normcase() was modified to use WinAPI LCMapStringEx() instead of str.lower(), I only verified that it worked correctly for characters in the BMP and that using str.lower() was wrong for 200 characters in the BMP. Unfortunately the documented behavior of LCMapStringEx() is wrong for non-BMP characters. It's supposed to default to "file system" rules, for which the surrogate pair representation of a non-BMP character should be handled as two UCS-2 ordinals. But LCMapStringEx() actually implements a case mapping for 40 non-BMP characters:

>>> os.listdir()
[]
>>> chars = [c for i in range(65536, sys.maxunicode) if normcase(c:=chr(i)) != c]
>>> len(chars)
40
>>> [(c, normcase(c)) for c in chars]
[('𐐀', '𐐨'), ('𐐁', '𐐩'), ('𐐂', '𐐪'), ('𐐃', '𐐫'), ('𐐄', '𐐬'), ('𐐅', '𐐭'), ('𐐆', '𐐮'), ('𐐇', '𐐯'), ('𐐈',
 '𐐰'), ('𐐉', '𐐱'), ('𐐊', '𐐲'), ('𐐋', '𐐳'), ('𐐌', '𐐴'), ('𐐍', '𐐵'), ('𐐎', '𐐶'), ('𐐏', '𐐷'), ('𐐐', '𐐸'),
 ('𐐑', '𐐹'), ('𐐒', '𐐺'), ('𐐓', '𐐻'), ('𐐔', '𐐼'), ('𐐕', '𐐽'), ('𐐖', '𐐾'), ('𐐗', '𐐿'), ('𐐘', '𐑀'), ('𐐙',
 '𐑁'), ('𐐚', '𐑂'), ('𐐛', '𐑃'), ('𐐜', '𐑄'), ('𐐝', '𐑅'), ('𐐞', '𐑆'), ('𐐟', '𐑇'), ('𐐠', '𐑈'), ('𐐡', '𐑉'),
 ('𐐢', '𐑊'), ('𐐣', '𐑋'), ('𐐤', '𐑌'), ('𐐥', '𐑍'), ('𐐦', '𐑎'), ('𐐧', '𐑏')]

To the filesystem, each of these normalized names is unique:

>>> len(os.listdir())
0
>>> for c in chars: open(c, 'w').close(); open(normcase(c), 'w').close()
...
>>> len(os.listdir())
80

I don't know how to efficiently resolve this bug, short of calling a lower-level NTAPI function instead of LCMapStringEx(). I think we need a new path comparison function in os.path. On Windows, it can call CompareStringOrdinal() and default to a case-insensitive comparison. CompareStringOrdinal() uses the same uppercase mapping as the filesystem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes performance Performance or resource usage topic-pathlib type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants