Optimize `pathlib.Path.glob()` by avoiding repeated calls to `os.path.normcase()` #104104

barneygale · 2023-05-02T20:02:45Z

As part of removing "flavour" classes in #31691, I changed pathlib's glob() implementation: previously it used re.IGNORECASE to implement case-insensitive matches, whereas after it called os.path.normcase() on the pattern and the paths. The new behaviour is a little slower, and I think we should restore the previous implementation.

Linked PRs

GH-104104: Optimize pathlib.Path.glob() by avoiding repeated calls to os.path.normcase() #104105

The text was updated successfully, but these errors were encountered:

…calls to `os.path.normcase()` Use `re.IGNORECASE` to implement case-insensitive matching. This restores behaviour from before python#31691.

…to `os.path.normcase()` (GH-104105) Use `re.IGNORECASE` to implement case-insensitive matching. This restores behaviour from before GH-31691.

* main: (760 commits) pythonGH-104102: Optimize `pathlib.Path.glob()` handling of `../` pattern segments (pythonGH-104103) pythonGH-104104: Optimize `pathlib.Path.glob()` by avoiding repeated calls to `os.path.normcase()` (pythonGH-104105) pythongh-103822: [Calendar] change return value to enum for day and month APIs (pythonGH-103827) pythongh-65022: Fix description of tuple return value in copyreg (python#103892) pythonGH-103525: Improve exception message from `pathlib.PurePath()` (pythonGH-103526) pythongh-84436: Add integration C API tests for immortal objects (pythongh-103962) pythongh-103743: Add PyUnstable_Object_GC_NewWithExtraData (pythonGH-103744) pythongh-102997: Update Windows installer to SQLite 3.41.2. (python#102999) pythonGH-103484: Fix redirected permanently URLs (python#104001) Improve assert_type phrasing (python#104081) pythongh-102997: Update macOS installer to SQLite 3.41.2. (pythonGH-102998) pythonGH-103472: close response in HTTPConnection._tunnel (python#103473) pythongh-88496: IDLE - fix another test on macOS (python#104075) pythongh-94673: Hide Objects in PyTypeObject Behind Accessors (pythongh-104074) pythongh-94673: Properly Initialize and Finalize Static Builtin Types for Each Interpreter (pythongh-104072) pythongh-104016: Skip test for deeply neste f-strings on wasi (python#104071) pythongh-104057: Fix direct invocation of test_super (python#104064) pythongh-87092: Expose assembler to unit tests (python#103988) pythongh-97696: asyncio eager tasks factory (python#102853) pythongh-84436: Immortalize in _PyStructSequence_InitBuiltinWithFlags() (pythongh-104054) ...

eryksun · 2023-05-03T01:27:06Z

Using re.IGNORECASE is wrong on Windows. Correct case-insensitive filename comparisons on Windows have to use os.path.normcase(). The implementation calls WinAPI LCMapStringEx() with the invariant locale to get a simple lowercase conversion of characters in the basic multilingual plane (BMP). This case mapping is always a one-to-one character mapping, and no mapping is implemented for characters beyond the BMP. This basically matches how filesystems on Windows implement case-insensitive name comparisons, except they use an uppercase conversion, which doesn't matter for checking equality. (It matters a bit for the sort order of filenames.)

For example, Python's lowercase mapping of "İ" (U+0130) is a two-character string:

>>> print(ascii('İ'.lower()))
'i\u0307'

To a Windows filesystem, "İ" and "i\u0307" are different filenames.

>>> open('i\u0307', 'w').close()
>>> open('\u0130', 'w').close()
>>> names = os.listdir()
>>> names
['i̇', 'İ']
>>> os.path.normcase(names[0])
'i̇'
>>> os.path.normcase(names[1])
'İ'
>>> os.path.normcase(names[0]) == os.path.normcase(names[1])
False

Using re.IGNORECASE yields a false positive match:

>>> re.match(names[1], names[0], re.IGNORECASE)
<re.Match object; span=(0, 1), match='i'>

barneygale · 2023-05-03T01:46:59Z

That bug existed in all previous versions of pathlib; I didn't intentionally fix it IIRC. Personally I consider it pretty minor. I'm hoping to add a case_sensitive argument to glob() in #102710, and so I'd rather keep the case normalization rules reasonably generic.

eryksun · 2023-05-03T05:03:54Z

Okay, I didn't review the code in detail. I just wanted you to be aware that Windows paths should not be case-insensitively compared for equality using Python's lowercase mapping. The system's locale-invariant, non-linguistic case mapping has to be used when comparing paths.

eryksun · 2023-05-03T22:36:55Z

This case mapping is always a one-to-one character mapping, and no mapping is implemented for characters beyond the BMP.

Of course when I actually checked this just now I discovered a bug. When os.path.normcase() was modified to use WinAPI LCMapStringEx() instead of str.lower(), I only verified that it worked correctly for characters in the BMP and that using str.lower() was wrong for 200 characters in the BMP. Unfortunately the documented behavior of LCMapStringEx() is wrong for non-BMP characters. It's supposed to default to "file system" rules, for which the surrogate pair representation of a non-BMP character should be handled as two UCS-2 ordinals. But LCMapStringEx() actually implements a case mapping for 40 non-BMP characters:

>>> os.listdir()
[]
>>> chars = [c for i in range(65536, sys.maxunicode) if normcase(c:=chr(i)) != c]
>>> len(chars)
40
>>> [(c, normcase(c)) for c in chars]
[('𐐀', '𐐨'), ('𐐁', '𐐩'), ('𐐂', '𐐪'), ('𐐃', '𐐫'), ('𐐄', '𐐬'), ('𐐅', '𐐭'), ('𐐆', '𐐮'), ('𐐇', '𐐯'), ('𐐈',
 '𐐰'), ('𐐉', '𐐱'), ('𐐊', '𐐲'), ('𐐋', '𐐳'), ('𐐌', '𐐴'), ('𐐍', '𐐵'), ('𐐎', '𐐶'), ('𐐏', '𐐷'), ('𐐐', '𐐸'),
 ('𐐑', '𐐹'), ('𐐒', '𐐺'), ('𐐓', '𐐻'), ('𐐔', '𐐼'), ('𐐕', '𐐽'), ('𐐖', '𐐾'), ('𐐗', '𐐿'), ('𐐘', '𐑀'), ('𐐙',
 '𐑁'), ('𐐚', '𐑂'), ('𐐛', '𐑃'), ('𐐜', '𐑄'), ('𐐝', '𐑅'), ('𐐞', '𐑆'), ('𐐟', '𐑇'), ('𐐠', '𐑈'), ('𐐡', '𐑉'),
 ('𐐢', '𐑊'), ('𐐣', '𐑋'), ('𐐤', '𐑌'), ('𐐥', '𐑍'), ('𐐦', '𐑎'), ('𐐧', '𐑏')]

To the filesystem, each of these normalized names is unique:

>>> len(os.listdir())
0
>>> for c in chars: open(c, 'w').close(); open(normcase(c), 'w').close()
...
>>> len(os.listdir())
80

I don't know how to efficiently resolve this bug, short of calling a lower-level NTAPI function instead of LCMapStringEx(). I think we need a new path comparison function in os.path. On Windows, it can call CompareStringOrdinal() and default to a case-insensitive comparison. CompareStringOrdinal() uses the same uppercase mapping as the filesystem.

barneygale added the type-feature A feature request or enhancement label May 2, 2023

barneygale added performance Performance or resource usage 3.12 bugs and security fixes topic-pathlib labels May 2, 2023

bedevere-bot mentioned this issue May 2, 2023

GH-104104: Optimize pathlib.Path.glob() by avoiding repeated calls to os.path.normcase() #104105

Merged

barneygale added a commit that referenced this issue May 2, 2023

GH-104104: Optimize pathlib.Path.glob() by avoiding repeated calls …

47770a1

…to `os.path.normcase()` (GH-104105) Use `re.IGNORECASE` to implement case-insensitive matching. This restores behaviour from before GH-31691.

barneygale closed this as completed May 2, 2023

eryksun reopened this May 3, 2023

eryksun closed this as completed May 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize `pathlib.Path.glob()` by avoiding repeated calls to `os.path.normcase()` #104104

Optimize `pathlib.Path.glob()` by avoiding repeated calls to `os.path.normcase()` #104104

barneygale commented May 2, 2023 •

edited by bedevere-bot

Loading

eryksun commented May 3, 2023 •

edited

Loading

barneygale commented May 3, 2023

eryksun commented May 3, 2023

eryksun commented May 3, 2023

Optimize pathlib.Path.glob() by avoiding repeated calls to os.path.normcase() #104104

Optimize pathlib.Path.glob() by avoiding repeated calls to os.path.normcase() #104104

Comments

barneygale commented May 2, 2023 • edited by bedevere-bot Loading

Linked PRs

eryksun commented May 3, 2023 • edited Loading

barneygale commented May 3, 2023

eryksun commented May 3, 2023

eryksun commented May 3, 2023

Optimize `pathlib.Path.glob()` by avoiding repeated calls to `os.path.normcase()` #104104

Optimize `pathlib.Path.glob()` by avoiding repeated calls to `os.path.normcase()` #104104

barneygale commented May 2, 2023 •

edited by bedevere-bot

Loading

eryksun commented May 3, 2023 •

edited

Loading