gh-110289: C API: Add PyUnicode_EqualToUTF8() function #110297

serhiy-storchaka · 2023-10-03T16:03:09Z

Issue: C API: Add PyUnicode_EqualToUTF8() function #110289

📚 Documentation preview 📚: https://cpython-previews--110297.org.readthedocs.build/

vstinner

The inline UTF-8 encoder in PyUnicode_EqualToString() is hard for me to review right now. I would feel more comfortable with tests, especially for corner cases:

string not encoded to UTF-8
Evil surrogate characters

Doc/c-api/unicode.rst

vstinner · 2023-10-03T16:50:00Z

Objects/unicodeobject.c

+    assert(str);
+    if (PyUnicode_IS_ASCII(unicode)) {
+        size_t len = (size_t)PyUnicode_GET_LENGTH(unicode);
+        return strlen(str) == len &&


I would prefer to test the length first, to make the code more readable.

Like:

if (strlen(str) != len) { return 0; } return memcmp(...) == 0;

Same below.

It is the same in _PyUnicode_EqualToASCIIString().

How

if (!a) { return 0; } return b;

is more readable than simple return a && b;? That is what the && operator is for.

is more readable than simple return a && b;?

For me, it's easier to reason about a single test per line when I review code.

Keep a && b if you prefer.

The readability problem as I see it, is that your && use has side effects; it is not a pure logic expression.

For me, it's easier to reason about a single test per line when I review code.

Fortunately, every condition here is already on a separate line.

The readability problem as I see it, is that your && use has side effects; it is not a pure logic expression.

That is how && works in C. There is a lot of code like arg != NULL && PyDict_Check(arg) && PyDict_GET_SIZE(arg) > count. I do not think rewriting it as three ifs with gotos would improve readability.
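To make the two styles under discussion concrete, here is a minimal standalone sketch. The names `equal_and` and `equal_ifs` are hypothetical and do not appear in the PR; both functions rely on the same guarantee, namely that memcmp() is only reached when the lengths match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Style kept in the PR: a single expression. The && operator
 * short-circuits, so memcmp() runs only when the lengths are equal. */
static int
equal_and(const char *data, size_t len, const char *str)
{
    assert(data != NULL && str != NULL);
    return strlen(str) == len &&
           memcmp(data, str, len) == 0;
}

/* Style suggested in review: one test per statement. */
static int
equal_ifs(const char *data, size_t len, const char *str)
{
    assert(data != NULL && str != NULL);
    if (strlen(str) != len) {
        return 0;
    }
    return memcmp(data, str, len) == 0;
}
```

Both variants return exactly 0 or 1 and should compile to essentially the same code; the difference is purely one of reading preference.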

vstinner · 2023-10-03T16:52:57Z

Suggestion for a different function name to avoid any confusion... and make it shorter: PyUnicode_EqualToUTF8().

serhiy-storchaka · 2023-10-03T18:26:07Z

I considered two variants: PyUnicode_EqualToUTF8String() and PyUnicode_EqualToString().

Doc/c-api/unicode.rst

vstinner · 2023-10-04T07:08:45Z

Doc/c-api/unicode.rst

   Compare a Unicode object with a UTF-8 encoded C string and return true
   if they are equal and false otherwise.


Suggested change

-   Compare a Unicode object with a UTF-8 encoded C string and return true
-   if they are equal and false otherwise.
+   Compare a Unicode object with a UTF-8 encoded C string and return non-zero
+   if they are equal or 0 otherwise.

It looks to me that "return true" is used more often than "return non-zero". In this case it is also more accurate, because the function always returns 1, never some other non-zero value. Perhaps the other functions documented as returning non-zero were once macros that could return a value other than 1 (something like (arg->flags & FLAG))?
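The distinction can be illustrated with a small standalone sketch. `FLAG_READY`, `HAS_FLAG`, `is_equal` and `demo_truth_values` are hypothetical names chosen for this example only: a bit-mask macro yields whatever bits survive the mask, while a comparison in C always yields exactly 0 or 1.

```c
#include <assert.h>

#define FLAG_READY 0x04   /* hypothetical flag bit */

/* Macro-style check: yields the masked bits, so "true" is 4 here, not 1. */
#define HAS_FLAG(flags) ((flags) & FLAG_READY)

/* Comparison-style check: == always yields exactly 0 or 1 in C. */
static int
is_equal(int a, int b)
{
    return a == b;
}

static void
demo_truth_values(void)
{
    assert(HAS_FLAG(0x06) == 4);   /* non-zero, but not 1 */
    assert(HAS_FLAG(0x02) == 0);
    assert(is_equal(7, 7) == 1);   /* exactly 1 */
    assert(is_equal(7, 8) == 0);
}
```

Documenting the first kind of check as "returns non-zero" is the safe wording; for the second kind, "returns true (1)" is both accurate and more specific.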

vstinner · 2023-10-04T07:09:41Z

Doc/whatsnew/3.13.rst

  a :c:expr:`const char*` UTF-8 encoded bytes string and return true if they
  are equal or false otherwise.


Suggested change

-  a :c:expr:`const char*` UTF-8 encoded bytes string and return true if they
-  are equal or false otherwise.
+  a :c:expr:`const char*` UTF-8 encoded bytes string and return non-zero if they
+  are equal or 0 otherwise.

Lib/test/test_capi/test_unicode.py

Doc/c-api/unicode.rst

vstinner · 2023-10-04T07:23:37Z

Objects/unicodeobject.c

+        }
+        else if (ch < 0x800) {
+            if ((0xc0 | (ch >> 6)) != (unsigned char)*str++ ||
+                (0x80 | (ch & 0x3f)) != (unsigned char)*str++)


unsigned char byte1 = (0xc0 | (ch >> 6));
unsigned char byte2 = (0x80 | (ch & 0x3f));
if (str[0] != byte1 || str[1] != byte2) {
    return 0;
}

And declare a str variable as unsigned char* once to avoid casting str at each byte comparison.

If the first comparison fails, you do not need to calculate the second byte. The code looks more compact and uniform the way it is written right now. I copied all the expressions from the UTF-8 encoder which I wrote 11 years ago, so there is no need to recheck them. Casting to unsigned char is not a large burden, but if you prefer, I can introduce a new unsigned char* variable.
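As an illustration of the two-byte case discussed here, the following standalone sketch checks whether two bytes are the UTF-8 encoding of a code point in the range U+0080..U+07FF. The helper name `match_2byte` is hypothetical; the bit expressions are the ones quoted in the diff above.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if the two bytes at s are the UTF-8 encoding of ch,
 * where ch is in the range U+0080..U+07FF; return 0 otherwise.
 * A terminating NUL byte never matches, because both expected
 * bytes have their high bit set. */
static int
match_2byte(uint32_t ch, const unsigned char *s)
{
    assert(ch >= 0x80 && ch < 0x800);
    /* Lead byte: 110xxxxx, continuation byte: 10xxxxxx.
     * || short-circuits, so s[1] is not read if s[0] differs. */
    if (s[0] != (0xc0 | (ch >> 6)) ||
        s[1] != (0x80 | (ch & 0x3f)))
    {
        return 0;
    }
    return 1;
}
```

For example, U+00E9 encodes as the bytes 0xC3 0xA9, so `match_2byte(0xE9, (const unsigned char *)"\xc3\xa9")` returns 1. Because the second expected byte is only computed when the first one matches, this keeps the "no wasted work" property mentioned in the reply.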

Co-authored-by: Victor Stinner <[email protected]>

Doc/c-api/unicode.rst

vstinner · 2023-10-04T08:44:57Z

Doc/c-api/unicode.rst

@@ -1396,6 +1396,18 @@ They all return ``NULL`` or ``-1`` if an exception occurs.
   :c:func:`PyErr_Occurred` to check for errors.


+.. c:function:: int PyUnicode_EqualToUTF8(PyObject *unicode, const char *string)
+
+   Compare a Unicode object with a UTF-8 encoded C string and return true (``1``)


Suggested change

-   Compare a Unicode object with a UTF-8 encoded C string and return true (``1``)
+   Compare a Unicode object with a UTF-8 encoded or ASCII encoding C string and return true (``1``)

Maybe "ASCII encoded"?

vstinner · 2023-10-04T08:45:47Z

Objects/unicodeobject.c

+    assert(str);
+    if (PyUnicode_IS_ASCII(unicode)) {
+        size_t len = (size_t)PyUnicode_GET_LENGTH(unicode);
+        return strlen(str) == len &&


is more readable than simple return a && b;?

For me, it's easier to reason about a single test per line when I review code.

Keep a && b if you prefer.

Lib/test/test_capi/test_unicode.py

Modules/_testcapi/unicode.c

erlend-aasland · 2023-10-04T08:53:17Z

Objects/unicodeobject.c

+    assert(str);
+    if (PyUnicode_IS_ASCII(unicode)) {
+        size_t len = (size_t)PyUnicode_GET_LENGTH(unicode);
+        return strlen(str) == len &&


The readability problem as I see it, is that your && use has side effects; it is not a pure logic expression.

serhiy-storchaka · 2023-10-04T13:28:13Z

I tried to rewrite the code in a more vertically sparse way:

int
PyUnicode_EqualToUTF8(PyObject *unicode, const char *str)
{
    assert(_PyUnicode_CHECK(unicode));
    assert(str);

    if (PyUnicode_IS_ASCII(unicode)) {
        size_t len = (size_t)PyUnicode_GET_LENGTH(unicode);
        if (strlen(str) != len) {
            return 0;
        }
        if (memcmp(PyUnicode_1BYTE_DATA(unicode), str, len) != 0) {
            return 0;
        }
        return 1;
    }
    if (PyUnicode_UTF8(unicode) != NULL) {
        size_t len = (size_t)PyUnicode_UTF8_LENGTH(unicode);
        if (strlen(str) != len) {
            return 0;
        }
        if (memcmp(PyUnicode_UTF8(unicode), str, len) != 0) {
            return 0;
        }
        return 1;
    }

    const unsigned char *s = (const unsigned char *)str;
    Py_ssize_t len = PyUnicode_GET_LENGTH(unicode);
    int kind = PyUnicode_KIND(unicode);
    const void *data = PyUnicode_DATA(unicode);
    /* Compare Unicode string and UTF-8 string */
    for (Py_ssize_t i = 0; i < len; i++) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        if (ch == 0) {
            return 0;
        }
        else if (ch < 0x80) {
            if (s[0] != ch) {
                return 0;
            }
            s += 1;
        }
        else if (ch < 0x800) {
            if (s[0] != (0xc0 | (ch >> 6))) {
                return 0;
            }
            if (s[1] != (0x80 | (ch & 0x3f))) {
                return 0;
            }
            s += 2;
        }
        else if (ch < 0x10000) {
            if (Py_UNICODE_IS_SURROGATE(ch)) {
                return 0;
            }
            if (s[0] != (0xe0 | (ch >> 12))) {
                return 0;
            }
            if (s[1] != (0x80 | ((ch >> 6) & 0x3f))) {
                return 0;
            }
            if (s[2] != (0x80 | (ch & 0x3f))) {
                return 0;
            }
            s += 3;
        }
        else {
            assert(ch <= MAX_UNICODE);
            if (s[0] != (0xf0 | (ch >> 18))) {
                return 0;
            }
            if (s[1] != (0x80 | ((ch >> 12) & 0x3f))) {
                return 0;
            }
            if (s[2] != (0x80 | ((ch >> 6) & 0x3f))) {
                return 0;
            }
            if (s[3] != (0x80 | (ch & 0x3f))) {
                return 0;
            }
            s += 4;
        }
    }
    return *s == 0;
}

and it causes dizziness and eye pain in me. It is physically difficult for me to read it.
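For readers who want to experiment with the algorithm outside CPython, here is a standalone sketch of the same idea, under stated assumptions: the function name `codepoints_equal_utf8` is hypothetical, the input is a plain array of code points rather than a PyObject, and each code point is first encoded into a small stack buffer. That last choice is simpler to read but does slightly more work than the PR code, which stops computing bytes at the first mismatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare an array of code points with a NUL-terminated UTF-8 string
 * without allocating memory. Returns 1 if equal, 0 otherwise. */
static int
codepoints_equal_utf8(const uint32_t *cp, size_t len, const char *str)
{
    assert(cp != NULL || len == 0);
    assert(str != NULL);
    const unsigned char *s = (const unsigned char *)str;

    for (size_t i = 0; i < len; i++) {
        uint32_t ch = cp[i];
        unsigned char buf[4];
        size_t n;

        if (ch == 0 || ch > 0x10ffff) {
            /* NUL cannot occur inside a C string; larger values are invalid. */
            return 0;
        }
        if (ch < 0x80) {
            buf[0] = (unsigned char)ch;
            n = 1;
        }
        else if (ch < 0x800) {
            buf[0] = (unsigned char)(0xc0 | (ch >> 6));
            buf[1] = (unsigned char)(0x80 | (ch & 0x3f));
            n = 2;
        }
        else if (ch < 0x10000) {
            if (ch >= 0xd800 && ch <= 0xdfff) {
                /* Surrogates have no strict UTF-8 encoding. */
                return 0;
            }
            buf[0] = (unsigned char)(0xe0 | (ch >> 12));
            buf[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3f));
            buf[2] = (unsigned char)(0x80 | (ch & 0x3f));
            n = 3;
        }
        else {
            buf[0] = (unsigned char)(0xf0 | (ch >> 18));
            buf[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3f));
            buf[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3f));
            buf[3] = (unsigned char)(0x80 | (ch & 0x3f));
            n = 4;
        }
        /* Compare byte by byte and stop at the first mismatch. Every
         * encoded byte is non-zero, so the terminating NUL of str always
         * mismatches and we never read past it. */
        for (size_t k = 0; k < n; k++) {
            if (s[k] != buf[k]) {
                return 0;
            }
        }
        s += n;
    }
    /* Equal only if the C string ends exactly here. */
    return *s == 0;
}
```

Like the function in the PR, this sketch uses no heap and has no error path: any input that cannot match simply yields 0.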

Doc/c-api/unicode.rst

vstinner · 2023-10-04T13:55:14Z

Lib/test/test_capi/test_unicode.py

+        # CRASHES equaltoutf8(b'abc', b'abc')
+        # CRASHES equaltoutf8([], b'abc')
+        # CRASHES equaltoutf8(NULL, b'abc')
+        # CRASHES equaltoutf8('abc')  # NULL


Suggested change

-        # CRASHES equaltoutf8('abc')  # NULL
+        # CRASHES equaltoutf8('abc', NULL)

No, it does not work that way.

NULL is defined as None, and equaltoutf8('abc', None) is a TypeError.

If equaltoutf8() is called with only one argument, it passes NULL as the second argument to PyUnicode_EqualToUTF8(), so we can test it and ensure that it indeed crashes. This is a common approach used in other tests in this file for const char * arguments. Some functions do not crash, but raise an exception or return successfully for NULL; this function simply crashes in a debug build.

Oh ok, I thought that they were just pseudo-code as comments. Sure, you can leave # NULL if you prefer.

Hmm, I copied this pattern from the test for PyUnicode_CompareWithASCIIString(), which was one of the first tests written. In newer tests I use "z#", which allows passing None for NULL. Or perhaps I changed this everywhere except in the test for PyUnicode_CompareWithASCIIString(). So perhaps I can change this too.

Lib/test/test_capi/test_unicode.py

Modules/_testcapi/unicode.c

Lib/test/test_capi/test_unicode.py

Doc/whatsnew/3.13.rst

Doc/c-api/unicode.rst

Co-authored-by: Victor Stinner <[email protected]>

vstinner · 2023-10-04T14:57:00Z

I prepared a PR to add this function to Python 2.7-3.12 in the pythoncapi-compat project: python/pythoncapi-compat#78

I chose to write a simple implementation:

        utf8 = PyUnicode_AsUTF8AndSize(unicode, &len);
        if (utf8 == NULL) {
            // Memory allocation failure. The API cannot report error,
            // so clear the exception and return 0.
            PyErr_Clear();
            return 0;
        }

It's tempting to ask you to modify the API to return -1 on error, but on the other hand I hate APIs where a simple task like "compare two strings" can fail :-( Most people simply... don't check for errors.

So, well, I like the proposed API: a function which cannot fail.

serhiy-storchaka · 2023-10-04T15:07:53Z

Oh, other features of this implementation:

It can be called when an error is set and preserves it.
It does not use heap, so it can be used when MemoryError has been raised.

vstinner

LGTM. I just left a few more minor comments.

Include/unicodeobject.h

vstinner · 2023-10-04T15:16:19Z

Lib/test/test_capi/test_unicode.py

+        # CRASHES equaltoutf8(b'abc', b'abc')
+        # CRASHES equaltoutf8([], b'abc')
+        # CRASHES equaltoutf8(NULL, b'abc')
+        # CRASHES equaltoutf8('abc')  # NULL


Oh ok, I thought that they were just pseudo-code as comments. Sure, you can leave # NULL if you prefer.

Modules/_testcapi/unicode.c

encukou · 2023-10-05T08:47:51Z

IMO we need a general strategy for dealing with strings. Let's not solve this just for PyUnicode_Equal, but design something that we'll also use for, say, dict and attribute lookup.

Having two functions, one for a zero-terminated string and one with a separate length argument, sounds good to me. And we also want a third one that takes PyUnicode. (Yes, in this case we have it already).

Which of those should be in what kind of C-API? Which should be in the stable ABI, and which can just be inline functions? What should the naming conventions be? Is the char* const? What's the thread safety story?
Please delay merging until after the sprint -- I hope to come up with a proposal for how to answer questions like that, consistently.

vstinner · 2023-10-05T08:53:40Z

Which of those should be in what kind of C-API?

The 3 flavors should be exposed as regular function calls.

vstinner · 2023-10-05T13:19:35Z

This is 2023 and null-encoded C strings are definitely not a good idea for new C APIs.

Would you mind elaborating on why/how using null-terminated C strings became a bad thing in 2023?

pitrou · 2023-10-05T13:27:11Z

Would you mind elaborating on why/how using null-terminated C strings became a bad thing in 2023?

"Became a bad thing in 2023" is your interpretation. It has always been a design mistake, but it becomes even more glaring when interoperating with other languages which made the correct decision (that is, strings in those languages store their size explicitly).

In the distant times when the CPython C API was only called from C software, expecting null-terminated strings was fine, but it's not anymore.

vstinner · 2023-10-05T13:33:09Z

In the distant times when the CPython C API was only called from C software, expecting null-terminated strings was fine, but it's not anymore.

I don't think that it's worth arguing. We should just add an API without a size, and an API with a size. That's all.

The API without size is at least needed to upgrade all users of _PyUnicode_Equal() and _PyUnicode_EqualToASCIIId(), removed in Python 3.13.
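The "one API without a size, one with a size" design can be sketched on raw byte buffers as follows. The names `bytes_equal` and `bytes_equal_and_size` are hypothetical and only mirror the naming pattern; the sketch does not use the CPython API. The point is that the NUL-terminated variant can be a thin wrapper around the size-aware one, while only the size-aware one can handle embedded NUL bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Size-aware comparison: the caller passes the byte length explicitly,
 * so embedded NUL bytes are compared like any other byte. */
static int
bytes_equal_and_size(const char *data, size_t data_len,
                     const char *str, size_t size)
{
    assert(data != NULL && str != NULL);
    return data_len == size && memcmp(data, str, size) == 0;
}

/* NUL-terminated convenience wrapper: the length comes from strlen(),
 * so str is cut at its first NUL byte. */
static int
bytes_equal(const char *data, size_t data_len, const char *str)
{
    assert(str != NULL);
    return bytes_equal_and_size(data, data_len, str, strlen(str));
}
```

With a 3-byte buffer containing `a`, NUL, `b`, the size-aware call reports equality with an identical 3-byte string, whereas the wrapper sees only the 1-byte string `a` and reports inequality. Callers from languages that store string lengths explicitly would naturally use the size-aware form.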

pitrou · 2023-10-05T13:34:44Z

I don't think that it's worth arguing.

Ah... I'm reassured, thank you.

vstinner · 2023-10-05T14:05:44Z

Oh, apparently this PR is now discussed at https://discuss.python.org/t/new-pyunicode-equaltoutf8-function/35377

davidhewitt · 2023-10-05T15:34:38Z

Objects/unicodeobject.c

+            s += 4;
+        }
+    }
+    return *s == 0;


I suppose that if we return true at this point, then we know that str is the UTF-8 representation of unicode. Does it make sense to copy the contents into unicode->utf8 so that future operations can take the fast path without needing to encode again?

That needs separate research and discussion. The disadvantage is that it increases memory consumption and also costs some CPU time, so there is a benefit only if the UTF-8 cache is used later.

If the idea turns out to be good, it can simply be implemented in the future.

Makes total sense. I guess this also sits in an awkward place where it's likely that the user is best suited to know whether or not they want the utf-8 cache populated, but it's also an implementation detail that we don't really want to expose to users. ~~For now I'll just mark this comment as resolved.~~ Edit I can't, probably lack permissions I guess.

PyUnicode_EqualToUTF8() doesn't raise an exception and cannot fail. Trying to allocate memory should not raise MemoryError here, but it still sounds like a non-trivial side effect.

Worst case: 1 GB string, you call PyUnicode_EqualToUTF8() and suddenly, Python allocates 1 GB more. I would be surprised by this behavior.

Maybe it's worth it to add a comment explaining why we don't cache the UTF-8 encoded string.

Co-authored-by: Antoine Pitrou <[email protected]>

Doc/c-api/unicode.rst

vstinner · 2023-10-06T10:04:16Z

Doc/c-api/unicode.rst

@@ -1396,18 +1396,28 @@ They all return ``NULL`` or ``-1`` if an exception occurs.
   :c:func:`PyErr_Occurred` to check for errors.


-.. c:function:: int PyUnicode_EqualToUTF8(PyObject *unicode, const char *string)
+.. c:function:: int PyUnicode_EqualToUTF8AndSize(PyObject *unicode, const char *string, Py_ssize_t size)


What do you think about renaming string to utf8_str? The utf8_ prefix would be another way to document that it is expected to be UTF-8 encoded, and it would also make it easier (for me) to see that the second argument is a bytes string, since the name string is quite generic.

It is a part of the bigger issue. See #62897.

Doc/c-api/unicode.rst

Lib/test/test_capi/test_unicode.py

Objects/unicodeobject.c

Co-authored-by: Victor Stinner <[email protected]>

vstinner

LGTM the updated PR which now also adds PyUnicode_EqualToUTF8AndSize(). You just have to fix the merge conflict.

So far, I didn't see any real blocker issue in the healthy discussion.

Lib/test/test_capi/test_unicode.py

Offending docstrings were removed; dismissing my request for changes

serhiy-storchaka · 2023-10-11T07:13:24Z

@pitrou, does it look good to you now?

Lib/test/test_capi/test_unicode.py

pitrou · 2023-10-11T08:21:54Z

Misc/stable_abi.toml

+[function.PyUnicode_EqualToUTF8]
+    added = '3.13'
+[function.PyUnicode_EqualToUTF8AndSize]
+    added = '3.13'


Unrelated question, but is there a plan to generate this file from Doc/data/stable_abi.dat or the reverse?

I think that Doc/data/stable_abi.dat is generated from Misc/stable_abi.toml.

Ah, ok, thank you!

vstinner · 2023-10-12T10:00:46Z

I added these functions to pythoncapi-compat: python/pythoncapi-compat@99ab0d3

…alToUTF8AndSize() functions (pythonGH-110297)

pythongh-110289: C API: Add PyUnicode_EqualToString() function

d39945e

serhiy-storchaka requested a review from vstinner October 3, 2023 16:03

bedevere-app bot mentioned this pull request Oct 3, 2023

C API: Add PyUnicode_EqualToUTF8() function #110289

Closed

vstinner reviewed Oct 3, 2023


serhiy-storchaka added 2 commits October 3, 2023 21:20

Add tests and address review comments.

4793161

Merge branch 'main' into capi-PyUnicode_EqualToString

8b24911

serhiy-storchaka changed the title ~~gh-110289: C API: Add PyUnicode_EqualToString() function~~ gh-110289: C API: Add PyUnicode_EqualToUTF8() function Oct 3, 2023

vstinner reviewed Oct 4, 2023


serhiy-storchaka and others added 2 commits October 4, 2023 10:53

Apply suggestions from code review

c55f9ac

Co-authored-by: Victor Stinner <[email protected]>

Address some of review comments and test the UTF-8 cache.

bdf2f1e

serhiy-storchaka marked this pull request as ready for review October 4, 2023 08:35

serhiy-storchaka requested review from a team and encukou as code owners October 4, 2023 08:35

bedevere-app bot added the awaiting review label Oct 4, 2023

vstinner reviewed Oct 4, 2023


erlend-aasland reviewed Oct 4, 2023


Address review comments.

7223c14

vstinner reviewed Oct 4, 2023


vstinner mentioned this pull request Oct 4, 2023

Add PyUnicode_EqualToUTF8() function python/pythoncapi-compat#78

Merged

Apply suggestions from code review

b271327

Co-authored-by: Victor Stinner <[email protected]>

Remove trailing spaces.

6f26ad6

vstinner approved these changes Oct 4, 2023


bedevere-app bot added awaiting core review and removed awaiting review labels Oct 4, 2023

davidhewitt reviewed Oct 5, 2023


serhiy-storchaka and others added 2 commits October 5, 2023 22:31

Apply suggestions from code review

ee5781d

Co-authored-by: Antoine Pitrou <[email protected]>

Add PyUnicode_EqualToUTF8AndSize().

1a4eb7b

vstinner reviewed Oct 6, 2023


serhiy-storchaka and others added 4 commits October 7, 2023 15:43

Apply suggestions from code review

b124377

Co-authored-by: Victor Stinner <[email protected]>

Add more parentheses.

029f1a0

Remove redundant arguments.

be2ffe8

Merge branch 'main' into capi-PyUnicode_EqualToString

29b26f7

vstinner approved these changes Oct 7, 2023


erlend-aasland previously requested changes Oct 10, 2023

Lib/test/test_capi/test_unicode.py

Lib/test/test_capi/test_unicode.py

serhiy-storchaka added 2 commits October 10, 2023 23:34

Turn docstrings into comments.

78de49d

Merge branch 'main' into capi-PyUnicode_EqualToString

fc79d5e

pitrou reviewed Oct 11, 2023

Lib/test/test_capi/test_unicode.py

pitrou reviewed Oct 11, 2023

Lib/test/test_capi/test_unicode.py

pitrou reviewed Oct 11, 2023


Add tests for empty strings.

19ad126

serhiy-storchaka merged commit eb50cd3 into python:main Oct 11, 2023
27 checks passed

bedevere-app bot removed the awaiting core review label Oct 11, 2023

serhiy-storchaka deleted the capi-PyUnicode_EqualToString branch October 11, 2023 13:42

Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024

pythongh-110289: C API: Add PyUnicode_EqualToUTF8() and PyUnicode_Equ…

ee0b11b

…alToUTF8AndSize() functions (pythonGH-110297)

