gh-111089: Add cache to PyUnicode_AsUTF8() for embedded NUL #111587
Conversation
Since 2008,
Embedded null characters can lead to bugs, or worse, security vulnerabilities. I just modified PyUnicode_AsUTF8() in Python 3.13 to raise a ValueError if the string contains a null character: PR #111091. This change made the function safer, so I added it to the limited C API: PR #111121. Problem: @encukou has concerns about PyUnicode_AsUTF8() performance, since strlen() has O(n) complexity.

I propose adding a 2-bit cache to PyASCIIObject to keep PyUnicode_AsUTF8() safe while avoiding, whenever possible, the strlen() call that checks whether a string contains null characters. With the proposed change, PyUnicode_AsUTF8() doesn't have to call strlen() on strings created by PyUnicode_FromString(), which is the most common way to create a Python str object in the C API. Moreover, in general, PyUnicode_AsUTF8() has to call strlen() at most once per Python str object, since the result is then stored in the cache.

IMO the change doesn't make converting a Python str object to a C string worse. The Python C API has many other APIs to convert a Python str object to a C string, but they are usually less convenient, such as returning a bytes object from which you then have to get the pointer and the size separately. In C, strings terminated by a NUL byte are the most commonly used string format.

The change initializes the cache for global static strings, such as string singletons, and for the internal Latin-1 character singletons.

If the UTF-8 encoded string is under 100 bytes,
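To make the idea concrete, here is a minimal, hypothetical C sketch of such a 2-bit tri-state cache consulted on the lookup path. The names str_object, as_utf8_checked() and the field layout are illustrative assumptions, not the PR's actual code:

#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not the PR's actual code. */
enum { EMBED_NULL_UNKNOWN = 0, EMBED_NULL_NO = 1, EMBED_NULL_YES = 2 };

typedef struct {
    char *utf8;                  /* cached UTF-8 representation */
    size_t utf8_length;          /* length of utf8, excluding the trailing NUL */
    unsigned int embed_null : 2; /* tri-state cache: unknown / no / yes */
} str_object;

static const char *
as_utf8_checked(str_object *s)
{
    if (s->embed_null == EMBED_NULL_UNKNOWN) {
        /* Pay the O(n) strlen() at most once per object: a shorter
           strlen() result means the string embeds a NUL byte. */
        s->embed_null = (strlen(s->utf8) == s->utf8_length)
                        ? EMBED_NULL_NO : EMBED_NULL_YES;
    }
    if (s->embed_null == EMBED_NULL_YES) {
        return NULL;  /* caller reports "embedded null character" */
    }
    return s->utf8;   /* O(1) on every later call */
}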
I considered adding a similar 2-bit field for embedded lone surrogates. The UTF-16 and UTF-32 encoders can be faster if we know that the string does not contain them. It could extend the ASCII bit.
But I was not sure that it would justify the cost of the additional complexity.
I would prefer not to reuse the ASCII member, to avoid any backward compatibility issue. If tomorrow we run short of free bits, we can revisit the implementation, but it's not needed yet. embed_null=3 is invalid: _PyUnicode_CheckConsistency() fails in this case. It's a wasted "bit", but IMO it makes the API easier to use. I think that it's a good idea to add a cache for lone surrogates; it will be helpful on Windows, which often converts to UTF-16. As I discovered while implementing this PR, sometimes you can fill the cache without having to compute anything. For example, PyUnicode_FromString() cannot produce lone surrogates: you can initialize the cache for free.
@serhiy-storchaka: So, do you think that avoiding strlen() at each PyUnicode_AsUTF8() call is worth it?
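Continuing the hypothetical sketch above: a constructor that takes a NUL-terminated C string can fill the cache for free, and a consistency-check style assertion can reject the unused value 3. alloc_str() is an assumed helper, not a real API:

#include <assert.h>
#include <string.h>

/* Hypothetical constructor: the input is NUL-terminated, so the result
   cannot contain an embedded NUL and the cache is known without any scan. */
static str_object *
from_c_string(const char *cstr)
{
    /* Assumed: alloc_str(n) allocates n+1 bytes and sets utf8_length = n. */
    str_object *s = alloc_str(strlen(cstr));
    if (s == NULL) {
        return NULL;
    }
    memcpy(s->utf8, cstr, s->utf8_length + 1);
    s->embed_null = EMBED_NULL_NO;  /* filled for free */
    return s;
}

/* Consistency-check style assertion: 3 is never a valid cache value. */
#define ASSERT_STR_CONSISTENT(s)  assert((s)->embed_null != 3)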
I am not sure. Every use of the result of PyUnicode_AsUTF8() is O(n) anyway. But it increases the implementation complexity and makes the code more fragile. What happens when you modify the string in-place, for example by calling PyUnicode_Resize()? This flag could also be used in PyUnicode_AsWideCharString().
Only the first call. Then it's an O(1) operation when strlen() is omitted, no?
What do you do with the result of PyUnicode_AsUTF8()?
Currently, cached PyUnicode_AsUTF8() has a complexity of O(n). With this change, cached PyUnicode_AsUTF8() has a complexity of O(1). On a 1 GiB string, it takes 3 ns instead of 64 ms (23,787,089x faster ;-)). I wrote a benchmark; in short, it's a benchmark on strlen(str), since the UTF-8 encoded string itself is cached. Benchmark on cached PyUnicode_AsUTF8() comparing this PR to the main branch:
The benchmark warmup fills the PyUnicode_AsUTF8() cache, so the benchmark measures cached PyUnicode_AsUTF8(). Before, on a string of 1 GiB, strlen() took 64 ms. With this change, cached PyUnicode_AsUTF8() always takes 3 ns; it has a complexity of O(1) and doesn't depend on the string length.

bench.py:

import pyperf
from _testinternalcapi import bench_asutf8
import functools
runner = pyperf.Runner()
for text, size in (
('0', 0),
('1', 1),
('10', 10),
('10^3', 10**3),
('10^6', 10**6),
('1 MiB', 1024 ** 2),
('1 GiB', 1024 ** 3),
):
    runner.bench_time_func(f'asutf8 {text}', functools.partial(bench_asutf8, 'x' * size))

Patch to add the benchmark function:

diff --git a/Modules/_testinternalcapi.c b/Modules/_testinternalcapi.c
index a71e7e1dcc..6ccc89f28a 100644
--- a/Modules/_testinternalcapi.c
+++ b/Modules/_testinternalcapi.c
@@ -1639,6 +1639,28 @@ perf_trampoline_set_persist_after_fork(PyObject *self, PyObject *args)
}
+static PyObject *
+bench_asutf8(PyObject *self, PyObject *args)
+{
+ PyObject *str;
+ Py_ssize_t loops;
+ if (!PyArg_ParseTuple(args, "O!n", &PyUnicode_Type, &str, &loops)) {
+ return NULL;
+ }
+
+ _PyTime_t t = _PyTime_GetPerfCounter();
+ for (Py_ssize_t i=0; i < loops; i++) {
+ if (PyUnicode_AsUTF8(str) == NULL) {
+ return NULL;
+ }
+ }
+
+ _PyTime_t dt = _PyTime_GetPerfCounter() - t;
+
+ return PyFloat_FromDouble(_PyTime_AsSecondsDouble(dt));
+}
+
+
static PyMethodDef module_functions[] = {
{"get_configs", get_configs, METH_NOARGS},
{"get_recursion_depth", get_recursion_depth, METH_NOARGS},
@@ -1701,6 +1723,7 @@ static PyMethodDef module_functions[] = {
{"restore_crossinterp_data", restore_crossinterp_data, METH_VARARGS},
_TESTINTERNALCAPI_WRITE_UNRAISABLE_EXC_METHODDEF
_TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF
+ {"bench_asutf8", bench_asutf8, METH_VARARGS},
{NULL, NULL} /* sentinel */
};
Would you mind elaborating on how/when the second (cached) call to PyUnicode_AsUTF8() on the same Python str object has a complexity of O(n) with my change? I'm not sure that we are talking about the same thing.
Add PyASCIIObject.state.embed_null member to Python str objects. It is used as a cache by PyUnicode_AsUTF8() to only check once if a string contains a null character. Strings created by PyUnicode_FromString() initialize *embed_null*, since the string cannot contain a null character. Global static strings now also initialize the *embed_null* member. The chr(0) singleton ("\0" string) is the only static string which contains a null character.
Fix unicode_subtype_new
I updated my PR to clear the cache when unicode_resize() is called. So, indirectly, PyUnicode_Append() now clears the cache as well. I added an assertion to make sure that this is the case. In general, Python gives little or no guarantee about when a Python str object is modified. Functions that modify a string just check unicode_modifiable(), which only does basic checks. For example, functions like PyUnicode_Fill() and PyUnicode_CopyCharacters() don't check if the UTF-8 encoded string cache is already filled; only PyUnicode_Resize() clears the UTF-8 cache. In general, the PyUnicode API assumes that these functions are only used while creating a string, and that the string is no longer modified afterwards. (I would prefer to remove all C APIs that allow modifying a string, but that would be a backward-incompatible change and require more work.)
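For illustration, here is a hypothetical sketch of the invalidation described above: any in-place mutation of the buffer must reset the cache to "unknown", otherwise a later lookup could return stale information. resize_str() continues the assumed sketch and is not the PR's actual unicode_resize() code:

#include <stdlib.h>

static int
resize_str(str_object *s, size_t new_length)
{
    char *buf = realloc(s->utf8, new_length + 1);
    if (buf == NULL) {
        return -1;
    }
    s->utf8 = buf;
    s->utf8_length = new_length;
    /* The content may change after the resize: drop the cached answer. */
    s->embed_null = EMBED_NULL_UNKNOWN;
    return 0;
}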
I modified PyUnicode_AsWideCharString() in the first version of my change, but I reverted that change before creating this PR because of the Solaris code path (HAVE_NON_UNICODE_WCHAR_T_REPRESENTATION). Maybe I can propose a follow-up PR for PyUnicode_AsWideCharString() once this PR is merged.
Suggestion by Serhiy
@serhiy-storchaka: I addressed your suggestions. Would you mind reviewing the updated PR?
The benefit of the cache is minor (1 ns) for strings of about 10 characters. It starts to become interesting for strings of at least 1,000 characters.
Do you have any microbenchmarks that do not call PyUnicode_AsUTF8() directly in a loop? BTW, I just removed 2 calls of PyUnicode_AsUTF8().
This is nice, but how is it related to this change? There are use cases where you cannot avoid PyUnicode_AsUTF8(). Moreover, even when it's possible, it's not always trivial to avoid an encode+decode roundtrip, especially if you are calling third-party code.
UPDATE: Oops, I made a mistake the first time I ran my benchmark: I forgot to rebuild Python, so I measured the same binary twice. I reran the benchmark, and now I see a more significant difference with and without the change! Here is a different benchmark, on _codecs.lookup_error(). In short:
Result: the embed_null cache becomes very interesting starting at 10^6 characters (1.04x faster: 561 us => 542 us: saves 19 us). But the effect is visible starting at 0 characters (397 ns => 389 ns: saves 8 ns) :-) IMO this benchmark is closer to a "real-world" benchmark than my first micro-benchmark on PyUnicode_AsUTF8(), which was focused on the worst case. Benchmark run with CPU isolation on Linux.
Patch:

diff --git a/Modules/_testinternalcapi.c b/Modules/_testinternalcapi.c
index a71e7e1dcc..692d008cdf 100644
--- a/Modules/_testinternalcapi.c
+++ b/Modules/_testinternalcapi.c
@@ -1639,6 +1639,44 @@ perf_trampoline_set_persist_after_fork(PyObject *self, PyObject *args)
}
+static PyObject *
+bench_asutf8(PyObject *self, PyObject *args)
+{
+ PyObject *str;
+ Py_ssize_t loops;
+ if (!PyArg_ParseTuple(args, "O!n", &PyUnicode_Type, &str, &loops)) {
+ return NULL;
+ }
+
+ PyObject *codecs = PyImport_ImportModule("_codecs");
+ if (codecs == NULL) {
+ return NULL;
+ }
+ PyObject *lookup_error = PyObject_GetAttrString(codecs, "lookup_error");
+ Py_DECREF(codecs);
+ if (lookup_error == NULL) {
+ return NULL;
+ }
+
+ _PyTime_t t = _PyTime_GetPerfCounter();
+ for (Py_ssize_t i=0; i < loops; i++) {
+ PyObject *error = PyObject_CallOneArg(lookup_error, str);
+ if (error != NULL) {
+ Py_DECREF(error);
+ }
+ else {
+ PyErr_Clear();
+ }
+ }
+
+ _PyTime_t dt = _PyTime_GetPerfCounter() - t;
+
+ Py_DECREF(lookup_error);
+
+ return PyFloat_FromDouble(_PyTime_AsSecondsDouble(dt));
+}
+
+
static PyMethodDef module_functions[] = {
{"get_configs", get_configs, METH_NOARGS},
{"get_recursion_depth", get_recursion_depth, METH_NOARGS},
@@ -1701,6 +1739,7 @@ static PyMethodDef module_functions[] = {
{"restore_crossinterp_data", restore_crossinterp_data, METH_VARARGS},
_TESTINTERNALCAPI_WRITE_UNRAISABLE_EXC_METHODDEF
_TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF
+ {"bench_asutf8", bench_asutf8, METH_VARARGS},
{NULL, NULL} /* sentinel */
};
Script:

import pyperf
from _testinternalcapi import bench_asutf8
import functools
runner = pyperf.Runner()
for text, size in (
('0', 0),
('1', 1),
('10', 10),
('10^3', 10**3),
('10^6', 10**6),
('1 MiB', 1024 ** 2),
('1 GiB', 1024 ** 3),
):
    runner.bench_time_func(f'_codecs.lookup_error size={text}', functools.partial(bench_asutf8, 'x' * size))
I reran the PyUnicode_AsUTF8() micro-benchmark on the same machine, but with CPU isolation:
Aha, I still see the major difference without/with the cache for a 1 GiB string: 76.2 ms becomes 5.36 ns (14,220,978x faster).
When PyUnicode_AsUTF8() doesn't need to encode the string to UTF-8 because the data is already available (all ASCII strings!) and so "is O(1)", it's silly that PyUnicode_AsUTF8() wastes CPU cycles calling strlen() and so becomes O(n) again in practice, no?
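For context, the O(n) work being discussed here boils down to a length comparison, roughly like the following simplified sketch (the actual CPython code differs in details): for a compact ASCII string the UTF-8 data already exists, so detecting an embedded NUL is the only remaining linear work, and that is exactly what the cache skips.

#include <string.h>

/* Sketch: strlen() stops at the first NUL byte, so a result shorter than
   the known length means the string embeds a NUL. */
static int
has_embedded_nul(const char *utf8, size_t expected_length)
{
    return strlen(utf8) != expected_length;
}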
I'm sorry, I made a mistake when running my benchmark on _codecs.lookup_error(). Reminder: I made PyUnicode_AsUTF8() slower in the main branch by adding strlen(). In Python 3.12, strlen() is not called: the function doesn't reject null characters. This change just reduces the slowdown compared to Python 3.12.
When a user gets a UTF-8 C string from PyUnicode_AsUTF8(), processing it is usually O(n) anyway. Is there any valid O(1) use case of PyUnicode_AsUTF8()?
The elephant in the room: why isn't this deprecated with a deprecation warning?
I don't know.
Inada-san understands me.
It was made for correctness. If the
Even for ASCII strings you need O(n) to process them.
It shows that in the best possible example the difference is only a few percent. It's unlikely you can get better results with another real-world example. And a megabyte-long error handler name is not very realistic.
Well, it saves 8 ns on every call of _codecs.lookup_error().
Sure, I understand that sometimes the code processing the PyUnicode_AsUTF8() result is way slower than strlen(), and so the cache is not useful. There are various ways to "use a string". Example from ctypes:

switch (PyUnicode_AsUTF8(stgd->proto)[0]) {
case 'z': /* c_char_p */
case 'Z': /* c_wchar_p */

and:

fmt = PyUnicode_AsUTF8(dict->proto);
fd = _ctypes_get_fielddesc(fmt);
struct fielddesc *
_ctypes_get_fielddesc(const char *fmt)
{
...
for (; table->code; ++table) {
if (table->code == fmt[0])
return table;
}
...
}

A common pattern is to compare the PyUnicode_AsUTF8() result against multiple strings using strcmp(). Example from _elementtree:

const char *event_name = NULL;
if (PyUnicode_Check(event_name_obj)) {
event_name = PyUnicode_AsUTF8(event_name_obj);
} else if (PyBytes_Check(event_name_obj)) {
event_name = PyBytes_AS_STRING(event_name_obj);
}
...
if (strcmp(event_name, "start") == 0) {
...
} else if (strcmp(event_name, "end") == 0) {
...
} else if (strcmp(event_name, "start-ns") == 0) {
...
} else if (strcmp(event_name, "end-ns") == 0) {
...
...

Another benchmark on strcmp(), using _Py_GetErrorHandler(). Result: avoiding strlen() in PyUnicode_AsUTF8() thanks to the cache of this PR saves 1.2 ns: 1.11x faster.

_Py_GetErrorHandler():

_Py_error_handler
_Py_GetErrorHandler(const char *errors)
{
if (errors == NULL || strcmp(errors, "strict") == 0) {
return _Py_ERROR_STRICT;
}
if (strcmp(errors, "surrogateescape") == 0) {
return _Py_ERROR_SURROGATEESCAPE;
}
if (strcmp(errors, "replace") == 0) {
return _Py_ERROR_REPLACE;
}
...
}

Patch:

diff --git a/Modules/_testinternalcapi.c b/Modules/_testinternalcapi.c
index a71e7e1dcc..27a9004a5c 100644
--- a/Modules/_testinternalcapi.c
+++ b/Modules/_testinternalcapi.c
@@ -1639,6 +1639,40 @@ perf_trampoline_set_persist_after_fork(PyObject *self, PyObject *args)
}
+static PyObject *
+bench_asutf8(PyObject *self, PyObject *args)
+{
+ PyObject *str;
+ Py_ssize_t loops;
+ if (!PyArg_ParseTuple(args, "O!n", &PyUnicode_Type, &str, &loops)) {
+ return NULL;
+ }
+ Py_ssize_t found = 0;
+
+ _PyTime_t t = _PyTime_GetPerfCounter();
+ for (Py_ssize_t i=0; i < loops; i++) {
+ const char *utf8 = PyUnicode_AsUTF8(str);
+ if (utf8 == NULL) {
+ return NULL;
+ }
+ _Py_error_handler error_handler = _Py_GetErrorHandler(utf8);
+ if (error_handler == _Py_ERROR_STRICT) {
+ found++;
+ }
+ }
+
+ _PyTime_t dt = _PyTime_GetPerfCounter() - t;
+
+ // Cannot happen; just to make sure that the compiler doesn't remove
+ // the _Py_GetErrorHandler() call.
+ if (found > loops) {
+ PyErr_NoMemory();
+ }
+
+ return PyFloat_FromDouble(_PyTime_AsSecondsDouble(dt));
+}
+
+
static PyMethodDef module_functions[] = {
{"get_configs", get_configs, METH_NOARGS},
{"get_recursion_depth", get_recursion_depth, METH_NOARGS},
@@ -1701,6 +1735,7 @@ static PyMethodDef module_functions[] = {
{"restore_crossinterp_data", restore_crossinterp_data, METH_VARARGS},
_TESTINTERNALCAPI_WRITE_UNRAISABLE_EXC_METHODDEF
_TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF
+ {"bench_asutf8", bench_asutf8, METH_VARARGS},
{NULL, NULL} /* sentinel */
};

Script:

import pyperf
from _testinternalcapi import bench_asutf8
import functools
runner = pyperf.Runner()
runner.bench_time_func(f'asutf8', functools.partial(bench_asutf8, 'strict'))
I created this PR to reply to @encukou's concern about the performance overhead of adding strlen() to PyUnicode_AsUTF8(): #111089. If we consider that the overhead is not significant and that this cache has too many drawbacks compared to its advantages, I'm fine with not adding it.
You should maybe ask the author who added the comment. Maybe the issue was not important enough to "implement" a deprecation. But well, PyUnicode_AsUTF8() now checks for embedded null characters, so it's no longer needed to deprecate it. It seems like the majority of core devs who reviewed my recent changes to PyUnicode_AsUTF8() are fine with the new behavior. It also allowed using PyUnicode_AsUTF8() in places where PyUnicode_AsUTF8AndSize() was used before to check for embedded null characters.
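As an illustration of that last point, here is a hedged sketch of the before/after pattern, not a verbatim CPython excerpt: previously the embedded-NUL check had to be done by hand with PyUnicode_AsUTF8AndSize(), while the new PyUnicode_AsUTF8() behavior does it for you.

#include <Python.h>
#include <string.h>

/* Before: reject embedded null characters manually. */
static const char *
get_cstring_old(PyObject *obj)
{
    Py_ssize_t size;
    const char *s = PyUnicode_AsUTF8AndSize(obj, &size);
    if (s == NULL) {
        return NULL;
    }
    if (strlen(s) != (size_t)size) {
        PyErr_SetString(PyExc_ValueError, "embedded null character");
        return NULL;
    }
    return s;
}

/* After: PyUnicode_AsUTF8() itself fails on embedded null characters. */
static const char *
get_cstring_new(PyObject *obj)
{
    return PyUnicode_AsUTF8(obj);
}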
I approve because it looks technically correct to me.
I don't know if these changes should be made. The benefit is small in most real examples.
I don't know why I have to point this out, but the cache only amortises the cost when repeatedly calling PyUnicode_AsUTF8() on the same string.
Strings created by PyUnicode_FromString() and "static strings" are created with the cache initialized, so PyUnicode_AsUTF8() never calls strlen() on these strings. As I wrote before, PyUnicode_FromString() is one of the most common ways to create a Python str object in the C API.
* unicode_char()
* PyUnicode_FromWideChar(str, -1)
* _PyUnicode_Copy()
I tried to write the smallest PR that just adds the member, but there is room for enhancements: set the embed_null cache on newly created strings in more cases, with no performance overhead (without having to call strlen()). In short, all operations modifying an existing string can set embed_null if the embed_null of the modified string is known (is 0 or 1). Tell me if you would prefer that I make this PR more complete and attempt to set embed_null in every single case.
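A hedged sketch of that propagation idea, continuing the hypothetical example above (set_concat_cache() is an assumed helper, not part of the PR): when both inputs of an operation have a known cache value, the result's cache can be set without any scan.

static void
set_concat_cache(str_object *result, const str_object *a, const str_object *b)
{
    if (a->embed_null != EMBED_NULL_UNKNOWN
        && b->embed_null != EMBED_NULL_UNKNOWN)
    {
        /* A concatenation embeds a NUL iff one of its inputs does. */
        result->embed_null =
            (a->embed_null == EMBED_NULL_YES || b->embed_null == EMBED_NULL_YES)
            ? EMBED_NULL_YES : EMBED_NULL_NO;
    }
    else {
        result->embed_null = EMBED_NULL_UNKNOWN;
    }
}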
@@ -2232,6 +2240,7 @@ _PyUnicode_Copy(PyObject *unicode)
     memcpy(PyUnicode_DATA(copy), PyUnicode_DATA(unicode),
            length * PyUnicode_KIND(unicode));
+    _PyUnicode_STATE(copy).embed_null = _PyUnicode_STATE(unicode).embed_null;
_PyUnicode_Copy() makes a modifiable unicode object. It is legal to embed a null character in it after creation, or to replace an embedded null character with a non-null character. In particular, it creates a new copy even from Latin-1 character singletons.
Well, technically, any string can be modified at any time via the C API. People "should not do that", but since it's possible, I'm not sure it's safe to assume that people will not mutate a string long after its creation, that is, after the cache is initialized.
The trend of #111089 is more about reverting the PyUnicode_AsUTF8() change. While the design of this cache is appealing to me, it seems like there are multiple issues.
If tomorrow the public C API no longer allows mutating a string, we can reconsider the cache idea. But for now, it seems safer not to take the risk of adding a cache which can introduce bugs. Correctness matters more than performance here. I close my PR. Thanks everybody for looking into my cache ;-) It was a constructive discussion.