gh-111089: Add cache to PyUnicode_AsUTF8() for embedded NUL #111587
Conversation
Since 2008,
Embedded null characters can lead to bugs, or worse, security vulnerabilities. I just modified PyUnicode_AsUTF8() in Python 3.13 to raise a ValueError if the string contains a null character: PR #111091. This change made the function safer, so I added it to the limited C API: PR #111121. Problem: @encukou has concerns about PyUnicode_AsUTF8() performance, since strlen() has O(n) complexity.

I propose adding a 2-bit cache to PyASCIIObject to keep PyUnicode_AsUTF8() safe while avoiding, whenever possible, the strlen() call that checks whether a string contains null characters. With the proposed change, PyUnicode_AsUTF8() doesn't have to call strlen() on strings created by PyUnicode_FromString(), which is the most common way to create a Python str object in the C API. Moreover, in general, PyUnicode_AsUTF8() has to call strlen() at most once per Python str object, since the result is then stored in the cache.

IMO the change doesn't make converting a Python str object to a C string worse. The Python C API has many other APIs to convert a Python str object to a C string, but they are usually less convenient, such as returning a bytes object from which you then have to get the pointer and the size separately. In C, strings terminated by a NUL byte are the most commonly used string format.

The change initializes the cache for global static strings, such as string singletons, and for the internal Latin-1 character singletons.

If the UTF-8 encoded string is under 100 bytes,
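To make the idea concrete, here is a minimal, hypothetical C sketch of such a 2-bit tri-state cache consulted on the lookup path. The names str_object, as_utf8_checked() and the field layout are illustrative assumptions, not the PR's actual code:

#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not the PR's actual code. */
enum { EMBED_NULL_UNKNOWN = 0, EMBED_NULL_NO = 1, EMBED_NULL_YES = 2 };

typedef struct {
    char *utf8;                  /* cached UTF-8 representation */
    size_t utf8_length;          /* length of utf8, excluding the trailing NUL */
    unsigned int embed_null : 2; /* tri-state cache: unknown / no / yes */
} str_object;

static const char *
as_utf8_checked(str_object *s)
{
    if (s->embed_null == EMBED_NULL_UNKNOWN) {
        /* Pay the O(n) strlen() at most once per object: a shorter
           strlen() result means the string embeds a NUL byte. */
        s->embed_null = (strlen(s->utf8) == s->utf8_length)
                        ? EMBED_NULL_NO : EMBED_NULL_YES;
    }
    if (s->embed_null == EMBED_NULL_YES) {
        return NULL;  /* caller reports "embedded null character" */
    }
    return s->utf8;   /* O(1) on every later call */
}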
I considered adding a similar 2-bit field for embedded lone surrogates. The UTF-16 and UTF-32 encoders can be faster if we know that the string does not contain them. It could extend the ASCII bit.
But I was not sure that it would justify the cost of the additional complexity.
I would prefer not to reuse the ASCII member, to avoid any backward compatibility issue. If tomorrow we run short of free bits, we can revisit the implementation, but it's not needed yet. embed_null=3 is invalid: _PyUnicode_CheckConsistency() fails in this case. It's a wasted "bit", but IMO it makes the API easier to use. I think that it's a good idea to add a cache for lone surrogates; it will be helpful on Windows, which often converts to UTF-16. As I discovered while implementing this PR, sometimes you can fill the cache without having to compute anything. For example, PyUnicode_FromString() cannot produce lone surrogates: you can initialize the cache for free.
@serhiy-storchaka: So, do you think that avoiding strlen() at each PyUnicode_AsUTF8() call is worth it?
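Continuing the hypothetical sketch above: a constructor that takes a NUL-terminated C string can fill the cache for free, and a consistency-check style assertion can reject the unused value 3. alloc_str() is an assumed helper, not a real API:

#include <assert.h>
#include <string.h>

/* Hypothetical constructor: the input is NUL-terminated, so the result
   cannot contain an embedded NUL and the cache is known without any scan. */
static str_object *
from_c_string(const char *cstr)
{
    /* Assumed: alloc_str(n) allocates n+1 bytes and sets utf8_length = n. */
    str_object *s = alloc_str(strlen(cstr));
    if (s == NULL) {
        return NULL;
    }
    memcpy(s->utf8, cstr, s->utf8_length + 1);
    s->embed_null = EMBED_NULL_NO;  /* filled for free */
    return s;
}

/* Consistency-check style assertion: 3 is never a valid cache value. */
#define ASSERT_STR_CONSISTENT(s)  assert((s)->embed_null != 3)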
I am not sure. Every use of the result of PyUnicode_AsUTF8() is O(n) anyway. But it increases the implementation complexity and makes the code more fragile. What happens when you modify the string in-place, for example by calling PyUnicode_Resize()? This flag could also be used in PyUnicode_AsWideCharString().
Only the first call. Then it's an O(1) operation when strlen() is omitted, no?
What do you do with the result of PyUnicode_AsUTF8()?
Currently, cached PyUnicode_AsUTF8() has a complexity of O(n). With this change, cached PyUnicode_AsUTF8() has a complexity of O(1). On a 1 GiB string, it takes 3 ns instead of 64 ms (23,787,089x faster ;-)). I wrote a benchmark; in short, it's a benchmark on strlen(str), since the UTF-8 encoded string itself is cached. Benchmark on cached PyUnicode_AsUTF8() comparing this PR to the main branch:
The benchmark warmup fills the PyUnicode_AsUTF8() cache, so the benchmark measures cached PyUnicode_AsUTF8(). Before, on a string of 1 GiB, strlen() took 64 ms. With this change, cached PyUnicode_AsUTF8() always takes 3 ns; it has a complexity of O(1) and doesn't depend on the string length.

bench.py:

import pyperf
from _testinternalcapi import bench_asutf8
import functools
runner = pyperf.Runner()
for text, size in (
('0', 0),
('1', 1),
('10', 10),
('10^3', 10**3),
('10^6', 10**6),
('1 MiB', 1024 ** 2),
('1 GiB', 1024 ** 3),
):
    runner.bench_time_func(f'asutf8 {text}', functools.partial(bench_asutf8, 'x' * size))

Patch to add the benchmark function:

diff --git a/Modules/_testinternalcapi.c b/Modules/_testinternalcapi.c
index a71e7e1dcc..6ccc89f28a 100644
--- a/Modules/_testinternalcapi.c
+++ b/Modules/_testinternalcapi.c
@@ -1639,6 +1639,28 @@ perf_trampoline_set_persist_after_fork(PyObject *self, PyObject *args)
}
+static PyObject *
+bench_asutf8(PyObject *self, PyObject *args)
+{
+ PyObject *str;
+ Py_ssize_t loops;
+ if (!PyArg_ParseTuple(args, "O!n", &PyUnicode_Type, &str, &loops)) {
+ return NULL;
+ }
+
+ _PyTime_t t = _PyTime_GetPerfCounter();
+ for (Py_ssize_t i=0; i < loops; i++) {
+ if (PyUnicode_AsUTF8(str) == NULL) {
+ return NULL;
+ }
+ }
+
+ _PyTime_t dt = _PyTime_GetPerfCounter() - t;
+
+ return PyFloat_FromDouble(_PyTime_AsSecondsDouble(dt));
+}
+
+
static PyMethodDef module_functions[] = {
{"get_configs", get_configs, METH_NOARGS},
{"get_recursion_depth", get_recursion_depth, METH_NOARGS},
@@ -1701,6 +1723,7 @@ static PyMethodDef module_functions[] = {
{"restore_crossinterp_data", restore_crossinterp_data, METH_VARARGS},
_TESTINTERNALCAPI_WRITE_UNRAISABLE_EXC_METHODDEF
_TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF
+ {"bench_asutf8", bench_asutf8, METH_VARARGS},
{NULL, NULL} /* sentinel */
};
Would you mind elaborating on how/when the second (cached) call to PyUnicode_AsUTF8() on the same Python str object has a complexity of O(n) with my change? I'm not sure that we are talking about the same thing.
Add PyASCIIObject.state.embed_null member to Python str objects. It is used as a cache by PyUnicode_AsUTF8() to only check once if a string contains a null character. Strings created by PyUnicode_FromString() initialize *embed_null*, since the string cannot contain a null character. Global static strings now also initialize the *embed_null* member. The chr(0) singleton ("\0" string) is the only static string which contains a null character.
Fix unicode_subtype_new
I updated my PR to clear the cache when unicode_resize() is called. So, indirectly, PyUnicode_Append() now clears the cache as well. I added an assertion to make sure that this is the case. In general, Python gives little or no guarantee about when a Python str object is modified. Functions that modify a string just check unicode_modifiable(), which only does basic checks. For example, functions like PyUnicode_Fill() and PyUnicode_CopyCharacters() don't check if the UTF-8 encoded string cache is already filled; only PyUnicode_Resize() clears the UTF-8 cache. In general, the PyUnicode API assumes that these functions are only used while creating a string, and that the string is no longer modified afterwards. (I would prefer to remove all C APIs that allow modifying a string, but that would be a backward-incompatible change and require more work.)
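For illustration, here is a hypothetical sketch of the invalidation described above: any in-place mutation of the buffer must reset the cache to "unknown", otherwise a later lookup could return stale information. resize_str() continues the assumed sketch and is not the PR's actual unicode_resize() code:

#include <stdlib.h>

static int
resize_str(str_object *s, size_t new_length)
{
    char *buf = realloc(s->utf8, new_length + 1);
    if (buf == NULL) {
        return -1;
    }
    s->utf8 = buf;
    s->utf8_length = new_length;
    /* The content may change after the resize: drop the cached answer. */
    s->embed_null = EMBED_NULL_UNKNOWN;
    return 0;
}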
I modified PyUnicode_AsWideCharString() in the first version of my change, but I reverted that change before creating this PR because of the Solaris code path (HAVE_NON_UNICODE_WCHAR_T_REPRESENTATION). Maybe I can propose a follow-up PR for PyUnicode_AsWideCharString() once this PR is merged.
Suggestion by Serhiy
@serhiy-storchaka: I addressed your suggestions. Would you mind reviewing the updated PR?
The benefit of the cache is minor (1 ns) for strings of about 10 characters. It starts to become interesting for strings of at least 1,000 characters.
Do you have any microbenchmarks that do not call PyUnicode_AsUTF8() directly in a loop? BTW, I just removed 2 calls of PyUnicode_AsUTF8().
This is nice, but how is it related to this change? There are use cases where you cannot avoid PyUnicode_AsUTF8(). Moreover, even when it's possible, it's not always trivial to avoid an encode+decode roundtrip, especially if you are calling third-party code.
UPDATE: Oops, I made a mistake the first time I ran my benchmark: I forgot to rebuild Python, so I measured the same binary twice. I reran the benchmark, and now I see a more significant difference with and without the change! Here is a different benchmark, on _codecs.lookup_error(). In short:
Result: the embed_null cache becomes very interesting starting at 10^6 characters (1.04x faster: 561 us => 542 us: saves 19 us). But the effect is visible starting at 0 characters (397 ns => 389 ns: saves 8 ns) :-) IMO this benchmark is closer to a "real-world" benchmark than my first micro-benchmark on PyUnicode_AsUTF8(), which was focused on the worst case. Benchmark run with CPU isolation on Linux.
Patch:

diff --git a/Modules/_testinternalcapi.c b/Modules/_testinternalcapi.c
index a71e7e1dcc..692d008cdf 100644
--- a/Modules/_testinternalcapi.c
+++ b/Modules/_testinternalcapi.c
@@ -1639,6 +1639,44 @@ perf_trampoline_set_persist_after_fork(PyObject *self, PyObject *args)
}
+static PyObject *
+bench_asutf8(PyObject *self, PyObject *args)
+{
+ PyObject *str;
+ Py_ssize_t loops;
+ if (!PyArg_ParseTuple(args, "O!n", &PyUnicode_Type, &str, &loops)) {
+ return NULL;
+ }
+
+ PyObject *codecs = PyImport_ImportModule("_codecs");
+ if (codecs == NULL) {
+ return NULL;
+ }
+ PyObject *lookup_error = PyObject_GetAttrString(codecs, "lookup_error");
+ Py_DECREF(codecs);
+ if (lookup_error == NULL) {
+ return NULL;
+ }
+
+ _PyTime_t t = _PyTime_GetPerfCounter();
+ for (Py_ssize_t i=0; i < loops; i++) {
+ PyObject *error = PyObject_CallOneArg(lookup_error, str);
+ if (error != NULL) {
+ Py_DECREF(error);
+ }
+ else {
+ PyErr_Clear();
+ }
+ }
+
+ _PyTime_t dt = _PyTime_GetPerfCounter() - t;
+
+ Py_DECREF(lookup_error);
+
+ return PyFloat_FromDouble(_PyTime_AsSecondsDouble(dt));
+}
+
+
static PyMethodDef module_functions[] = {
{"get_configs", get_configs, METH_NOARGS},
{"get_recursion_depth", get_recursion_depth, METH_NOARGS},
@@ -1701,6 +1739,7 @@ static PyMethodDef module_functions[] = {
{"restore_crossinterp_data", restore_crossinterp_data, METH_VARARGS},
_TESTINTERNALCAPI_WRITE_UNRAISABLE_EXC_METHODDEF
_TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF
+ {"bench_asutf8", bench_asutf8, METH_VARARGS},
{NULL, NULL} /* sentinel */
};
Script:

import pyperf
from _testinternalcapi import bench_asutf8
import functools
runner = pyperf.Runner()
for text, size in (
('0', 0),
('1', 1),
('10', 10),
('10^3', 10**3),
('10^6', 10**6),
('1 MiB', 1024 ** 2),
('1 GiB', 1024 ** 3),
):
    runner.bench_time_func(f'_codecs.lookup_error size={text}', functools.partial(bench_asutf8, 'x' * size))
I reran the PyUnicode_AsUTF8() micro-benchmark on the same machine, but with CPU isolation:
Aha, I still see the major difference without/with the cache for a 1 GiB string: 76.2 ms becomes 5.36 ns (14,220,978x faster).
When PyUnicode_AsUTF8() doesn't need to encode the string to UTF-8 because the data is already available (all ASCII strings!) and so "is O(1)", it's silly that PyUnicode_AsUTF8() wastes CPU cycles calling strlen() and so becomes O(n) again in practice, no?
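For context, the O(n) work being discussed here boils down to a length comparison, roughly like the following simplified sketch (the actual CPython code differs in details): for a compact ASCII string the UTF-8 data already exists, so detecting an embedded NUL is the only remaining linear work, and that is exactly what the cache skips.

#include <string.h>

/* Sketch: strlen() stops at the first NUL byte, so a result shorter than
   the known length means the string embeds a NUL. */
static int
has_embedded_nul(const char *utf8, size_t expected_length)
{
    return strlen(utf8) != expected_length;
}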
I'm sorry, I made a mistake when running my benchmark on _codecs.lookup_error(). Reminder: I made PyUnicode_AsUTF8() slower in the main branch by adding strlen(). In Python 3.12, strlen() is not called: the function doesn't reject null characters. This change just reduces the slowdown compared to Python 3.12.
When a user gets a UTF-8 C string from PyUnicode_AsUTF8(), processing it is usually O(n) anyway. Is there any valid O(1) use case of PyUnicode_AsUTF8()?
The elephant in the room: why isn't this deprecated with a deprecation warning?
I don't know.
Inada-san understands me.
It was made for correctness. If the
Even for ASCII strings you need O(n) to process them.
It shows that in the best possible example the difference is only a few percent. It's unlikely you can get better results with another real-world example. And a megabyte-long error handler name is not very realistic.
Well, it saves 8 ns on every call of _codecs.lookup_error().
Sure, I understand that sometimes the code processing the PyUnicode_AsUTF8() result is way slower than strlen(), and so the cache is not useful. There are various ways to "use a string". Example from ctypes:

switch (PyUnicode_AsUTF8(stgd->proto)[0]) {
case 'z': /* c_char_p */
case 'Z': /* c_wchar_p */

and:

fmt = PyUnicode_AsUTF8(dict->proto);
fd = _ctypes_get_fielddesc(fmt);
struct fielddesc *
_ctypes_get_fielddesc(const char *fmt)
{
...
for (; table->code; ++table) {
if (table->code == fmt[0])
return table;
}
...
}

A common pattern is to compare the PyUnicode_AsUTF8() result against multiple strings using strcmp(). Example from _elementtree:

const char *event_name = NULL;
if (PyUnicode_Check(event_name_obj)) {
event_name = PyUnicode_AsUTF8(event_name_obj);
} else if (PyBytes_Check(event_name_obj)) {
event_name = PyBytes_AS_STRING(event_name_obj);
}
...
if (strcmp(event_name, "start") == 0) {
...
} else if (strcmp(event_name, "end") == 0) {
...
} else if (strcmp(event_name, "start-ns") == 0) {
...
} else if (strcmp(event_name, "end-ns") == 0) {
...
...

Another benchmark on strcmp(), using _Py_GetErrorHandler(). Result: avoiding strlen() in PyUnicode_AsUTF8() thanks to the cache of this PR saves 1.2 ns: 1.11x faster.

_Py_GetErrorHandler():

_Py_error_handler
_Py_GetErrorHandler(const char *errors)
{
if (errors == NULL || strcmp(errors, "strict") == 0) {
return _Py_ERROR_STRICT;
}
if (strcmp(errors, "surrogateescape") == 0) {
return _Py_ERROR_SURROGATEESCAPE;
}
if (strcmp(errors, "replace") == 0) {
return _Py_ERROR_REPLACE;
}
...
}

Patch:

diff --git a/Modules/_testinternalcapi.c b/Modules/_testinternalcapi.c
index a71e7e1dcc..27a9004a5c 100644
--- a/Modules/_testinternalcapi.c
+++ b/Modules/_testinternalcapi.c
@@ -1639,6 +1639,40 @@ perf_trampoline_set_persist_after_fork(PyObject *self, PyObject *args)
}
+static PyObject *
+bench_asutf8(PyObject *self, PyObject *args)
+{
+ PyObject *str;
+ Py_ssize_t loops;
+ if (!PyArg_ParseTuple(args, "O!n", &PyUnicode_Type, &str, &loops)) {
+ return NULL;
+ }
+ Py_ssize_t found = 0;
+
+ _PyTime_t t = _PyTime_GetPerfCounter();
+ for (Py_ssize_t i=0; i < loops; i++) {
+ const char *utf8 = PyUnicode_AsUTF8(str);
+ if (utf8 == NULL) {
+ return NULL;
+ }
+ _Py_error_handler error_handler = _Py_GetErrorHandler(utf8);
+ if (error_handler == _Py_ERROR_STRICT) {
+ found++;
+ }
+ }
+
+ _PyTime_t dt = _PyTime_GetPerfCounter() - t;
+
+ // Cannot happen; just to make sure that the compiler doesn't remove
+ // the _Py_GetErrorHandler() call.
+ if (found > loops) {
+ PyErr_NoMemory();
+ }
+
+ return PyFloat_FromDouble(_PyTime_AsSecondsDouble(dt));
+}
+
+
static PyMethodDef module_functions[] = {
{"get_configs", get_configs, METH_NOARGS},
{"get_recursion_depth", get_recursion_depth, METH_NOARGS},
@@ -1701,6 +1735,7 @@ static PyMethodDef module_functions[] = {
{"restore_crossinterp_data", restore_crossinterp_data, METH_VARARGS},
_TESTINTERNALCAPI_WRITE_UNRAISABLE_EXC_METHODDEF
_TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF
+ {"bench_asutf8", bench_asutf8, METH_VARARGS},
{NULL, NULL} /* sentinel */
};

Script:

import pyperf
from _testinternalcapi import bench_asutf8
import functools
runner = pyperf.Runner()
runner.bench_time_func(f'asutf8', functools.partial(bench_asutf8, 'strict'))
I created this PR to reply to @encukou's concern about the performance overhead of adding strlen() to PyUnicode_AsUTF8(): #111089. If we consider that the overhead is not significant and that this cache has too many drawbacks compared to its advantages, I'm fine with not adding it.
You should maybe ask the author who added the comment. Maybe the issue was not important enough to "implement" a deprecation. But well, PyUnicode_AsUTF8() now checks for embedded null characters, so it's no longer needed to deprecate it. It seems like the majority of core devs who reviewed my recent changes to PyUnicode_AsUTF8() are fine with the new behavior. It also allowed using PyUnicode_AsUTF8() in places where PyUnicode_AsUTF8AndSize() was used before to check for embedded null characters.
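As an illustration of that last point, here is a hedged sketch of the before/after pattern, not a verbatim CPython excerpt: previously the embedded-NUL check had to be done by hand with PyUnicode_AsUTF8AndSize(), while the new PyUnicode_AsUTF8() behavior does it for you.

#include <Python.h>
#include <string.h>

/* Before: reject embedded null characters manually. */
static const char *
get_cstring_old(PyObject *obj)
{
    Py_ssize_t size;
    const char *s = PyUnicode_AsUTF8AndSize(obj, &size);
    if (s == NULL) {
        return NULL;
    }
    if (strlen(s) != (size_t)size) {
        PyErr_SetString(PyExc_ValueError, "embedded null character");
        return NULL;
    }
    return s;
}

/* After: PyUnicode_AsUTF8() itself fails on embedded null characters. */
static const char *
get_cstring_new(PyObject *obj)
{
    return PyUnicode_AsUTF8(obj);
}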
I approve because it looks technically correct to me.
I don't know if these changes should be made. The benefit is small in most real examples.
I don't know why I have to point this out, but the cache only amortises the cost when repeatedly calling PyUnicode_AsUTF8() on the same string.
Strings created by PyUnicode_FromString() and "static strings" are created with the cache initialized, so PyUnicode_AsUTF8() never calls strlen() on these strings. As I wrote before, PyUnicode_FromString() is one of the most common ways to create a Python str object in the C API.
* unicode_char()
* PyUnicode_FromWideChar(str, -1)
* _PyUnicode_Copy()
I tried to write the smallest PR that just adds the member, but there is room for enhancements: set the embed_null cache on newly created strings in more cases, with no performance overhead (without having to call strlen()). In short, all operations modifying an existing string can set embed_null if the embed_null of the modified string is known (is 0 or 1). Tell me if you would prefer that I make this PR more complete and attempt to set embed_null in every single case.
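A hedged sketch of that propagation idea, continuing the hypothetical example above (set_concat_cache() is an assumed helper, not part of the PR): when both inputs of an operation have a known cache value, the result's cache can be set without any scan.

static void
set_concat_cache(str_object *result, const str_object *a, const str_object *b)
{
    if (a->embed_null != EMBED_NULL_UNKNOWN
        && b->embed_null != EMBED_NULL_UNKNOWN)
    {
        /* A concatenation embeds a NUL iff one of its inputs does. */
        result->embed_null =
            (a->embed_null == EMBED_NULL_YES || b->embed_null == EMBED_NULL_YES)
            ? EMBED_NULL_YES : EMBED_NULL_NO;
    }
    else {
        result->embed_null = EMBED_NULL_UNKNOWN;
    }
}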
@@ -2232,6 +2240,7 @@ _PyUnicode_Copy(PyObject *unicode)
     memcpy(PyUnicode_DATA(copy), PyUnicode_DATA(unicode),
            length * PyUnicode_KIND(unicode));
+    _PyUnicode_STATE(copy).embed_null = _PyUnicode_STATE(unicode).embed_null;
_PyUnicode_Copy() makes a modifiable unicode object. It is legal to embed a null character in it after creation, or to replace an embedded null character with a non-null character. In particular, it creates a new copy even from Latin-1 character singletons.
Well, technically, any string can be modified at any time via the C API. People "should not do that", but since it's possible, I'm not sure it's safe to assume that people will not mutate a string long after its creation, that is, after the cache is initialized.
The trend of #111089 is more about reverting the PyUnicode_AsUTF8() change. While the design of this cache is appealing to me, it seems like there are multiple issues.
If tomorrow the public C API no longer allows mutating a string, we can reconsider the cache idea. But for now, it seems safer not to take the risk of adding a cache which can introduce bugs. Correctness matters more than performance here. I close my PR. Thanks everybody for looking into my cache ;-) It was a constructive discussion.