gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL
* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.
vstinner committed Oct 20, 2023
1 parent b60f058 commit 4e0b3d3
Showing 8 changed files with 49 additions and 25 deletions.
8 changes: 8 additions & 0 deletions Doc/c-api/unicode.rst
@@ -992,11 +992,19 @@ These are the UTF-8 codec APIs:
    As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
+   Raise an exception if the *unicode* string contains embedded null
+   characters. To accept embedded null characters and truncate on purpose
+   at the first null byte, ``PyUnicode_AsUTF8AndSize(unicode, NULL)`` can be
+   used instead.
    .. versionadded:: 3.3
    .. versionchanged:: 3.7
       The return type is now ``const char *`` rather than ``char *``.
+   .. versionchanged:: 3.13
+      Raise an exception if the string contains embedded null characters.
 UTF-32 Codecs
 """""""""""""
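The workaround named in the documentation above, ``PyUnicode_AsUTF8AndSize(unicode, NULL)``, can be tried from Python through ctypes. This is a minimal sketch, assuming a CPython interpreter whose ``ctypes.pythonapi`` exposes the function; the helper name ``as_utf8_truncated`` is hypothetical and not part of the commit.

```python
import ctypes

# Bind the C API function through ctypes (assumes CPython).
_as_utf8_and_size = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_as_utf8_and_size.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
_as_utf8_and_size.restype = ctypes.c_char_p  # interpret the result as a C string


def as_utf8_truncated(text):
    """Hypothetical helper: call PyUnicode_AsUTF8AndSize(text, NULL).

    Passing None sends a NULL size pointer. ctypes converts the returned
    char* to bytes by reading up to the first null byte, which is what a
    C caller relying on strlen() would see.
    """
    return _as_utf8_and_size(text, None)
```

Calling ``as_utf8_truncated("abc\0def")`` yields only the bytes before the first null, which is the deliberate truncation the documentation describes.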
6 changes: 6 additions & 0 deletions Doc/whatsnew/3.13.rst
@@ -1109,6 +1109,12 @@ Porting to Python 3.13
   are now undefined by ``<Python.h>``.
   (Contributed by Victor Stinner in :gh:`85283`.)

+* The :c:func:`PyUnicode_AsUTF8` function now raises an exception if the string
+  contains embedded null characters. To accept embedded null characters and
+  truncate on purpose at the first null byte,
+  ``PyUnicode_AsUTF8AndSize(unicode, NULL)`` can be used instead.
+  (Contributed by Victor Stinner in :gh:`111089`.)

 Deprecated
 ----------

20 changes: 10 additions & 10 deletions Include/cpython/unicodeobject.h
@@ -442,18 +442,18 @@ PyAPI_FUNC(PyObject*) PyUnicode_FromKindAndData(

 /* --- Manage the default encoding ---------------------------------------- */

-/* Returns a pointer to the default encoding (UTF-8) of the
-   Unicode object unicode.
-   Like PyUnicode_AsUTF8AndSize(), this also caches the UTF-8 representation
-   in the unicodeobject.
-   Use of this API is DEPRECATED since no size information can be
-   extracted from the returned data.
-*/

+// Returns a pointer to the default encoding (UTF-8) of the
+// Unicode object unicode.
+//
+// Raise an exception if the string contains embedded null characters.
+// Use PyUnicode_AsUTF8AndSize() to accept embedded null characters.
+//
+// This function caches the UTF-8 encoded string in the Unicode object
+// and subsequent calls will return the same string. The memory is released
+// when the Unicode object is deallocated.
 PyAPI_FUNC(const char *) PyUnicode_AsUTF8(PyObject *unicode);


 /* === Characters Type APIs =============================================== */

 /* These should not be used directly. Use the Py_UNICODE_IS* and
20 changes: 9 additions & 11 deletions Include/unicodeobject.h
@@ -443,17 +443,15 @@ PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String(
     PyObject *unicode /* Unicode object */
     );

-/* Returns a pointer to the default encoding (UTF-8) of the
-   Unicode object unicode and the size of the encoded representation
-   in bytes stored in *size.
-   In case of an error, no *size is set.
-   This function caches the UTF-8 encoded string in the unicodeobject
-   and subsequent calls will return the same string. The memory is released
-   when the unicodeobject is deallocated.
-*/

+// Returns a pointer to the default encoding (UTF-8) of the
+// Unicode object unicode and the size of the encoded representation
+// in bytes stored in `*size` (if size is not NULL).
+//
+// On error, `*size` is set to 0 (if size is not NULL).
+//
+// This function caches the UTF-8 encoded string in the Unicode object
+// and subsequent calls will return the same string. The memory is released
+// when the Unicode object is deallocated.
 #if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030A0000
 PyAPI_FUNC(const char *) PyUnicode_AsUTF8AndSize(
     PyObject *unicode,
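The header comment above makes two claims that can be observed from Python: the size out-parameter reports the full encoded length (bytes after an embedded null included), and repeated calls on the same object return the same cached buffer. The following is a rough sketch under the assumption of a CPython interpreter with ``ctypes.pythonapi``; ``utf8_pointer_and_bytes`` is a hypothetical helper name.

```python
import ctypes

# Bind the C API function through ctypes (assumes CPython).
_as_utf8_and_size = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_as_utf8_and_size.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
_as_utf8_and_size.restype = ctypes.c_void_p  # keep the raw buffer address


def utf8_pointer_and_bytes(text):
    """Hypothetical helper: return (buffer address, full UTF-8 bytes)."""
    size = ctypes.c_ssize_t()
    address = _as_utf8_and_size(text, ctypes.byref(size))
    # string_at(address, n) copies exactly n bytes, embedded nulls included.
    return address, ctypes.string_at(address, size.value)


text = "abc\0d\u00e9f"  # contains an embedded null and a non-ASCII character
first_address, data = utf8_pointer_and_bytes(text)
second_address, _ = utf8_pointer_and_bytes(text)
```

Here ``data`` holds the whole encoded string because the copy is driven by the reported size and not by ``strlen()``, and the two addresses are equal because the encoded buffer is cached on the string object for as long as ``text`` stays alive.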
5 changes: 4 additions & 1 deletion Lib/test/test_capi/test_unicode.py
@@ -882,7 +882,10 @@ def test_asutf8(self):
         self.assertEqual(unicode_asutf8('abc', 4), b'abc\0')
         self.assertEqual(unicode_asutf8('абв', 7), b'\xd0\xb0\xd0\xb1\xd0\xb2\0')
         self.assertEqual(unicode_asutf8('\U0001f600', 5), b'\xf0\x9f\x98\x80\0')
-        self.assertEqual(unicode_asutf8('abc\0def', 8), b'abc\0def\0')

+        # disallow embedded null characters
+        self.assertRaises(ValueError, unicode_asutf8, 'abc\0', 0)
+        self.assertRaises(ValueError, unicode_asutf8, 'abc\0def', 0)

         self.assertRaises(UnicodeEncodeError, unicode_asutf8, '\ud8ff', 0)
         self.assertRaises(TypeError, unicode_asutf8, b'abc', 0)
@@ -0,0 +1,2 @@
+The :c:func:`PyUnicode_AsUTF8` function now raises an exception if the
+string contains embedded null characters. Patch by Victor Stinner.
5 changes: 3 additions & 2 deletions Objects/typeobject.c
@@ -3499,13 +3499,14 @@ type_new_set_doc(PyTypeObject *type)
         return 0;
     }

-    const char *doc_str = PyUnicode_AsUTF8(doc);
+    Py_ssize_t doc_size;
+    const char *doc_str = PyUnicode_AsUTF8AndSize(doc, &doc_size);
     if (doc_str == NULL) {
         return -1;
     }

     // Silently truncate the docstring if it contains a null byte
-    Py_ssize_t size = strlen(doc_str) + 1;
+    Py_ssize_t size = doc_size + 1;
     char *tp_doc = (char *)PyObject_Malloc(size);
     if (tp_doc == NULL) {
         PyErr_NoMemory();
8 changes: 7 additions & 1 deletion Objects/unicodeobject.c
@@ -3837,7 +3837,13 @@ PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *psize)
 const char *
 PyUnicode_AsUTF8(PyObject *unicode)
 {
-    return PyUnicode_AsUTF8AndSize(unicode, NULL);
+    Py_ssize_t size;
+    const char *utf8 = PyUnicode_AsUTF8AndSize(unicode, &size);
+    if (utf8 != NULL && strlen(utf8) != size) {
+        PyErr_SetString(PyExc_ValueError, "embedded null character");
+        return NULL;
+    }
+    return utf8;
 }

 /*
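The check added to ``PyUnicode_AsUTF8()`` compares the C-string length with the encoded size reported by ``PyUnicode_AsUTF8AndSize()``; the two differ exactly when the string contains an embedded null. The sketch below mirrors that comparison from Python so it does not depend on which interpreter version is running. It assumes CPython with ``ctypes.pythonapi`` available, and ``as_utf8_strict`` is a hypothetical name, not an API introduced by this commit.

```python
import ctypes

# Bind the C API function through ctypes (assumes CPython).
_as_utf8_and_size = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_as_utf8_and_size.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
_as_utf8_and_size.restype = ctypes.c_void_p  # keep the raw buffer address


def as_utf8_strict(text):
    """Hypothetical helper mirroring the strlen(utf8) != size check."""
    size = ctypes.c_ssize_t()
    address = _as_utf8_and_size(text, ctypes.byref(size))
    # string_at(address) without a length stops at the first null byte,
    # so len(c_string) plays the role of strlen(utf8).
    c_string = ctypes.string_at(address)
    if len(c_string) != size.value:
        raise ValueError("embedded null character")
    return c_string
```

With this helper, ``as_utf8_strict("abc")`` returns the encoded bytes, while ``"abc\0"`` and ``"abc\0def"`` both raise ``ValueError``, matching the two cases added to ``test_asutf8``.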