
changeset:   96042:863f7c57081b
parent:      96040:1e1bb3eb6f93
parent:      96041:99d2f83290c0
user:        R David Murray <rdmurray@bitdance.com>
date:        Wed May 13 20:32:19 2015 -0400
files:       Doc/c-api/unicode.rst
description:
Merge: #23088: Clarify null termination of bytes and strings in C API.


diff -r 1e1bb3eb6f93 -r 863f7c57081b Doc/c-api/bytearray.rst
--- a/Doc/c-api/bytearray.rst Wed May 13 14:39:35 2015 -0700
+++ b/Doc/c-api/bytearray.rst Wed May 13 20:32:19 2015 -0400
@@ -64,7 +64,8 @@
 .. c:function:: char* PyByteArray_AsString(PyObject *bytearray)
 
    Return the contents of *bytearray* as a char array after checking for a
-   *NULL* pointer.
+   *NULL* pointer. The returned array always has an extra
+   null byte appended.
 
 
 .. c:function:: int PyByteArray_Resize(PyObject *bytearray, Py_ssize_t len)
diff -r 1e1bb3eb6f93 -r 863f7c57081b Doc/c-api/bytes.rst
--- a/Doc/c-api/bytes.rst Wed May 13 14:39:35 2015 -0700
+++ b/Doc/c-api/bytes.rst Wed May 13 20:32:19 2015 -0400
@@ -69,8 +69,8 @@
    +===================+===============+================================+
    | :attr:`%%`        | *n/a*         | The literal % character.       |
    +-------------------+---------------+--------------------------------+
-   | :attr:`%c`        | int           | A single character,            |
-   |                   |               | represented as an C int.       |
+   | :attr:`%c`        | int           | A single byte,                 |
+   |                   |               | represented as a C int.        |
    +-------------------+---------------+--------------------------------+
    | :attr:`%d`        | int           | Exactly equivalent to          |
    |                   |               | ``printf("%d")``.              |
@@ -109,7 +109,7 @@
    +-------------------+---------------+--------------------------------+
 
    An unrecognized format character causes all the rest of the format string to be
-   copied as-is to the result string, and any extra arguments discarded.
+   copied as-is to the result object, and any extra arguments discarded.
 
 
 .. c:function:: PyObject* PyBytes_FromFormatV(const char *format, va_list vargs)
@@ -136,11 +136,13 @@
 
 .. c:function:: char* PyBytes_AsString(PyObject *o)
 
-   Return a NUL-terminated representation of the contents of *o*. The pointer
-   refers to the internal buffer of *o*, not a copy. The data must not be
-   modified in any way, unless the string was just created using
+   Return a pointer to the contents of *o*. The pointer
+   refers to the internal buffer of *o*, which consists of ``len(o) + 1``
+   bytes. The last byte in the buffer is always null, regardless of
+   whether there are any other null bytes. The data must not be
+   modified in any way, unless the object was just created using
    ``PyBytes_FromStringAndSize(NULL, size)``. It must not be deallocated. If
-   *o* is not a string object at all, :c:func:`PyBytes_AsString` returns *NULL*
+   *o* is not a bytes object at all, :c:func:`PyBytes_AsString` returns *NULL*
    and raises :exc:`TypeError`.
 
 
@@ -151,16 +153,18 @@
 
 .. c:function:: int PyBytes_AsStringAndSize(PyObject *obj, char **buffer, Py_ssize_t *length)
 
-   Return a NUL-terminated representation of the contents of the object *obj*
+   Return the null-terminated contents of the object *obj*
    through the output variables *buffer* and *length*.
 
-   If *length* is *NULL*, the resulting buffer may not contain NUL characters;
+   If *length* is *NULL*, the bytes object
+   may not contain embedded null bytes;
    if it does, the function returns ``-1`` and a :exc:`TypeError` is raised.
 
-   The buffer refers to an internal string buffer of *obj*, not a copy. The data
-   must not be modified in any way, unless the string was just created using
+   The buffer refers to an internal buffer of *obj*, which includes an
+   additional null byte at the end (not counted in *length*). The data
+   must not be modified in any way, unless the object was just created using
    ``PyBytes_FromStringAndSize(NULL, size)``. It must not be deallocated. If
-   *string* is not a string object at all, :c:func:`PyBytes_AsStringAndSize`
+   *obj* is not a bytes object at all, :c:func:`PyBytes_AsStringAndSize`
    returns ``-1`` and raises :exc:`TypeError`.
 
 
@@ -168,14 +172,14 @@
 
    Create a new bytes object in *\*bytes* containing the contents of *newpart*
    appended to *bytes*; the caller will own the new reference. The reference to
-   the old value of *bytes* will be stolen. If the new string cannot be
+   the old value of *bytes* will be stolen. If the new object cannot be
    created, the old reference to *bytes* will still be discarded and the value
    of *\*bytes* will be set to *NULL*; the appropriate exception will be set.
 
 
 .. c:function:: void PyBytes_ConcatAndDel(PyObject **bytes, PyObject *newpart)
 
-   Create a new string object in *\*bytes* containing the contents of *newpart*
+   Create a new bytes object in *\*bytes* containing the contents of *newpart*
    appended to *bytes*. This version decrements the reference count of
    *newpart*.
 
diff -r 1e1bb3eb6f93 -r 863f7c57081b Doc/c-api/unicode.rst
--- a/Doc/c-api/unicode.rst Wed May 13 14:39:35 2015 -0700
+++ b/Doc/c-api/unicode.rst Wed May 13 20:32:19 2015 -0400
@@ -227,7 +227,10 @@
                 const char* PyUnicode_AS_DATA(PyObject *o)
 
    Return a pointer to a :c:type:`Py_UNICODE` representation of the object. The
-   ``AS_DATA`` form casts the pointer to :c:type:`const char *`. *o* has to be
+   returned buffer is always terminated with an extra null code point. It
+   may also contain embedded null code points, which would cause the string
+   to be truncated when used in most C functions. The ``AS_DATA`` form
+   casts the pointer to :c:type:`const char *`. The *o* argument has to be
    a Unicode object (not checked).
 
    .. versionchanged:: 3.3
@@ -650,7 +653,8 @@
 
    Copy the string *u* into a new UCS4 buffer that is allocated using
    :c:func:`PyMem_Malloc`. If this fails, *NULL* is returned with a
-   :exc:`MemoryError` set.
+   :exc:`MemoryError` set. The returned buffer always has an extra
+   null code point appended.
 
    .. versionadded:: 3.3
 
@@ -689,8 +693,9 @@
    Return a read-only pointer to the Unicode object's internal
    :c:type:`Py_UNICODE` buffer, or *NULL* on error. This will create the
    :c:type:`Py_UNICODE*` representation of the object if it is not yet
-   available. Note that the resulting :c:type:`Py_UNICODE` string may contain
-   embedded null characters, which would cause the string to be truncated when
+   available. The buffer is always terminated with an extra null code point.
+   Note that the resulting :c:type:`Py_UNICODE` string may also contain
+   embedded null code points, which would cause the string to be truncated when
    used in most C functions.
 
    Please migrate to using :c:func:`PyUnicode_AsUCS4`,
@@ -708,8 +713,9 @@
 .. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)
 
    Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:func:`Py_UNICODE`
-   array length in *size*. Note that the resulting :c:type:`Py_UNICODE*` string
-   may contain embedded null characters, which would cause the string to be
+   array length (excluding the extra null terminator) in *size*.
+   Note that the resulting :c:type:`Py_UNICODE*` string
+   may contain embedded null code points, which would cause the string to be
    truncated when used in most C functions.
 
    .. versionadded:: 3.3
@@ -717,11 +723,11 @@
 
 .. c:function:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)
 
-   Create a copy of a Unicode string ending with a nul character. Return *NULL*
+   Create a copy of a Unicode string ending with a null code point. Return *NULL*
    and raise a :exc:`MemoryError` exception on memory allocation failure,
    otherwise return a new allocated buffer (use :c:func:`PyMem_Free` to free
    the buffer). Note that the resulting :c:type:`Py_UNICODE*` string may
-   contain embedded null characters, which would cause the string to be
+   contain embedded null code points, which would cause the string to be
    truncated when used in most C functions.
 
    .. versionadded:: 3.2
@@ -902,10 +908,10 @@
 
    Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*. At most
    *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
-   0-termination character). Return the number of :c:type:`wchar_t` characters
+   null termination character). Return the number of :c:type:`wchar_t` characters
    copied or -1 in case of an error. Note that the resulting :c:type:`wchar_t*`
-   string may or may not be 0-terminated. It is the responsibility of the caller
-   to make sure that the :c:type:`wchar_t*` string is 0-terminated in case this is
+   string may or may not be null-terminated. It is the responsibility of the caller
+   to make sure that the :c:type:`wchar_t*` string is null-terminated in case this is
    required by the application. Also, note that the :c:type:`wchar_t*` string
    might contain null characters, which would cause the string to be truncated
    when used with most C functions.
@@ -914,8 +920,8 @@
 .. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
 
    Convert the Unicode object to a wide character string. The output string
-   always ends with a nul character. If *size* is not *NULL*, write the number
-   of wide characters (excluding the trailing 0-termination character) into
+   always ends with a null character. If *size* is not *NULL*, write the number
+   of wide characters (excluding the trailing null termination character) into
    *\*size*.
 
    Returns a buffer allocated by :c:func:`PyMem_Alloc` (use
@@ -1045,9 +1051,11 @@
 
 .. c:function:: char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
 
-   Return a pointer to the default encoding (UTF-8) of the Unicode object, and
-   store the size of the encoded representation (in bytes) in *size*. *size*
-   can be *NULL*, in this case no size will be stored.
+   Return a pointer to the UTF-8 encoding of the Unicode object, and
+   store the size of the encoded representation (in bytes) in *size*. The
+   *size* argument can be *NULL*; in this case no size will be stored. The
+   returned buffer always has an extra null byte appended (not included in
+   *size*), regardless of whether there are any other null code points.
 
    In the case of an error, *NULL* is returned with an exception set and no
    *size* is stored.