[Python-Dev] unicode vs buffer (array) design issue can crash interpreter
Neal Norwitz nnorwitz at gmail.com
Thu Apr 13 07:13:51 CEST 2006
On 3/31/06, M.-A. Lemburg <mal at egenix.com> wrote:
Martin v. Löwis wrote:
> Neal Norwitz wrote:
>> See http://python.org/sf/1454485 for the gory details.  Basically if
>> you create a unicode array (array.array('u')) and try to append an
>> 8-bit string (ie, not unicode), you can crash the interpreter.
>>
>> The problem is that the string is converted without question to a
>> unicode buffer.  Within unicode, it assumes the data to be valid, but
>> this isn't necessarily the case.  We wind up accessing an array with a
>> negative index and boom.
>
> There are several problems combined here, which might need discussion:
>
> - why does the 'u#' converter use the buffer interface if available?
>   it should just support Unicode objects.  The buffer object makes
>   no promise that the buffer actually is meaningful UCS-2/UCS-4, so
>   u# shouldn't guess that it is.
>   (FWIW, it currently truncates the buffer size to the next-smaller
>   multiple of sizeof(Py_UNICODE), and silently so)
>
> I think that part should just go: u# should be restricted to unicode
> objects.
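For anyone following along, here is a rough illustration (mine, not from the
tracker item; the exact crash recipe is in http://python.org/sf/1454485) of
the conversion path Martin is describing, using the 2.4-era C API: handing a
plain str to the "u#" converter reinterprets its raw bytes as Py_UNICODE code
units and silently truncates the length.

#include <Python.h>

static void
show_u_hash_confusion(void)
{
    PyObject *bytes = PyString_FromString("abcde");  /* 5 raw bytes, not unicode */
    Py_UNICODE *u = NULL;
    int len = 0;

    /* Before the patch below, this can succeed: the buffer interface hands
       back the str's bytes and len becomes 5 / sizeof(Py_UNICODE), i.e. 2 on
       a UCS-2 build or 1 on UCS-4, with the trailing bytes silently dropped. */
    if (!PyArg_Parse(bytes, "u#", &u, &len))
        PyErr_Clear();
    Py_DECREF(bytes);
}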
'u#' is intended to match 's#', which also uses the buffer interface. It expects the buffer returned by the object to be a Py_UNICODE* buffer, hence the calculation of the length. However, we already have 'es#', which is a lot safer to use in this respect: you can explicitly define the encoding you want to see, e.g. 'unicode-internal', and the associated codec also takes care of range checks, etc. So, I'm +1 on restricting 'u#' to Unicode objects.
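For illustration, a sketch (not code from the thread; the function name is
made up) of what the 'es#' alternative MAL mentions looks like in an
extension function:

#include <Python.h>

static PyObject *
take_unicode_internal(PyObject *self, PyObject *args)
{
    char *buf = NULL;   /* NULL tells "es#" to allocate a new buffer */
    int buflen = 0;

    /* The named codec ('unicode-internal' here) decodes and validates the
       input and copies it, instead of aliasing whatever raw buffer the
       object happens to expose. */
    if (!PyArg_ParseTuple(args, "es#", "unicode-internal", &buf, &buflen))
        return NULL;

    /* buf now holds buflen bytes of validated internal-format data; release
       the copy that "es#" allocated for us. */
    PyMem_Free(buf);
    Py_RETURN_NONE;
}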
Note: 2.5 no longer crashes, 2.4 does.
Does this mean you would like to see this patch checked in to 2.5? What should we do about 2.4?
Index: Python/getargs.c
--- Python/getargs.c	(revision 45333)
+++ Python/getargs.c	(working copy)
@@ -1042,11 +1042,8 @@
 				STORE_SIZE(PyUnicode_GET_SIZE(arg));
 			}
 			else {
-				char *buf;
-				Py_ssize_t count = convertbuffer(arg, p, &buf);
-				if (count < 0)
-					return converterr(buf, arg, msgbuf, bufsize);
-				STORE_SIZE(count/(sizeof(Py_UNICODE)));
+				return converterr("cannot convert raw buffers",
+						  arg, msgbuf, bufsize);
 			}
 			format++;
 		} else {
> - should Python guarantee that all characters in a Unicode object
>   are between 0 and sys.maxunicode?  Currently, it is possible to
>   create Unicode strings with either negative or very large Py_UNICODE
>   elements.
>
> - if the answer to the last question is no (i.e. if it is intentional
>   that a unicode object can contain arbitrary Py_UNICODE values): should
>   Python then guarantee that Py_UNICODE is an unsigned type?
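(To make the first question concrete: with the current C API nothing stops an
extension from doing something like the following. This is my own sketch, not
code from the thread.)

#include <Python.h>

/* Nothing range-checks the code units handed to PyUnicode_FromUnicode, so a
   Unicode object can end up holding values above sys.maxunicode, or values
   that would read as negative if Py_UNICODE were a signed type. */
static PyObject *
make_out_of_range_unicode(void)
{
    Py_UNICODE raw[1];
    raw[0] = (Py_UNICODE)0xFFFFFFFFul;  /* truncated to the width of
                                           Py_UNICODE; well above 0x10FFFF on
                                           a UCS-4 build */
    return PyUnicode_FromUnicode(raw, 1);
}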
Py_UNICODE must always be unsigned. The whole implementation relies on this and has been designed with this in mind (see PEP 100). AFAICT, the configure script does check that Py_UNICODE is always unsigned.
Martin fixed the crashing problem in 2.5 by making wchar_t unsigned; it had previously been signed due to a bug (a configure test was reversed, IIRC). Can this change to wchar_t be made in 2.4? That technically changes all the interfaces, even though it was a mistake. What should be done for 2.4?
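(A toy check of the property in question; this is just an illustration, not
the actual configure test.)

#include <Python.h>
#include <stdio.h>

int
main(void)
{
    /* If Py_UNICODE is unsigned, (Py_UNICODE)-1 wraps around to the maximum
       representable value, so the comparison below holds; for a signed
       typedef it would be false. */
    printf("Py_UNICODE is %s\n",
           ((Py_UNICODE)-1 > 0) ? "unsigned" : "signed");
    return 0;
}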
n