[Python-Dev] unicode vs buffer (array) design issue can crash interpreter
M.-A. Lemburg mal at egenix.com
Thu Apr 13 12:20:49 CEST 2006
Neal Norwitz wrote:
On 3/31/06, M.-A. Lemburg <mal at egenix.com> wrote:
Martin v. Löwis wrote:
Neal Norwitz wrote:
See http://python.org/sf/1454485 for the gory details. Basically, if you create a unicode array (array.array('u')) and try to append an 8-bit string (i.e., not unicode), you can crash the interpreter.
The problem is that the string is converted without question to a unicode buffer. Within unicode, it assumes the data to be valid, but this isn't necessarily the case. We wind up accessing an array with a negative index and boom.

There are several problems combined here, which might need discussion:

- why does the 'u#' converter use the buffer interface if available? It should just support Unicode objects. The buffer object makes no promise that the buffer actually is meaningful UCS-2/UCS-4, so u# shouldn't guess that it is. (FWIW, it currently truncates the buffer size to the next-smaller multiple of sizeof(Py_UNICODE), and silently so.) I think that part should just go: u# should be restricted to unicode objects.

'u#' is intended to match 's#', which also uses the buffer interface. It expects the buffer returned by the object to be a Py_UNICODE* buffer, hence the calculation of the length.

However, we already have 'es#', which is a lot safer to use in this respect: you can explicitly define the encoding you want to see, e.g. 'unicode-internal', and the associated codec also takes care of range checks, etc.

So, I'm +1 on restricting 'u#' to Unicode objects.

Note: 2.5 no longer crashes, 2.4 does. Does this mean you would like to see this patch checked in to 2.5?
Yes.
What should we do about 2.4?
Perhaps you could add a warning that is displayed when using u# with non-Unicode objects ?!
Index: Python/getargs.c
===================================================================
--- Python/getargs.c	(revision 45333)
+++ Python/getargs.c	(working copy)
@@ -1042,11 +1042,8 @@
 				STORE_SIZE(PyUnicode_GET_SIZE(arg));
 			}
 			else {
-				char *buf;
-				Py_ssize_t count = convertbuffer(arg, p, &buf);
-				if (count < 0)
-					return converterr(buf, arg, msgbuf, bufsize);
-				STORE_SIZE(count/(sizeof(Py_UNICODE)));
+				return converterr("cannot convert raw buffers",
+						  arg, msgbuf, bufsize);
 			}
 			format++;
 		} else {
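As a rough illustration of what the removed branch amounted to, the following Python sketch divides the byte count by the code-unit size, drops the remainder, and treats whatever is left as code units without validating them. The 4-byte unit size, the helper names, and the sample data are assumptions made for this example only. The 'utf-32-le' codec is swapped in for the 'unicode-internal' codec mentioned earlier in the thread, purely to show a codec doing the range check that the raw reinterpretation skips.

```python
import struct

UNIT_SIZE = 4  # assumed sizeof(Py_UNICODE) on a UCS-4 build; 2 on a UCS-2 build


def reinterpret_raw(buf, unit_size=UNIT_SIZE):
    """Mimic the removed branch: truncate the length, then trust the bytes."""
    count = len(buf) // unit_size  # trailing bytes are silently dropped
    code = "I" if unit_size == 4 else "H"
    return list(struct.unpack("<%d%s" % (count, code), buf[:count * unit_size]))


def decode_checked(buf):
    """Decode through an explicit codec, which validates every code point."""
    return buf.decode("utf-32-le")


raw = b"abcdefghij"  # a 10-byte 8-bit string, not Unicode data
units = reinterpret_raw(raw)
print(len(units), [hex(u) for u in units])  # two units; the last 2 bytes vanish

try:
    decode_checked(raw[:8])
except UnicodeDecodeError as exc:
    print("codec rejected the data:", exc.reason)
```

Both reinterpreted values lie far above 0x10FFFF, which is the kind of element the Unicode implementation never expects to see, while the codec route refuses the same bytes outright.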
- should Python guarantee that all characters in a Unicode object are between 0 and sys.maxunicode? Currently, it is possible to create Unicode strings with either negative or very large Py_UNICODE elements.

- if the answer to the last question is no (i.e. if it is intentional that a unicode object can contain arbitrary Py_UNICODE values): should Python then guarantee that Py_UNICODE is an unsigned type?

Py_UNICODE must always be unsigned. The whole implementation relies on this and has been designed with this in mind (see PEP 100). AFAICT, the configure script does check that Py_UNICODE is always unsigned.

Martin fixed the crashing problem in 2.5 by making wchar_t unsigned; the signedness was a bug. (A configure test was reversed IIRC.) Can this change to wchar_t be made in 2.4? That technically changes all the interfaces even though it was a mistake. What should be done for 2.4?
If users want to interface from wchar_t to Python's Unicode type they have to go through the PyUnicode_FromWideChar() and PyUnicode_AsWideChar() interfaces. Assuming that Py_UNICODE is the same as wchar_t is simply wrong (and always was).
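For readers who want to poke at these two conversion functions without writing an extension, here is a small sketch that calls them through ctypes. It assumes CPython (ctypes.pythonapi is CPython-specific) and a present-day Python 3 interpreter; the wrapper names wide_to_str and str_to_wide are made up for this example. It only illustrates the round trip and is not a substitute for calling the functions from C.

```python
import ctypes

api = ctypes.pythonapi  # CPython only; calls are made with the GIL held

# PyObject *PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
api.PyUnicode_FromWideChar.restype = ctypes.py_object

# Py_ssize_t PyUnicode_AsWideChar(PyObject *unicode, wchar_t *w, Py_ssize_t size)
api.PyUnicode_AsWideChar.argtypes = [
    ctypes.py_object, ctypes.c_wchar_p, ctypes.c_ssize_t]
api.PyUnicode_AsWideChar.restype = ctypes.c_ssize_t


def wide_to_str(wbuf, length):
    """Build a Python string from `length` wchar_t elements."""
    return api.PyUnicode_FromWideChar(wbuf, length)


def str_to_wide(text):
    """Copy a Python string into a freshly allocated wchar_t buffer."""
    out = ctypes.create_unicode_buffer(len(text) + 1)  # room for the NUL
    copied = api.PyUnicode_AsWideChar(text, out, len(out))
    return out, copied


source = ctypes.create_unicode_buffer("h\u00e9llo")  # a wchar_t array
text = wide_to_str(source, 5)
out, copied = str_to_wide(text)
print(repr(text), copied, repr(out.value))
```

The point of going through these functions is that the caller never has to know how the interpreter stores its code units internally; only the wchar_t side is platform-defined.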
I also think that changing the type from signed to unsigned by backporting the configure fix will only make things safer for the user, since extensions will probably not even be aware of the fact that Py_UNICODE could be signed (it has always been documented to be unsigned).
So +1 on backporting the configure test fix to 2.4.
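The difference the signedness makes can be sketched in a few lines of Python. The example reads the same four bytes once as signed and once as unsigned 16-bit units; the 16-bit width, the sample bytes, and the helper in_unicode_range are assumptions chosen for illustration and are not taken from the interpreter sources.

```python
import struct
import sys

raw = bytes([0x00, 0x80, 0xFF, 0xFF])  # two 16-bit code units, little-endian

as_signed = struct.unpack("<2h", raw)    # what a signed 16-bit type would hold
as_unsigned = struct.unpack("<2H", raw)  # what an unsigned 16-bit type would hold


def in_unicode_range(value):
    """True if value could be a code point: 0 <= value <= sys.maxunicode."""
    return 0 <= value <= sys.maxunicode


print(as_signed, [in_unicode_range(v) for v in as_signed])
print(as_unsigned, [in_unicode_range(v) for v in as_unsigned])
```

With a signed type, any unit whose top bit is set turns negative, and C code that uses such a value as a table index reads before the start of the table. With an unsigned type the same bits stay non-negative.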
-- Marc-Andre Lemburg eGenix.com
Professional Python Services directly from the Source (#1, Apr 13 2006)