[Python-Dev] PEP-393/PEP-3118: unicode format specifiers

Nick Coghlan ncoghlan at gmail.com
Wed Mar 7 01:17:25 CET 2012


On Wed, Mar 7, 2012 at 4:15 AM, Stefan Krah <stefan at bytereef.org> wrote:

Victor Stinner <victor.stinner at gmail.com> wrote:

A Unicode string is an array of code points. Another approach is to expose such a string as an array of uint8/uint16/uint32 integers. I don't know if you expect to get a character / a substring when you read the buffer of a string object. Using Python 3.2, I get:

>>> memoryview(b"abc")[0]
b'a'

... but using Python 3.3 I get a number :-)

Yes, that's changed because officially (see struct module) the format is unsigned bytes, which are integers in struct module syntax:

>>> unsignedbytes = memoryview(b"abc")
>>> unsignedbytes.format
'B'
>>> chararray = unsignedbytes.cast('c')
>>> chararray.format
'c'
>>> chararray[0]
b'a'

To maintain backwards compatibility, we should probably take the purity hit and officially change the default format of memoryview() to 'c', requiring the explicit cast to 'B' to get the new more bytes-like behaviour.

Using 'c' as the default format is a little ugly, but not as ugly as breaking currently working 3.2 code in the upgrade to 3.3.
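As a rough illustration of the compatibility concern, code that indexes a memoryview sees a different item type depending on the format, while slicing followed by tobytes() does not. The helper name first_byte below is hypothetical and only meant as a sketch; the cast() call assumes Python 3.3 or later.

```python
def first_byte(view):
    """Return the first item of a byte-oriented memoryview as a bytes object."""
    # Slicing yields a sub-view, and tobytes() copies it out as bytes,
    # so the result does not depend on whether the format is 'B' or 'c'.
    return view[0:1].tobytes()


view = memoryview(b"abc")
print(first_byte(view))            # same result for the default format ...
print(first_byte(view.cast("c")))  # ... and for an explicit 'c' cast (3.3+)
```

Code written this way should not care which default format is eventually chosen.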

Possibly the uint8/uint16/uint32 integer approach that you mention would make more sense.

Any changes made in this area should be aimed specifically at making life easier for developers dealing with ASCII puns in binary protocols. Being able to ask a string for a memoryview, and receiving one back with the format set to the appropriate value could potentially help with that by indicating:

ASCII: each code point is mapped to an integer in the range 0-127
latin-1: each code point is mapped to an integer in the range 0-255
UCS2: each code point is mapped to an integer in the range 0-65535
UCS4: each code point is mapped to an integer in the range 0-0x10FFFF

Using the actual code point values rather than bytes representations which may vary in length can help gloss over the differences in the underlying data layout. However, use cases should be explored more thoroughly first before any additional changes are made to the supported formats.
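To make the code point idea concrete, here is a minimal sketch that copies a string into an array of code points and wraps it in a memoryview whose format reflects the narrowest of the ranges listed above. It is not how str exposes (or would expose) its buffer: the function name code_point_view is hypothetical, and the choice of the struct codes 'B', 'H' and 'I' is an assumption made for illustration.

```python
from array import array


def code_point_view(text):
    """Return (label, memoryview) exposing text as an array of code points."""
    highest = max(map(ord, text)) if text else 0
    if highest < 0x80:
        label, typecode = "ASCII", "B"
    elif highest < 0x100:
        label, typecode = "latin-1", "B"
    elif highest < 0x10000:
        label, typecode = "UCS2", "H"
    else:
        # Assumes the native unsigned int is at least 4 bytes wide,
        # which holds on mainstream platforms.
        label, typecode = "UCS4", "I"
    # This copies the code points; it does not share memory with the string.
    return label, memoryview(array(typecode, [ord(ch) for ch in text]))


label, view = code_point_view("abc")
print(label, view.format, view.tolist())
```

Every item is a plain integer regardless of the width chosen, so a consumer only needs to check the format to learn the range of values it can expect.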

Cheers, Nick.

-- Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


