[Python-Dev] PEP-393/PEP-3118: unicode format specifiers

Victor Stinner victor.stinner at gmail.com
Tue Mar 6 17:43:43 CET 2012


In the array module the 'u' specifier previously meant "2-bytes, on wide builds 4-bytes". Currently in 3.3 the 'u' specifier is mapped to UCS4.

I think it would be nice for Python3.3 to implement the PEP-3118 suggestion:

'c' -> UCS1
'u' -> UCS2
'w' -> UCS4

A Unicode string is an array of code points. Another approach is to expose such a string as an array of uint8/uint16/uint32 integers. I don't know if you expect to get a character or a substring when you read the buffer of a string object. Using Python 3.2, I get:

>>> memoryview(b"abc")[0]
b'a'

... but using Python 3.3 I get a number :-)

>>> memoryview(b'abc')[0]
97
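To illustrate the "array of integers" view of a string, here is a minimal sketch for Python 3.3 or later. The helper name code_units is hypothetical, and the sketch goes through an encoded copy of the string rather than through the string's own buffer. It also assumes that the native 'I' format is 4 bytes wide, which holds on common platforms but is not guaranteed.

```python
import sys


def code_units(text, width):
    """Return `text` as a list of unsigned integers of `width` bytes each.

    Hypothetical helper: it encodes the string and casts the resulting
    bytes, so it works on a copy, not on the string's internal storage.
    """
    # memoryview.cast() uses native byte order, so encode accordingly.
    suffix = "le" if sys.byteorder == "little" else "be"
    table = {
        1: ("latin-1", "B"),          # UCS1: code points up to U+00FF
        2: ("utf-16-" + suffix, "H"),  # UCS2 only if no code point is above U+FFFF
        4: ("utf-32-" + suffix, "I"),  # UCS4; assumes 'I' is 4 bytes wide
    }
    codec, fmt = table[width]
    data = text.encode(codec)
    return memoryview(data).cast(fmt).tolist()


print(code_units("abc", 1))
print(code_units("a\u20ac", 2))
print(code_units("\U0001f600", 4))
```

With width 2, a string containing code points above U+FFFF would yield surrogate pairs, so the result would no longer be one integer per character.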

It is no longer possible to create a Unicode string containing characters outside the U+0000-U+10FFFF range. You might apply the same restriction in the buffer API for UCS4. Checking there may be inefficient, so the check can instead be done when you convert the buffer to a string.
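As a rough sketch of deferring the range check to conversion time, the following hypothetical function ucs4_to_str accepts any sequence of integers (here an array of unsigned ints stands in for a UCS4 buffer) and validates each value only when building the string. It is an illustration of the idea, not how CPython implements it.

```python
import array

MAX_CODE_POINT = 0x10FFFF


def ucs4_to_str(values):
    """Build a str from integer code points, validating the range lazily.

    Hypothetical helper: the buffer itself is allowed to hold any
    integer; the check happens only here, at conversion time.
    """
    chars = []
    for index, value in enumerate(values):
        if not 0 <= value <= MAX_CODE_POINT:
            raise ValueError(
                "value %#x at index %d is outside U+0000-U+10FFFF" % (value, index)
            )
        chars.append(chr(value))
    return "".join(chars)


buf = array.array("I", [0x61, 0x1F600])
print(ucs4_to_str(buf) == "a\U0001f600")
```

Storing 0x110000 in the array succeeds, and the error only surfaces when ucs4_to_str is called on it.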

Actually we could even add 'a' -> ASCII

ASCII implies that the values are in the range U+0000-U+007F (0-127). As with UCS4, you may do the check either in the buffer API or when the buffer is converted to a string.

I don't think that it would be useful to add an ASCII buffer type, because when the buffer is converted to a string, Python has to recompute the maximum character anyway (to choose between ASCII, UCS1, UCS2 and UCS4). For example, "abc\xe9"[:-1] is ASCII. UCS1 is enough for ASCII strings.
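The "recompute the maximum character" step can be sketched in pure Python as below. The function name narrowest_kind is hypothetical and only mirrors the decision described here; the real selection happens in C inside the interpreter.

```python
def narrowest_kind(text):
    """Name the narrowest storage kind able to hold every character of `text`.

    Hypothetical helper that mirrors the ASCII/UCS1/UCS2/UCS4 choice
    by scanning for the largest code point.
    """
    max_code_point = max(map(ord, text), default=0)
    if max_code_point < 0x80:
        return "ASCII"
    if max_code_point < 0x100:
        return "UCS1"
    if max_code_point < 0x10000:
        return "UCS2"
    return "UCS4"


print(narrowest_kind("abc\xe9"))
print(narrowest_kind("abc\xe9"[:-1]))
```

The slice drops the only non-ASCII character, so the second call reports ASCII even though the source string needed UCS1.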

Victor


