[Python-Dev] Internal representation of strings and Micropython

Chris Angelico rosuav at gmail.com
Wed Jun 4 12:51:36 CEST 2014


On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky <pmiscml at gmail.com> wrote:

> That's another reason why people don't like Unicode enforced upon them - all the talk about supporting all languages and scripts is demagogy and hypocrisy; given a choice, Unicode zealots would rather limit people to Latin script than give up on their arbitrarily chosen, one-among-thousands, soon-to-be-replaced-by-apples'-and-microsofts'-"exciting-new" encoding.

Wrong. I use and recommend Unicode, with UTF-8 for transmission, and I do not ever want to limit people to Latin-1 or any other such subset. Even though English is the only language I speak, I frequently use non-ASCII characters (e.g. when I discuss mathematics on a MUD), and if I could be absolutely sure that everyone in the conversation correctly comprehended Unicode, I could do this with a lot more confidence. Unfortunately, the server I use just passes bytes in and out; some clients assume CP-1252, others assume Latin-1, and others (including my Gypsum) try UTF-8 first and fall back on an eight-bit encoding (currently CP-1252 because of the first group). But in an ideal world, server and clients would all speak Unicode everywhere, and transmit and receive UTF-8. This is not hypocrisy; this is the way to work reliably.
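For illustration, the "try UTF-8 first, fall back on an eight-bit encoding" behaviour could be sketched roughly as below. This is not Gypsum's actual code; the function name decode_line and its fallback parameter are hypothetical, and Python is used only because it is the language under discussion.

```python
def decode_line(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes received from a server that does not declare an encoding.

    Hypothetical helper: try strict UTF-8 first, then an eight-bit fallback.
    """
    try:
        # Strict decoding: arbitrary eight-bit text is rarely valid UTF-8,
        # so a successful decode is a good sign the sender really used UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # CP-1252 leaves a few byte values (such as 0x81) undefined, so use
        # errors="replace" to substitute U+FFFD rather than raise again.
        return raw.decode(fallback, errors="replace")


from_utf8_client = decode_line("\u03c0 r\u00b2".encode("utf-8"))  # valid UTF-8
from_cp1252_client = decode_line(b"caf\xe9")                      # not valid UTF-8
```

The order matters: almost any byte sequence decodes "successfully" as an eight-bit encoding, so the eight-bit codec can only be the fallback, never the first attempt.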

> Once again, my claim is that what MicroPython implements now is more correct - in a sense wider than technical - handling. We don't provide Unicode encoding support, because it's highly bloated, but let people use any encoding they like. That comes at some price, like the length of strings in characters not being known to the runtime, only in bytes, but quite a lot of applications can be written with just that.

The current implementation is flat-out lying, actually. It claims that it's storing Unicode codepoints (as per the Python spec) while actually storing bytes, and then it transmits those bytes to the console etc. as-is. This is a bug. It needs to be fixed. The only question is, what form will the fix take? Will it be PEP 393's flexible fixed-width representation? UTF-8? UTF-16 (I hope not!)? A hybrid of Latin-1 where possible and UTF-8 otherwise? But something has to be done.
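To make the codepoints-versus-bytes distinction concrete, here is a small sketch, run on CPython 3, comparing the length of a string in code points with its size under the candidate encodings. The helper pep393_width is hypothetical; it only mimics the per-character width rule described in PEP 393 and does not inspect any interpreter's real internals.

```python
text = "na\u00efve \u03c0"   # 7 code points, two of them non-ASCII

code_points = len(text)                    # what the Python spec says len() reports
utf8_bytes = len(text.encode("utf-8"))     # what a bytes-only runtime would report
utf16_bytes = len(text.encode("utf-16-le"))
utf32_bytes = len(text.encode("utf-32-le"))


def pep393_width(s: str) -> int:
    """Bytes per character a PEP 393 style fixed-width layout would need.

    Hypothetical helper: the widest code point in the string picks the width.
    """
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return 1      # everything fits in Latin-1
    if highest < 0x10000:
        return 2      # everything fits in the Basic Multilingual Plane
    return 4          # at least one astral code point


assert code_points == 7
assert utf8_bytes == 9            # two characters take 2 bytes each
assert pep393_width(text) == 2    # U+03C0 forces 2 bytes per character
```

A runtime that stores raw UTF-8 bytes but reports them as characters would give 9 for this string where the language spec calls for 7, which is exactly the mismatch at issue.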

ChrisA
