[Python-Dev] Bytes path support

Stephen J. Turnbull stephen at xemacs.org
Thu Aug 28 04:04:01 CEST 2014

Glenn Linderman writes:

On 8/27/2014 5:16 AM, Nick Coghlan wrote:

Choosing UTF-8 aims to treat formatting text for communication with the user as "just a display issue". It's a low impact design that will "just work" for a lot of software, but it comes at a price:

because encoding consistency checks are mostly avoided, data in different encodings may be freely concatenated and passed on to other applications. Such data is typically not usable by the receiving application.

I don't believe this is a necessary result of using UTF-8.

No, it's not, but if you're going to do the same kind of checks that are necessary for transcoding UTF-8 to abstract Unicode, there's no benefit to using UTF-8 internally, and you lose a lot. The only operations that you can do efficiently are concatenation and iteration. I've worked with a UTF-8-like internal encoding for 20 years now -- it's a huge cost.
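To illustrate the cost of indexing into a UTF-8 buffer, here is a minimal sketch (not how any particular implementation does it). The helper names `utf8_index` and `_width` are hypothetical, and the input is assumed to be valid UTF-8. Finding code point i means walking over every lead byte before it, whereas indexing a Python 3 `str` is a single array lookup.

```python
def _width(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_index(buf: bytes, i: int) -> str:
    """Return code point i of valid UTF-8 `buf` by scanning from the start: O(i)."""
    if i < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    pos = 0
    for _ in range(i):
        if pos >= len(buf):
            raise IndexError("code point index out of range")
        pos += _width(buf[pos])  # skip one whole character
    if pos >= len(buf):
        raise IndexError("code point index out of range")
    return buf[pos:pos + _width(buf[pos])].decode("utf-8")


text = "na\u00efve \u65e5\u672c\u8a9e \U0001F600"
buf = text.encode("utf-8")
same = all(utf8_index(buf, i) == text[i] for i in range(len(text)))
```

Caching positions or building an index can hide the scan, but that is exactly the kind of extra machinery on top of the data structure that I'm complaining about.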

Python3 could have evolved to using UTF-8 as its underlying data format, and obtained equal encoding consistency as it has today.

Thank heaven it didn't!

One of the choices of Python3 was to retain character indexing as an underlying arithmetic implementation, citing algorithmic speed, but that is a seldom needed operation,

That simply isn't true. The negative effects of algorithmic slowness in Emacsen are visible both as annoying user delays, and as excessive developer concentration on optimizing a fundamentally insufficient data structure.

and of limited general applicability when considering grapheme clusters. An iterator based approach can solve both problems,

On the contrary, grapheme clusters are the relatively rare use case in textual computing, at least currently, that can be optimized for when necessary. There's no problem with creating iterators from arrays, but making an iterator behave like an array ... well, that involves creating the array.
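A small sketch of that asymmetry, using plain Python sequences. `code_points` and `nth` are hypothetical helpers standing in for an iterator-only text API: going from array to iterator is free, while going the other way means either rescanning on every access or materializing the whole thing.

```python
from itertools import islice


def code_points(text):
    """A one-pass stream of characters, standing in for an iterator-only API."""
    yield from text


def nth(stream, k):
    """Item k of a stream, found by consuming k + 1 items: O(k), stream is used up."""
    return next(islice(stream, k, None))


chars = list("grapheme")                 # array: O(1) indexing, any order
stream = iter(chars)                     # array -> iterator: no copy needed
first = next(stream)

fifth = nth(code_points("grapheme"), 4)  # iterator "indexing": scan from the start
rebuilt = list(code_points("grapheme"))  # or build the array after all
```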

Such solutions could still be implemented as options.

Sure, but the problems to be solved in that implementation are not due to Python 3's internal representation. A lot of painstaking (and possibly hard?) work remains to be done.

A high-performance implementation would likely need to be implemented at least partly in C rather than CPython,

That's how Emacs did it, and (a) over the decades it has involved an inordinate amount of effort compared to rewriting the text-handling functions for an array, (b) is fragile, and (c) performance sucks in practice.

Unicode, not UTF-8, is the central component of the solution. The various UTFs are application-specific implementations of Unicode. UTF-8 is an excellent solution for text streams, such as disk files and network communication. Fixed-width representations (ISO-8859-1, UCS-2, UTF-32, PEP-393) are useful for applications of large buffers that need O(1) "random" access, and can trivially be iterated for stream applications.
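As a rough illustration of the trade-off (a sketch using only the standard codecs, with the hypothetical helper `utf32_index`), the same four characters take a variable number of bytes each in UTF-8 but a fixed four bytes each in UTF-32, so the position of character i in the fixed-width form is just arithmetic:

```python
text = "a\u00e9\u65e5\U0001F600"   # characters needing 1, 2, 3, and 4 bytes in UTF-8

utf8 = text.encode("utf-8")        # compact, good for files and sockets
utf32 = text.encode("utf-32-le")   # fixed width, no BOM with the explicit -le codec

utf8_widths = [len(ch.encode("utf-8")) for ch in text]


def utf32_index(buf: bytes, i: int) -> str:
    """Character i of a UTF-32-LE buffer: a slice at a computed offset, O(1)."""
    if not 0 <= i < len(buf) // 4:
        raise IndexError("character index out of range")
    return buf[4 * i:4 * i + 4].decode("utf-32-le")


last = utf32_index(utf32, 3)

# A fixed-width buffer is still trivially iterable for stream output.
streamed = b"".join(utf32_index(utf32, i).encode("utf-8")
                    for i in range(len(utf32) // 4))
```

PEP 393 strings apply the same idea per string, choosing the narrowest fixed width that fits every character in it.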

Steve
