On 8/27/2014 5:16 AM, Nick Coghlan wrote:

On 27 August 2014 08:52, Nick Coghlan wrote:

On 27 Aug 2014 02:52, "Terry Reedy" wrote:

Nick, I think the first half of your post is one of the clearest  
expositions yet of 'why Python 3' (in particular, the str to unicode  
change). It is worthy of wider distribution and without much change, it  
would be a great blog post.

Indeed, I had the same idea - I had been assuming users already understood this context, which is almost certainly an invalid assumption.

The blog post version is already mostly written, but I ran out of weekend.
Will hopefully finish it up and post it some time in the next few days :)

Aaand, it's up:
http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html

Cheers,
Nick.

Indeed, I also enjoyed and found enlightening your response to this
issue, including the broader historical context. I remember when
Unicode was first published back in 1991, and it sounded
interesting, but far removed from the reality of implementations of
the day. I was intrigued by UTF-8 at the time, and even wrote an
encoder and decoder for it for a software package that, in the end,
never reached any real customers.

Your blog post says:

Choosing UTF-8 aims to treat formatting text for communication with
the user as "just a display issue". It's a low impact design that
will "just work" for a lot of software, but it comes at a price:

because encoding consistency checks are mostly avoided, data in
different encodings may be freely concatenated and passed on to
other applications. Such data is typically not usable by the
receiving application.

I don't believe this is a necessary result of using UTF-8. It is a
possible result, and I guess some implementations are using it this
way, but a proper language could still provide and/or require proper
usage of UTF-8 data through its type system just as Python3 is doing
with PEP 393. In fact, had there been no requirement to pass
character strings in other formats (UTF-16, UTF-32) to historical
APIs (in CPython add-on packages), with the resulting practical
performance cost of repeatedly converting to and from UTF-8 when
calling those APIs, Python3 could have evolved to use UTF-8 as its
underlying data format, and obtained the same encoding consistency
it has today.
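
To make that concrete, here is a minimal sketch of how a type could
require proper usage of UTF-8 data. The class Utf8Text and its methods
are hypothetical names invented for this illustration; nothing like it
exists in the standard library, and this is not how CPython implements
str.

```python
class Utf8Text:
    """Hypothetical text type backed by validated UTF-8 bytes."""

    __slots__ = ("_data",)

    def __init__(self, data: bytes):
        # Validate once at the boundary; decode() raises
        # UnicodeDecodeError if the bytes are not well-formed UTF-8.
        data.decode("utf-8")
        self._data = bytes(data)

    @classmethod
    def from_encoded(cls, data: bytes, encoding: str) -> "Utf8Text":
        # Data in another encoding must be transcoded explicitly.
        return cls(data.decode(encoding).encode("utf-8"))

    def __add__(self, other):
        # Only two validated values may be joined; raw bytes are refused.
        if not isinstance(other, Utf8Text):
            return NotImplemented
        joined = object.__new__(Utf8Text)
        # Joining two well-formed UTF-8 sequences is still well-formed,
        # so no second validation pass is needed here.
        joined._data = self._data + other._data
        return joined

    def to_bytes(self) -> bytes:
        return self._data
```

With this sketch, adding a Utf8Text to a plain bytes object raises
TypeError, and constructing one from Latin-1 bytes such as b"\xe9"
raises UnicodeDecodeError unless from_encoded() is used, so data in
different encodings cannot be freely concatenated by accident.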

Of course, nothing can be "required" if the user chooses to continue
operating in the encoded domain, and manipulate data using the
necessary byte-oriented features of whatever language is in use.

One of the choices of Python3 was to retain character indexing as
an underlying arithmetic implementation, citing algorithmic speed,
but that is a seldom needed operation, and of limited general
applicability when considering grapheme clusters. An iterator-based
approach can solve both problems, but would have been best
introduced as part of Python3.0, although it may have made 2to3
harder, and may have made it less practical to implement six and
other "run on both Py2 and Py3" type solutions, without
introducing those same iterative solutions into Python 2.6 or 2.7.
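
As a rough illustration of the iterator idea, the sketch below walks
UTF-8 data one code point at a time without ever indexing by character
position. The function name iter_code_points is my own invention for
this example. It yields code points, not grapheme clusters; grouping
those would need the Unicode segmentation rules layered on top.

```python
from typing import Iterator


def iter_code_points(data: bytes) -> Iterator[str]:
    """Yield one-character strings from UTF-8 bytes, in order."""
    position = 0
    while position < len(data):
        lead = data[position]
        # The lead byte alone says how many bytes the code point uses.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte at %d" % position)
        # decode() still checks the continuation bytes for us.
        yield data[position:position + width].decode("utf-8")
        position += width
```

Each step costs time proportional to the width of one code point, so a
full pass over the data stays linear even though the encoding is
variable-width.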

Such solutions could still be implemented as options. Even PEP 393
grudgingly supports some use of UTF-8 when requested by the user, as
I understand it. Whether such an implementation would be better
based on bytes or str is uncertain without further analysis,
although type checking would probably be easier if based on str. A
high-performance implementation would likely need to be implemented
at least partly in C rather than Python, although it could be
prototyped in Python for proof of functionality. The iterators could
obviously be implemented to work on top of solutions such as
PEP 393, by simply using indexing underneath, when fixed-width
characters are available, and other techniques when UTF-8 is the
only available format (rather than converting from UTF-8 to
fixed-width characters because of calling the iterator).
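
A pure-Python prototype of that dispatch might look something like the
following. The name iter_chars is hypothetical, and treating any bytes
argument as UTF-8 is an assumption of the sketch. For str it simply
indexes, since the representation is already fixed-width; for bytes it
feeds an incremental decoder from the standard codecs module, so the
whole buffer is never converted to fixed-width form.

```python
import codecs
from typing import Iterator, Union


def iter_chars(text: Union[str, bytes]) -> Iterator[str]:
    """Yield characters from a str, or from bytes assumed to be UTF-8."""
    if isinstance(text, str):
        # Fixed-width storage underneath: plain indexing is cheap.
        for index in range(len(text)):
            yield text[index]
        return
    # UTF-8 input: decode incrementally, one byte at a time.
    decoder = codecs.getincrementaldecoder("utf-8")()
    for byte in text:
        char = decoder.decode(bytes((byte,)))
        if char:
            yield char
    # Raises UnicodeDecodeError if the input ended mid-character.
    decoder.decode(b"", final=True)
```

Callers see the same sequence of characters either way, which is the
point: the iterator hides which representation is underneath.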