[Python-Dev] len(chr(i)) = 2?

M.-A. Lemburg mal at egenix.com
Thu Nov 25 10:57:17 CET 2010


Alexander Belopolsky wrote:

> On Wed, Nov 24, 2010 at 9:17 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> ...

>> I note that an opinion has been raised on this thread that
>> if we want compressed internal representation for strings, we should
>> use UTF-8. I tend to agree, but UTF-8 has been repeatedly rejected as
>> too hard to implement. What makes UTF-16 easier than UTF-8? Only the
>> fact that you can ignore bugs longer, in my view.

> That's mostly true. My guess is that we can probably ignore those bugs for as long as it takes someone to write the higher-level libraries that James suggests and MAL has actually proposed and started a PEP for. As far as I can tell, that PEP generated a grand total of one comment in nine years. This may or may not be indicative of how far away we are from seeing it implemented. :-)

At the time it was too early for people to start thinking about these issues. Actual use of Unicode really only started a few years ago.

Since I didn't have a need for such an indexing module myself (and didn't have much time to work on it anyway), I punted on the idea.

If someone else wants to pick up the idea, I'd gladly help out with the details.

> As far as the UTF-8 vs. UCS-2/4 debate goes, I have an idea that may be even more far-fetched. Once upon a time, Python Unicode strings supported the buffer protocol and would lazily fill an internal buffer with bytes in the default encoding. In 3.x the default encoding has been fixed as UTF-8, buffer protocol support was removed from strings, but the internal buffer caching the (now UTF-8) encoded representation remained. Maybe we can now implement the defenc logic in reverse. Recall that strings are stored as UCS-2/4 sequences, but once a buffer is requested in 2.x Python code or a char* is obtained via PyUnicode_AsStringAndSize() at the C level in 3.x, an internal buffer is filled with UTF-8 bytes and defenc is set to point to that buffer.
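A toy Python rendering of the lazy caching behaviour just described; the class and method names below are invented for illustration, the real defenc is a C-level member of the unicode object:

    class CachedStr:
        """Toy model of a string that lazily caches its UTF-8 bytes,
        loosely mimicking the C-level defenc member described above."""

        def __init__(self, text):
            self._text = text      # the UCS-2/4-style representation (a str)
            self._defenc = None    # UTF-8 cache, filled on first request

        def as_utf8(self):
            # Fill the cache the first time the UTF-8 bytes are asked for,
            # then hand back the same buffer on later calls.
            if self._defenc is None:
                self._defenc = self._text.encode('utf-8')
            return self._defenc

    s = CachedStr('Ünïcode')
    print(s.as_utf8())   # b'\xc3\x9cn\xc3\xafcode', cached after the first call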

The original idea was for that buffer to go away once we moved to Unicode for strings. Reality has shown that we still need to keep the buffer around, though, since the UTF-8 representation of Unicode objects is used a lot.

> So the idea is for strings to store their data as a UTF-8 buffer pointed to by defenc upon construction. If an application uses string indexing, UTF-8-only strings will lazily fill their UCS-2/4 buffer. Proper, Unicode-aware algorithms such as grapheme, word or line iteration, or simple operations such as concatenation, search or substitution, would operate directly on defenc buffers. Presumably over time fewer and fewer applications would use code unit indexing that requires the UCS-2/4 buffer, and eventually Python strings could stop supporting indexing altogether, just like they stopped supporting the buffer protocol in 3.x.
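A sketch of what the quoted proposal seems to describe, again in toy Python with invented names rather than the C implementation it would actually need: UTF-8 is the primary storage, and the wide buffer only comes into existence if the string is ever indexed.

    class Utf8FirstStr:
        """Toy model of the proposal: UTF-8 is the primary storage and a
        code point array is only built if the string is ever indexed."""

        def __init__(self, text):
            self._utf8 = text.encode('utf-8')   # primary storage
            self._wide = None                   # UCS-2/4-style buffer, built lazily

        def find(self, sub):
            # Searching can work directly on the UTF-8 buffer.
            return self._utf8.find(sub.encode('utf-8'))

        def __getitem__(self, index):
            # Indexing forces the wide buffer into existence.
            if self._wide is None:
                self._wide = self._utf8.decode('utf-8')
            return self._wide[index]

    s = Utf8FirstStr('àbcé')
    print(s.find('é'))   # 4 -- a byte offset into the UTF-8 buffer, not a character index
    print(s[3])          # 'é' -- this call built the wide buffer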

I don't follow you: how would UTF-8, which has even more issues with variable-length representation of code points, make anything easier compared to UTF-16, which has far fewer such issues, and then only for non-BMP code points?
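For what it's worth, the difference in variable-length behaviour is easy to see from plain Python, counting UTF-16 code units as 2-byte pairs and UTF-8 code units as bytes:

    # Code units needed per character: UTF-16 counted in 2-byte units, UTF-8 in bytes.
    for ch in ('A', '\xe9', '\u20ac', '\U00010000'):
        print(ascii(ch),
              len(ch.encode('utf-16-le')) // 2,   # UTF-16 code units
              len(ch.encode('utf-8')))            # UTF-8 bytes
    # 'A'            1  1
    # '\xe9'         1  2
    # '\u20ac'       1  3
    # '\U00010000'   2  4   (surrogate pair in UTF-16)

Only the non-BMP character needs more than one UTF-16 code unit, whereas every non-ASCII character already needs several UTF-8 code units.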

Please note that we can only provide one way of string indexing in Python using the standard s[1] notation, and since we want that operation to be fast, i.e. no more than O(1), using the code units as items is the only reasonable way to implement it.
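For readers following along, this is what O(1) code unit indexing means in practice on a narrow (UCS-2) build, which is also what the subject line of this thread refers to:

    # Behaviour on a narrow (UCS-2) build, e.g. the Windows builds of that era:
    s = '\U00010000'          # a single non-BMP code point
    print(len(s))             # 2 -- stored as a surrogate pair
    print(hex(ord(s[0])))     # 0xd800 (high surrogate), O(1) access
    print(hex(ord(s[1])))     # 0xdc00 (low surrogate), O(1) access
    # On a wide (UCS-4) build len(s) == 1 and s[1] raises IndexError.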

With an indexing module, we could then let applications work based on higher level indexing schemes such as complete code points (skipping surrogates), combined code points, graphemes (ignoring e.g. most control code points and zero width code points), words (with some customizations as to where to break words, which will likely have to be language dependent), lines (which can be complicated for scripts that use columns instead ;-)), paragraphs, etc.
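As an illustration of the lowest layer such an indexing module might provide, here is a rough sketch (invented function name, not taken from the PEP) that walks code units and reassembles surrogate pairs into complete code points:

    def iter_code_points(s):
        """Yield complete code points from a sequence of code units,
        combining surrogate pairs as found on narrow builds."""
        i, n = 0, len(s)
        while i < n:
            cu = ord(s[i])
            if 0xD800 <= cu <= 0xDBFF and i + 1 < n and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF:
                # High surrogate followed by a low surrogate: one code point.
                yield 0x10000 + ((cu - 0xD800) << 10) + (ord(s[i + 1]) - 0xDC00)
                i += 2
            else:
                yield cu
                i += 1

    print([hex(cp) for cp in iter_code_points('a\ud800\udc00b')])
    # ['0x61', '0x10000', '0x62']

Grapheme, word and line iteration would then be further layers on top of this, with the language-dependent customizations mentioned above.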

It would also help to add transparent indexing for right-to-left scripts and text that uses both left-to-right and right-to-left text (BIDI).

However, in order for these indexing methods to actually work, they will need to return references to the code units, so we cannot just drop that access method.

In any case, I think this discussion is losing its grip on reality.

By far, most strings you find in actual applications don't use surrogates at all, so the problem is being exaggerated.

If you need to be careful about surrogates for some reason, I think a single new method .hassurrogates() on string objects would go a long way towards making detection and special-casing of these a lot easier.
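The method doesn't exist today; a pure-Python stand-in (O(n), whereas a C-level flag or scan could be much cheaper) would look something like this:

    def has_surrogates(s):
        """Stand-in for the proposed .hassurrogates() method: report
        whether any code unit falls into the surrogate range."""
        return any(0xD800 <= ord(cu) <= 0xDFFF for cu in s)

    print(has_surrogates('plain text'))   # False
    print(has_surrogates('a\ud800b'))     # True

Code that obviously cannot cope with surrogates could then guard itself with such a check and either raise or fall back to a slower path.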

If adding support for surrogates doesn't make sense (e.g. in the case of the formatting methods), then we simply punt on that and leave such handling to other tools.

It is far more important to maintain round-trip safety for Unicode data than to get every bit of code to work correctly with surrogates (often, there won't be a single correct way).

With a new method for fast detection of surrogates, we could protect code which obviously doesn't work with surrogates and then consider each case individually by either adding special cases as necessary or punting on the support.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 25 2010)
Python/Zope Consulting and Support ...        http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...              http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...           http://python.egenix.com/

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/


