Message 93590 - Python tracker (original) (raw)

We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values.

That has always been the case. UCS2 doesn't support surrogates.

However, we have been slowly moving into the direction of making the UCS2 storage appear like UTF-16 to the Python programmer.

UCS2 died long ago, is there any reason why we keep using an UCS2 that "appears" like UTF-16 instead of real UTF-16?

This process is not yet complete and will likely never complete since it must still be possible to create things line lone surrogates for processing purposes, so care has to be taken when using non-BMP code points on narrow builds.

I don't exactly know all the details of the current implementation, but -- from what I understand reading this (correct me if I'm wrong) -- it seems that the implementation is half-UCS2 to allow things like the processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to work with surrogate pairs and hence with chars outside the BMP.

What are the use cases for processing the lone surrogates? Wouldn't be better to use UTF-16 and disallow them (since they are illegal) and possibly provide some other way to deal with them (if it's really needed)?