[Python-Dev] Multilingual programming article on the Red Hat Developer blog
Jeff Allen ja.py at farowl.co.uk
Sat Sep 13 00:16:30 CEST 2014
Jim, Stephen:
It seems like we're off topic here, but to answer all as briefly as possible:
- Java does not really have a Unicode type, and therefore has none that validates. It has a String type that is a sequence of UTF-16 code units. Some String and Character methods deal with code points represented as int. I can put any 16-bit values I like in a String.
- With proper accounting for indices, and as long as surrogates appear in pairs, I believe operations like find or endswith give correct answers about the Unicode text when applied to its UTF-16 representation. This is an attractive implementation option, and mostly what we do.
- I'm fixing some bugs where we get it wrong beyond the BMP, and the fix involves banning lone surrogates (completely). At present you can't type them in literals but you can sneak them in from Java.
- I think (with Antoine) that if Jython supported PEP 383 byte smuggling, it would have to do it the same way as CPython, as the result is visible to Python code. It's not impossible (I think), but it is messy. Some are strongly against.
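
To illustrate the point about find on UTF-16, here is a minimal Python sketch. It is not Jython's code: the helper names (utf16_units, find_units, unit_index_to_code_point_index, find_code_points) are hypothetical, and the search is deliberately naive. It assumes surrogates are well paired when converting indices, and then shows how a lone surrogate makes the code-unit search disagree with the code-point answer.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints,
    roughly what a Java String would hold for the same text."""
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def find_units(hay, needle):
    """Naive search over code units; return the unit index or -1."""
    n, m = len(hay), len(needle)
    for i in range(n - m + 1):
        if hay[i:i + m] == needle:
            return i
    return -1


def unit_index_to_code_point_index(units, unit_index):
    """Count the code points before unit_index.
    Assumes every high surrogate is followed by its low surrogate."""
    count = 0
    i = 0
    while i < unit_index:
        # A high surrogate starts a two-unit code point.
        i += 2 if 0xD800 <= units[i] <= 0xDBFF else 1
        count += 1
    return count


def find_code_points(text, sub):
    """find() done on code units, with the index converted back."""
    hay = utf16_units(text)
    unit_index = find_units(hay, utf16_units(sub))
    if unit_index < 0:
        return -1
    return unit_index_to_code_point_index(hay, unit_index)


text = "a\U0001F600b\U0001F600c"

# With paired surrogates the two views agree once indices are converted.
assert find_code_points(text, "b") == text.find("b")

# A lone low surrogate "matches" the second half of a pair in the
# code-unit view, although that code point does not occur in the text.
assert find_units(utf16_units("\U0001F600"), utf16_units("\ude00")) == 1
assert "\U0001F600".find("\ude00") == -1
```

The last two assertions are the kind of disagreement that banning lone surrogates would rule out.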
Jeff Allen
On 12/09/2014 16:37, Jim J. Jewett wrote:
> On September 11, 2014, Jeff Allen wrote:
>> ... "surrogateescape" is an error handler, not a codec.
>
> True, but I believe that is a CPython implementation detail. Other implementations (including Jython) should implement the surrogateescape API, but I don't think it is important to use the same internal representation for the invalid bytes.
>
>> lone surrogates preclude a naive use of the platform string library
>
> Invalid input often causes problems. Are you saying that there are situations where the platform string library could easily handle invalid characters in general, but has a problem with the specific case of lone surrogates?
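
For readers unfamiliar with the error handler under discussion, the following is a small sketch of how "surrogateescape" behaves in CPython 3. The sample bytes and the function names smuggle and unsmuggle are made up for illustration. It shows that each undecodable byte surfaces in the str as a lone low surrogate (U+DC80 to U+DCFF), which is the representation Python code can observe.

```python
def smuggle(raw, encoding="utf-8"):
    """Decode bytes, mapping each undecodable byte 0xNN to U+DCNN."""
    return raw.decode(encoding, errors="surrogateescape")


def unsmuggle(text, encoding="utf-8"):
    """Encode str, turning U+DCNN escapes back into the byte 0xNN."""
    return text.encode(encoding, errors="surrogateescape")


# Hypothetical file name encoded in Latin-1, so 0xE9 is invalid UTF-8.
raw = b"caf\xe9.txt"

text = smuggle(raw)
assert text == "caf\udce9.txt"   # the bad byte is visible as U+DCE9
assert unsmuggle(text) == raw    # and the round trip is lossless

# Without the handler, the lone surrogate cannot be encoded.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok
```

Because the lone surrogate is an ordinary element of the resulting str, code that indexes or compares the string can tell how the bytes were smuggled.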