[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Jim J. Jewett jimjjewett at gmail.com
Mon Sep 15 20:35:01 CEST 2014


On Sat Sep 13 00:16:30 CEST 2014, Jeff Allen wrote:

1. Java does not really have a Unicode type, therefore not one that validates. It has a String type that is a sequence of UTF-16 code units. There are some String methods and Character methods that deal with code points represented as int. I can put any 16-bit values I like in a String.

Including lone surrogates, and invalid characters in general?

2. With proper accounting for indices, and as long as surrogates appear in pairs, I believe operations like find or endswith give correct answers about the Unicode text when applied to the UTF-16 code units. This is an attractive implementation option, and mostly what we do.
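A rough Python sketch of the index accounting described in point 2, assuming well-formed UTF-16 (surrogates only in pairs). The helper names to_code_units, find_units and unit_index_to_code_point_index are made up for illustration and are not Jython APIs:

```python
def to_code_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def find_units(haystack, needle):
    """Naive search over code units; returns a code-unit index or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def unit_index_to_code_point_index(units, unit_index):
    """Count the code points that start before unit_index.

    Assumes surrogates appear only in pairs, so a trail surrogate
    (0xDC00..0xDFFF) never starts a new code point.
    """
    return sum(1 for u in units[:unit_index] if not 0xDC00 <= u <= 0xDFFF)


text = "a\U0001F600bc"          # one character beyond the BMP
units = to_code_units(text)     # 'a', lead surrogate, trail surrogate, 'b', 'c'
unit_pos = find_units(units, to_code_units("bc"))
cp_pos = unit_index_to_code_point_index(units, unit_pos)
```

Here the match is found at code-unit index 3, which corresponds to code-point index 2, the same answer text.find("bc") gives on the Unicode string.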

So use it. The fact that you're having to smuggle bytes already guarantees that your data is either invalid or misinterpreted, so bug-free handling isn't possible.

In terms of best effort, it is reasonable to treat the smuggled bytes as representing a character outside of your Unicode repertoire -- so it won't ever match entirely valid strings, except perhaps via a wildcard. And it should still work for .endswith().
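For reference, this is roughly how the behaviour looks in CPython 3 with the PEP 383 "surrogateescape" error handler. The file name is invented for the example:

```python
# PEP 383: each undecodable byte 0x80..0xFF is smuggled into the str
# as a lone surrogate in the range U+DC80..U+DCFF.
raw = b"caf\xe9.txt"            # Latin-1 bytes, not valid UTF-8

name = raw.decode("utf-8", errors="surrogateescape")

# The bad byte 0xE9 shows up as the lone surrogate U+DCE9.
smuggled = name[3]

# Suffix tests on the valid part of the string still work.
ends_with_txt = name.endswith(".txt")

# The smuggled code point never equals the character it might have meant.
matches_real_e_acute = (name == "caf\u00e9.txt")

# A strict encode refuses the lone surrogate ...
try:
    name.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# ... while the same error handler restores the original bytes exactly.
round_trip = name.encode("utf-8", errors="surrogateescape")
```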

3. I'm fixing some bugs where we get it wrong beyond the BMP, and the fix involves banning lone surrogates (completely). At present you can't type them in literals but you can sneak them in from Java.

So how will you ban them, and what will you do when some Java class sends you an invalid sequence anyhow? That is exactly the use case for these smuggled bytes...

If you distinguish between a fully constructed PyString and a code-unit-sequence-that-could-be-made-into-a-PyString-later, then you could always have your constructor return an InvalidPyString subclass on the rare occasions when one is needed.
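A minimal sketch of that idea, written in Python for brevity rather than as Jython's actual Java classes. PyString here is a stand-in wrapper around a tuple of UTF-16 code units, and from_code_units and is_well_formed are hypothetical names:

```python
class PyString:
    """Stand-in for a fully constructed string of UTF-16 code units."""

    def __init__(self, units):
        self.units = tuple(units)

    @classmethod
    def from_code_units(cls, units):
        """Build from raw code units, picking the subclass if they are invalid."""
        units = tuple(units)
        kind = PyString if is_well_formed(units) else InvalidPyString
        return kind(units)


class InvalidPyString(PyString):
    """Marks the rare string that contains at least one lone surrogate."""


def is_well_formed(units):
    """True if every surrogate in units is part of a lead/trail pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A lead surrogate must be followed by a trail surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A trail surrogate with no preceding lead is a lone surrogate.
            return False
        else:
            i += 1
    return True


good = PyString.from_code_units([0x61, 0xD83D, 0xDE00])   # 'a' + U+1F600
bad = PyString.from_code_units([0x61, 0xDCE9])            # 'a' + lone trail
```

Code that only ever sees plain PyString instances can then assume well-formed data, and the slower or stricter paths are confined to InvalidPyString.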

If you want to avoid invalid surrogates even then, just use the replacement character and keep a separate list of "original characters that got replaced in this string" -- a hassle, but no worse than tracking indices for surrogates.
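The replacement-character variant could look something like the following sketch. It works on a Python str, where any code point in U+D800..U+DFFF counts as an unpaired surrogate. sanitize and restore are hypothetical helper names:

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def sanitize(s):
    """Replace each surrogate code point with U+FFFD and record the originals."""
    replaced = []   # (index, original character) pairs, kept alongside the string
    out = []
    for i, ch in enumerate(s):
        if 0xD800 <= ord(ch) <= 0xDFFF:
            replaced.append((i, ch))
            out.append(REPLACEMENT)
        else:
            out.append(ch)
    return "".join(out), replaced


def restore(clean, replaced):
    """Put the recorded originals back at their recorded positions."""
    chars = list(clean)
    for i, ch in replaced:
        chars[i] = ch
    return "".join(chars)


original = "caf\udce9.txt"
clean, replaced = sanitize(original)
```

The cleaned string is always valid Unicode and can be encoded strictly. The side list is the extra bookkeeping, and it has to be kept in step with any slicing or concatenation of the string.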

4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it would have to do it the same way as CPython, as it is visible. It's not impossible (I think), but is messy. Some are strongly against.

If you allow direct write access to the underlying charsequence (as CPython does to C extensions), then you can't really ban invalid sequences. If callers have to go through an API -- even something as minimal as getBytes or getChars -- then you can use whatever internal representation you prefer. Hopefully, the vast majority of strings won't actually have smuggled bytes.

-jJ

If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ
