[Python-Dev] Smuggling bytes in a UTF-16 implementation of str/unicode (was: Multilingual programming article on the Red Hat Developer blog)
Jeff Allen ja.py at farowl.co.uk
Wed Sep 17 09:29:20 CEST 2014
- Previous message: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
- Next message: [Python-Dev] Smuggling bytes in a UTF-16 implementation of str/unicode (was: Multilingual programming article on the Red Hat Developer blog)
This feels like a jython-dev discussion. But anyway ...
On 17/09/2014 00:57, Stephen J. Turnbull wrote:
> The CPython representation uses trailing surrogates only[1], so it's never possible to interpret them as anything but non-characters -- as soon as you encounter them you know that it's a lone surrogate. Surely you can do the same.
>
> As long as the Java string manipulation functions don't check for surrogates, you should be fine with this representation. Of course I suppose your matching functions (etc) don't check for them either, so you will be somewhat vulnerable to bugs due to treating them as characters. But the same is true for CPython, AFAIK.

They don't check. I agree that since only the trailing surrogate code points are allowed, you can tell that you have one, even in the UTF-16 form. The problem is that, if strings containing lone trailing surrogates are allowed, then:
    u'\udc83' in u'abc\U00010083xyz'
    u'abc\U00010083xyz'.endswith(u'\udc83xyz')
are both True, if implemented in the obvious way on the UTF-16 representation. And this should not be so in Jython, which claims to be a wide build. (I can't actually type the second one, but I can get the same effect in Jython 2.7b3 via a java.lang.StringBuilder.) I believe that the usual string operations work correctly on the UTF-16 version of the string, as long as indexes are adjusted correctly.
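To make the false positives concrete, here is a minimal sketch in Python 3 that imitates the "obvious" implementation by comparing raw UTF-16 code units. The helper names (utf16_units, naive_contains, naive_endswith) are made up for illustration and are not Jython's actual code; it only shows why a lone U+DC83 matches the trailing half of the pair that encodes U+10083.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a tuple of ints.

    'surrogatepass' lets lone surrogates through instead of raising.
    """
    raw = s.encode("utf-16-le", "surrogatepass")
    return struct.unpack("<%dH" % (len(raw) // 2), raw)

def naive_contains(haystack, needle):
    # Hypothetical helper: substring search on code units, no surrogate checks.
    h, n = utf16_units(haystack), utf16_units(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))

def naive_endswith(s, suffix):
    # Hypothetical helper: suffix comparison on code units, no surrogate checks.
    u, v = utf16_units(s), utf16_units(suffix)
    return len(v) <= len(u) and u[len(u) - len(v):] == v

text = "abc\U00010083xyz"   # U+10083 is stored as the pair D800 DC83

print(naive_contains(text, "\udc83"))        # True: matches half of the pair
print(naive_endswith(text, "\udc83xyz"))     # True: same mistake at the end
print("\udc83" in text)                      # False: code-point semantics
```

The last line is what a wide build is expected to answer; the first two are what the code-unit shortcut gives.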
If we think it is ok that code using such methods gives the wrong answer when fed strings containing smuggled bytes, then isolated (trailing) surrogates could be allowed. It's the user's fault for calling the method on that data. But I think it kinder that our implementation defend users from these wrong answers. In the latest state of Jython, we do this by rigorously preventing the construction of a PyUnicode containing a lone surrogate, so we can just use UTF-16 operations without further checks.
I'm not sure that rigour will be universally welcomed, and clearly it precludes PEP-383 byte smuggling.
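As a rough illustration of that trade-off, the sketch below (Python 3) shows the kind of check a constructor could apply to a sequence of UTF-16 code units, and why a PEP 383 smuggled byte fails it. The function name find_lone_surrogate is hypothetical; this is not how Jython's PyUnicode is actually written.

```python
def find_lone_surrogate(units):
    """Return the index of the first unpaired surrogate in a sequence of
    UTF-16 code units, or -1 if every surrogate is part of a valid pair."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # Lead surrogate: only valid if a trail surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i
        if 0xDC00 <= u <= 0xDFFF:
            # Trail surrogate with no lead before it.
            return i
        i += 1
    return -1

# A well-formed pair (U+10083) passes the check.
print(find_lone_surrogate([0x61, 0xD800, 0xDC83]))

# PEP 383: an undecodable byte 0x83 is smuggled as the lone surrogate U+DC83,
# which is exactly what the check rejects.
smuggled = b"\x83".decode("utf-8", "surrogateescape")
print(find_lone_surrogate([ord(c) for c in smuggled]))
```

A constructor that refuses any input for which this returns a non-negative index can rely on plain UTF-16 operations afterwards, at the cost of never being able to hold the smuggled form.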
Jeff