[Python-Dev] Multilingual programming article on the Red Hat Developer blog
Jim Baker jim.baker at python.org
Tue Sep 16 19:55:31 CEST 2014
Great points here - I especially like the concluding statement "you can't treat it as a pure Unicode string - it's a Unicode string with smuggled bytes"
Given that Jython uses UTF-16 as its representation, it is frequently possible to smuggle isolated surrogates in it. A surrogate pair must be a high surrogate in range(0xD800, 0xDC00), then a low surrogate in range(0xDC00, 0xE000). So for a surrogate that is not part of such a pair, one can likely assign the interpretation that it is in fact an isolated surrogate, and not part of an actual codepoint.
Of course, if you do actually have a smuggled isolated high surrogate FOLLOWED by a smuggled isolated low surrogate - guess what, the only interpretation is a codepoint. Or perhaps more likely garbage. Of course it doesn't happen so often, so maybe we are fine with the occasional bug ;)
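To make the ambiguity concrete, here is a minimal sketch in plain Python of how a UTF-16 based runtime would pair up code units. This is only an illustration, not Jython's actual code; the name to_code_points and the sample values are made up for the example.

```python
HIGH = range(0xD800, 0xDC00)  # high (lead) surrogates
LOW = range(0xDC00, 0xE000)   # low (trail) surrogates


def to_code_points(units):
    """Combine UTF-16 code units into code points (hypothetical helper)."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if unit in HIGH and nxt is not None and nxt in LOW:
            # A high surrogate followed by a low surrogate is read as one
            # code point, even if both were smuggled in independently.
            out.append(0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            # Anything else, including an unpaired surrogate, is kept as is.
            out.append(unit)
            i += 1
    return out


# Two independently smuggled surrogates that happen to be adjacent.
merged = to_code_points([0xD83D, 0xDE00])

# The same two values in the opposite order do not form a pair.
kept_apart = to_code_points([0xDE00, 0xD83D])
```

In the first case the two smuggled values collapse into the single code point 0x1F600 and the information that they were separate is gone; in the second they survive as two isolated surrogates.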
I personally suspect that we will resolve this by also supporting UCS-4 as a representation in Jython 3.x for such Unicode strings, albeit with the limitation that we have simply moved the problem to when we try to call Java methods taking java.lang.String objects.
- Jim
On Tue, Sep 16, 2014 at 9:27 AM, Chris Angelico <rosuav at gmail.com> wrote:
> On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray <rdmurray at bitdance.com> wrote:
> > That isn't the case in the email package. The smuggled bytes are not
> > errors[*], they are literally smuggled bytes.
>
> But they're not characters, which is what Stephen and I were saying - and contrary to what Jim said about treating them as characters. At best, they represent characters but in some encoding other than the one you're using, and you have no idea how many bytes form a character or anything. So you can't, for instance, word-wrap the text, because you can't know how wide these unknown bytes are, whether they represent spaces (wrap points), or newlines, or anything like that. You can't treat them as characters, so while you have them in your string, you can't treat it as a pure Unicode string - it's a Unicode string with smuggled bytes.
>
> ChrisA
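For reference, the smuggling discussed above can be reproduced in CPython 3 with the surrogateescape error handler. This is a small sketch with an invented sample byte string, assuming the data is decoded as ASCII; it is not taken from the email package itself.

```python
# Latin-1 encoded bytes; 0xE9 is not valid ASCII.
raw = b"caf\xe9 au lait"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("ascii", "surrogateescape")

# The smuggled bytes show up as code points in the range U+DC80..U+DCFF.
smuggled = [ch for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("ascii", "surrogateescape")

# A strict encoder refuses the string, because it is not pure Unicode text.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The string has the same length as the byte string, but the smuggled position carries no usable character information: nothing in text says whether 0xE9 was a whole character or part of a multi-byte sequence.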
--
- Jim
jim.baker@{colorado.edu|python.org|rackspace.com|zyasoft.com} twitter.com/jimbaker github.com/jimbaker bitbucket.com/jimbaker