[Python-Dev] Multilingual programming article on the Red Hat Developer blog
Chris Angelico rosuav at gmail.com
Tue Sep 16 20:02:11 CEST 2014
- Previous message: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
- Next message: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray <rdmurray at bitdance.com> wrote:
>> You can't treat them as characters, so while you have them in your string, you can't treat it as a pure Unicode string - it's a Unicode string with smuggled bytes.
>
> Well, except that I do. The email header parsing algorithms all work fine if I treat the surrogate escaped bytes as 'unknown junk' and just parse based on the valid unicode. (Unless the header is so garbled that it can't be parsed, of course, at which point it becomes an invalid header).
Do what, exactly? As I understand you, you treat the unknown bytes as completely opaque, not representing any characters at all. Which is what I'm saying: those are not characters.
If you, instead, represented the header as a list with some str elements and some bytes, it would be just as valid (though much harder to work with); all your manipulations are done on the str parts, and the bytes just tag along for the ride.
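To make that concrete, here is a rough sketch of the two representations side by side. It assumes the header was decoded as ASCII with the surrogateescape error handler; the sample header and the helper name split_smuggled are made up for illustration and are not anything the email package provides.

```python
import re

# Hypothetical raw header containing one non-ASCII byte of unknown encoding.
raw = b"Subject: caf\xe9 menu"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("ascii", errors="surrogateescape")

# A run of one or more smuggled bytes (U+DC80..U+DCFF).
_SMUGGLED = re.compile("([\udc80-\udcff]+)")


def split_smuggled(s):
    """Split s into a list of str (real text) and bytes (smuggled runs)."""
    parts = []
    for chunk in _SMUGGLED.split(s):
        if not chunk:
            continue  # re.split yields empty strings at the edges
        if _SMUGGLED.fullmatch(chunk):
            # Turn the surrogates back into the original bytes.
            parts.append(chunk.encode("ascii", errors="surrogateescape"))
        else:
            parts.append(chunk)
    return parts


parts = split_smuggled(text)
# parts == ["Subject: caf", b"\xe9", " menu"]

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("ascii", errors="surrogateescape")
```

Either form carries the same information; the single str is simply easier to slice and search, as long as nothing tries to treat the surrogates as characters.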
> You are right about the wrapping, though. If a header with invalid bytes (and in this scenario we are talking about errors) needs to be wrapped, we have to first decode the smuggled bytes and turn it into an 'unknown-8bit' encoded word before we can wrap the header.
Yeah, and that's going to be a bit messy. If you get 60 characters followed by 30 unknown bytes, where do you wrap it? Dare you wrap in the middle of the smuggled section?
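A minimal sketch of the cautious answer, assuming we simply refuse to cut inside a smuggled run. The helper wrap_point and the 78-column width are assumptions for illustration only; this is not how the email package folds headers, and it ignores the usual preference for breaking at whitespace.

```python
def is_smuggled(ch):
    """True if ch is a surrogateescape'd byte (U+DC80..U+DCFF)."""
    return "\udc80" <= ch <= "\udcff"


def wrap_point(s, width=78):
    """Return an index <= width where s can be cut without splitting
    a run of smuggled bytes. Hypothetical helper, not a library API.
    A result of 0 means the run starts at the very beginning, so
    there is no safe cut within the width."""
    if len(s) <= width:
        return len(s)
    cut = width
    # Back up while the cut would fall between two smuggled bytes.
    while cut > 0 and is_smuggled(s[cut - 1]) and is_smuggled(s[cut]):
        cut -= 1
    return cut


# 60 ordinary characters followed by 30 unknown bytes, as in the example above.
header = "a" * 60 + (b"\xff" * 30).decode("ascii", errors="surrogateescape")
cut = wrap_point(header)  # 60: the cut lands just before the smuggled section
```

That leaves a 30-byte tail that still has to be handled as one unit, which is where converting it to an encoded word first comes in.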
ChrisA