[Python-Dev] Bytes path support
Stephen J. Turnbull stephen at xemacs.org
Sat Aug 23 10:02:25 CEST 2014
- Previous message: [Python-Dev] Bytes path support
- Next message: [Python-Dev] Bytes path support
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Chris Barker writes:
The third is to specify UTF-8 with the surrogateescape error handler. This allows non-UTF-8 bytes to be loaded into memory.
Read as bytes and incrementally decode. If you hit an Exception, retry from that point.
Just so I'm clear here -- if you write that back out, encoded as utf-8 -- you'll get the exact same binary blob out as came in?
If and only if there are no changes to the content.
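A minimal sketch of that round trip, assuming the same surrogateescape handler is used on both decode and encode. The sample bytes and variable names are made up for illustration:

```python
# Illustrative input: valid UTF-8 text followed by two bytes that are not valid UTF-8.
raw = "caf\u00e9 ".encode("utf-8") + b"\xff\xfe"

# Undecodable bytes are mapped to lone surrogates instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Unchanged content encodes back to the identical byte string.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# Once the content is edited, the output naturally differs from the input,
# although the escaped bytes themselves still come back out untouched.
edited = text.replace("caf", "CAF")
edited_bytes = edited.encode("utf-8", errors="surrogateescape")

print(round_tripped == raw)
print(edited_bytes == raw)
```

Encoding `text` without the error handler would raise UnicodeEncodeError, because lone surrogates are not encodable in strict UTF-8.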
I wonder if this would make it hard to preserve byte boundaries, though.
I'm not sure what you mean by "byte boundaries". If you mean after concatenation of such objects, yes, the uninterpretable bytes will be encoded in such a way as to be identifiable as lone bytes; they won't be interpreted as Unicode characters.
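To illustrate the concatenation case, here is a small sketch. The helper name `escaped_bytes` is hypothetical, and the inputs are chosen so that the two stray bytes would form a valid UTF-8 sequence if they were joined as bytes:

```python
def escaped_bytes(text):
    """Return the raw byte values that surrogateescape smuggled into text.

    Escaped bytes 0x80..0xFF appear as lone surrogates U+DC80..U+DCFF.
    """
    return [ord(ch) - 0xDC00 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Each piece ends or starts with half of the UTF-8 encoding of U+00E9.
a = b"caf\xc3".decode("utf-8", errors="surrogateescape")
b = b"\xa9!".decode("utf-8", errors="surrogateescape")

joined = a + b

# The two bytes stay identifiable as lone bytes; they are not merged
# into the character U+00E9 inside the str object.
print(escaped_bytes(joined))
print("\u00e9" in joined)

# Encoding still reproduces the concatenated bytes exactly.
print(joined.encode("utf-8", errors="surrogateescape"))
```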
By the way, IIUC, you can also use the Python latin-1 decoder -- anything latin-1 will come through correctly, anything not valid latin-1 will come in as garbage, but if you re-encode with latin-1 the original bytes will be preserved. I think this will also preserve a 1:1 relationship between character count and byte count, which could be handy.
Bad idea, especially for Oleg's use case -- you can't decode those by codec without re-encoding to bytes first. No point in abandoning codecs just because there isn't one designed exactly for his use case. Just read as bytes and decode piecewise in one way or another. For Oleg's HTML case, there's a well-understood structure that can be used to determine retry points and a very few plausible coding systems, which can be fairly well distinguished by the range of bytes used and probably nearly perfectly with additional information from the structure and distribution of apparently decoded characters.
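One way to sketch "read as bytes and decode piecewise" is shown below. It is only an illustration under stated assumptions: line breaks stand in for the structural retry points, the candidate codec list (`utf-8`, `cp1251`) is a guess rather than anything from Oleg's actual data, and the function name `decode_piecewise` is hypothetical. A real implementation would use the HTML structure and byte-range statistics to pick the codec instead of simply trying them in order.

```python
def decode_piecewise(data, codecs=("utf-8", "cp1251")):
    """Decode bytes piece by piece, retrying each piece with the next codec.

    Returns a list of (codec_name, decoded_text) pairs, one per piece.
    """
    pieces = []
    # Assumption: line boundaries are acceptable retry points.
    for line in data.splitlines(keepends=True):
        for codec in codecs:
            try:
                pieces.append((codec, line.decode(codec)))
                break
            except UnicodeDecodeError:
                continue  # this codec does not fit; retry with the next one
        else:
            # No candidate fit: keep the bytes recoverable via surrogateescape.
            escaped = line.decode("utf-8", errors="surrogateescape")
            pieces.append(("utf-8+surrogateescape", escaped))
    return pieces


# Illustrative input: one UTF-8 line followed by one cp1251 line.
mixed = ("\u043f\u0440\u0438\u0432\u0435\u0442\n".encode("utf-8")
         + "\u043c\u0438\u0440\n".encode("cp1251"))

for codec, text in decode_piecewise(mixed):
    print(codec, repr(text))
```

Trying codecs in a fixed order is the weakest part of this sketch: a permissive single-byte codec accepts almost any input, so the order matters and strict codecs such as UTF-8 should come first.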