[Python-Dev] bytes type discussion

Stephen J. Turnbull stephen at xemacs.org
Wed Feb 15 11:06:21 CET 2006


"Fred" == Fred L Drake, <fdrake at acm.org> writes:

Fred> On Tuesday 14 February 2006 22:34, Greg Ewing wrote:

>> Seems to me this is a case where you want to be able to change
>> encodings in the middle of reading the stream.  You start off
>> reading the data as ascii, and once you've figured out the
>> encoding, you switch to that and carry on reading.

Fred> Not quite.  The proper response in this case is often to
Fred> re-start decoding with the correct encoding, since some of
Fred> the data extracted so far may have been decoded incorrectly.
Fred> A very carefully constructed application may be able to go
Fred> back and re-decode any data saved from the stream with the
Fred> previous encoding, but that seems like it would be pretty
Fred> fragile in practice.

I believe GNU Emacs is currently doing this. AIUI, they save annotations where the codec is known to be non-invertible (e.g., two charset-changing escape sequences in a row). I do think this is fragile, and a robust application really should buffer everything it isn't sure it has decoded correctly.
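
A minimal sketch of the buffer-everything approach, to make the idea concrete. The class name BufferingDecoder and the "coding: NAME" declaration format are assumptions made up for this illustration, not anything Emacs or Python actually uses. The point is only that the raw bytes are kept, so decoding can always restart from the beginning once the encoding is known.

```python
import re

# Hypothetical declaration format, used only for this illustration.
# The trailing newline is required so a name split across chunks is not
# accepted before it has arrived completely.
DECL = re.compile(rb"coding[:=]\s*([-\w.]+)[ \t]*\r?\n")


class BufferingDecoder:
    """Keep the raw bytes until the encoding is known, then decode them all."""

    def __init__(self, provisional="ascii"):
        self.raw = bytearray()        # everything read so far, undecoded
        self.encoding = provisional   # best guess until a declaration is seen
        self.confirmed = False

    def feed(self, chunk):
        self.raw += chunk
        if not self.confirmed:
            match = DECL.search(self.raw)
            if match:
                self.encoding = match.group(1).decode("ascii")
                self.confirmed = True

    def text(self):
        # Always decode from the start of the buffer, so nothing decoded
        # under the provisional guess leaks into the result.
        return self.raw.decode(self.encoding, errors="replace")


dec = BufferingDecoder()
dec.feed("caf\u00e9 ".encode("latin-1"))   # arrives before the declaration
dec.feed(b"# coding: latin-1\n")
print(dec.text())
```

The cost is memory: the whole undecoded prefix stays around until the application is sure. That is the trade against saving annotations and trying to patch up already-decoded text.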

Fred> There may be cases where switching encoding on the fly makes
Fred> sense, but I'm not aware of any actual examples of where
Fred> that approach would be required.

This is exactly what ISO 2022 formalizes: switching encodings on the fly.
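
As a small illustration, here is a sketch using the iso2022_jp codec that ships with Python. The sample string is my own; the escape sequences in the comments are the standard ISO-2022-JP designations. The incremental decoder keeps the currently designated charset as state between calls, so the switch happens in the middle of the stream without restarting.

```python
import codecs

text = "abc \u65e5\u672c\u8a9e def"   # ASCII, three kanji, ASCII again
data = text.encode("iso2022_jp")

# ESC $ B designates JIS X 0208; ESC ( B switches back to ASCII.
has_switch_to_kanji = b"\x1b$B" in data
has_switch_to_ascii = b"\x1b(B" in data

# Feed one byte at a time: the decoder carries the current charset
# (and any partial escape sequence or half character) across calls.
decoder = codecs.getincrementaldecoder("iso2022_jp")()
pieces = [decoder.decode(bytes([b])) for b in data]
pieces.append(decoder.decode(b"", final=True))
roundtrip = "".join(pieces)

print(data)
print(roundtrip == text)
```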

mboxes of Japanese mail often contain random and unsignaled encoding changes.

A terminal emulator may need to switch when logging in to a remote system.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.


