[Python-Dev] Bytes path related questions for Guido

R. David Murray rdmurray at bitdance.com
Thu Aug 28 20:43:51 CEST 2014


On Thu, 28 Aug 2014 10:54:44 -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:

On 8/28/2014 10:41 AM, R. David Murray wrote:
> On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
>> On 8/28/2014 12:30 AM, MRAB wrote:
>>> There'll be a surrogate escape if a byte couldn't be decoded, but just
>>> because a byte could be decoded, it doesn't mean that it's correct.
>>>
>>> If you picked the wrong encoding, the other codepoints could be wrong
>>> too.
>> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
>> is also pretty useless to "replacesurrogateescapes" at all, because it
>> only cleans out the non-decodable characters, not the incorrectly
>> decoded characters.
> Well, replace would still be useful for ASCII+surrogateescape.

How?

Because there "can't" be any incorrectly decoded bytes in the ASCII part, so turning all the undecodable bytes into 'unrecognized character' glyphs is useful. "Can't" is in quotes because, of course, if you decode random binary data as ASCII+surrogateescape you could get a mess just like with any other encoding. So this is really a "more likely to be useful" version of my second point: "real" ASCII with some junk bytes mixed in is much more likely to be encountered in the wild than, say, utf-8 with some junk bytes mixed in (although that is probably changing as use of utf-8 becomes more widespread, so this point applies to utf-8 as well).
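A minimal sketch of the ASCII+surrogateescape case, assuming the junk bytes are all non-ASCII (0x80 and above). The helper name `replace_surrogate_escapes` is hypothetical, not an existing stdlib function; it only illustrates the kind of "replace" operation being discussed, using U+FFFD as the 'unrecognized character' glyph.

```python
def replace_surrogate_escapes(text, replacement="\ufffd"):
    """Swap every surrogateescape code point (U+DC80..U+DCFF) for `replacement`.

    Hypothetical helper for illustration only.
    """
    return "".join(
        replacement if "\udc80" <= ch <= "\udcff" else ch
        for ch in text
    )


# "Real" ASCII with two junk bytes mixed in.
raw = b"hello \xff\xfe world"

# The ASCII part decodes exactly; each undecodable byte 0xNN becomes U+DCNN.
decoded = raw.decode("ascii", errors="surrogateescape")

# Version that is safe to hand to a display layer: no lone surrogates left.
displayable = replace_surrogate_escapes(decoded)

# The un-replaced string still round-trips to the original bytes.
assert decoded.encode("ascii", errors="surrogateescape") == raw
```

Note that the replacement is lossy: once the surrogates are swapped for U+FFFD, the original byte values can no longer be recovered from `displayable`, so it would only be suitable for display.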

> Also for
> cases where the data stream is supposed to be in a given encoding, but
> contains undecodable bytes. Showing the stuff that incorrectly decodes
> as whatever it decodes to is generally what you want in that case.

Sure, people can learn to recognize mojibake for what it is, and maybe even learn to recognize it for what it was intended to be, in limited domains. But suppressing/replacing the surrogates doesn't help with

Well, it does if the alternative is not being able to display the string to the user at all. And yeah, people being able to recognize mojibake in specific problem domains is what I'm talking about...not perhaps a great use case, but it is a use case.

that... would it not be better to replace the surrogates with an escape sequence that shows the original, undecodable, byte value? Like \xNN ?

Yeah, that idea has been floated as well, and I think it would indeed be more useful than the 'unknown character' glyph. I've also seen fonts that display the hex code inside a box character when the code point is unknown, which would be cool...but that can hardly be part of unicode, can it? :)
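The \xNN idea can be sketched as follows. This is only an illustration under stated assumptions: `show_escaped_bytes` is a hypothetical name, and the sketch relies on the fact that surrogateescape maps an undecodable byte 0xNN to the code point U+DCNN, so the byte value is the code point minus 0xDC00.

```python
import re

# Code points produced by the surrogateescape error handler.
_SURROGATE_ESCAPE = re.compile("[\udc80-\udcff]")


def show_escaped_bytes(text):
    """Replace each surrogateescape code point with a visible \\xNN escape.

    Hypothetical helper for illustration only.
    """
    def to_hex(match):
        byte_value = ord(match.group()) - 0xDC00  # U+DCNN -> 0xNN
        return "\\x{:02x}".format(byte_value)

    return _SURROGATE_ESCAPE.sub(to_hex, text)


# Latin-1 encoded "café" mistakenly read as ASCII: the 0xE9 byte is undecodable.
decoded = b"caf\xe9 menu".decode("ascii", errors="surrogateescape")
shown = show_escaped_bytes(decoded)
```

Unlike the 'unknown character' glyph, this keeps the original byte value visible to the reader. It does not distinguish an escaped byte from a literal backslash sequence that was already in the text, which a real implementation might need to handle.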

--David


