[Python-Dev] PEP 460 reboot
Nick Coghlan ncoghlan at gmail.com
Tue Jan 14 06:25:29 CET 2014
On 14 January 2014 15:03, Guido van Rossum <guido at python.org> wrote:
> I don't think it's that easy. Just searching for '{' is enough to break in surprising ways unless the format string is encoded in an ASCII superset. I can think of two easy examples to illustrate this (they're similar to the example I posted here before about the essential ASCII-ness of %c).
>
> First, let's consider EBCDIC. The '{' character in ASCII is hex 7B (decimal 123). I looked it up (http://en.wikipedia.org/wiki/EBCDIC) and that is the '#' character in EBCDIC. Surprised yet?
>
> Next, let's consider UTF-16. This encoding uses two bytes per character (except for surrogates), so any character whose top half or bottom half happens to be 7B hex will cause an incorrect hit for your regular expression. Ouch.
>
> Of course, nobody in their right mind would use a format string containing UTF-16 or EBCDIC. And that is precisely my point. When you're using a format string, all of the format string (not just the part between { and }) had better use ASCII or an ASCII superset. And this (rightly) constrains the output to an ASCII superset as well.
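
Both examples above can be checked directly. This is only a minimal sketch, assuming Python 3 and using the standard cp037 codec as a stand-in for "EBCDIC" (there are several EBCDIC code pages) and U+017B as an arbitrary character with 7B as one of its UTF-16 bytes:

```python
# The single byte a bytes-level parser would search for when looking for '{'.
LBRACE = b"{"  # 0x7B

# In an ASCII superset such as UTF-8, 0x7B really is '{'.
utf8_brace = "{".encode("utf-8")

# In EBCDIC (cp037 assumed here), 0x7B decodes as '#',
# and '{' itself lives at a different byte value.
ebcdic_meaning_of_7b = LBRACE.decode("cp037")
ebcdic_brace = "{".encode("cp037")

# In UTF-16, 0x7B can be one half of an unrelated character.
# U+017B (LATIN CAPITAL LETTER Z WITH DOT ABOVE) is 7B 01 in little-endian UTF-16.
utf16_text = "\u017b".encode("utf-16-le")
false_hit = LBRACE in utf16_text  # the text contains no '{' at all

print(utf8_brace, ebcdic_meaning_of_7b, ebcdic_brace, utf16_text, false_hit)
```

The search for 0x7B is only meaningful in the first case; in the other two it either matches the wrong character or matches half of one.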
In case it got lost amongst the various threads, this was the argument that finally convinced me that interpolation inherently assumes an ASCII-compatible encoding: the assumption of ASCII compatibility is embedded in the design of the formatting syntax for both printf-style formatting and the format methods. That places interpolation support squarely in the same category as all the other bytes methods that inherently assume ASCII, and thus remains consistent with the Python 3 text model.
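
For comparison, here is a small sketch of one of the existing bytes methods that already assumes ASCII. It uses bytes.upper() as the example, with cp037 again assumed as a representative EBCDIC code page:

```python
# bytes.upper() only maps the ASCII byte values for a-z (0x61-0x7A) to A-Z.
ascii_data = "abc".encode("ascii")
ebcdic_data = "abc".encode("cp037")  # b'\x81\x82\x83' in this code page

ascii_upper = ascii_data.upper()    # the ASCII letters are upper-cased
ebcdic_upper = ebcdic_data.upper()  # unchanged: no byte is an ASCII lowercase letter

print(ascii_upper.decode("ascii"))
print(ebcdic_upper.decode("cp037"))
```

The method is perfectly well defined on arbitrary bytes, but it only does something useful when the data is in an ASCII-compatible encoding.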
Originally I was thinking that the ASCII assumption applied only if one of the passed-in values needed to be implicitly encoded as ASCII, without accounting for the fact that the parser itself assumes ASCII compatibility when searching for formatting metacharacters. Once Guido pointed out that oversight on my part, my objections collapsed: the observation makes it clear that there's no coherent way to offer a pure binary interpolation API. The only general-purpose mechanism for combining segments of binary data without making assumptions about the encoding of metacharacters is simple concatenation.
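
A rough sketch of that contrast follows. Both naive_interpolate and concatenate are hypothetical helpers written for illustration, not any proposed API, and U+7D7B is chosen only because its little-endian UTF-16 encoding happens to be the bytes 7B 7D:

```python
def naive_interpolate(template: bytes, value: bytes) -> bytes:
    """Hypothetical helper: substitute value for the first b'{}' byte pair."""
    return template.replace(b"{}", value, 1)


def concatenate(*segments: bytes) -> bytes:
    """Join segments without inspecting their contents."""
    return b"".join(segments)


# ASCII-compatible template: the byte pair b'{}' really is the placeholder.
utf8_result = naive_interpolate("id={}".encode("utf-8"), b"42")

# UTF-16 template containing no placeholder at all: U+7D7B encodes as 7B 7D,
# which is byte-for-byte identical to b'{}', so the character gets replaced.
utf16_template = "\u7d7b".encode("utf-16-le")
utf16_result = naive_interpolate(utf16_template, "42".encode("utf-16-le"))

# Concatenation never searches for metacharacters, so it makes no assumption
# about how the segments are encoded.
joined = concatenate("id=".encode("utf-16-le"), "42".encode("utf-16-le"))

print(utf8_result, utf16_result.decode("utf-16-le"), joined.decode("utf-16-le"))
```

The interpolating helper silently destroys the UTF-16 character, while the concatenating one has nothing to get wrong.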
Regards, Nick.
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia