[Python-Dev] PEP 263 - default encoding
Martin v. Loewis martin@v.loewis.de
18 Mar 2002 09:04:05 +0100
"Stephen J. Turnbull" <stephen@xemacs.org> writes:
> "The parser accepts programs encoded in Unicode. We provide some hooks to help you get from encodings convenient for your environment to Unicode, and some sample implementations of things to hang on the hooks. But if there are problems with non-Unicode files, they're your problems."
I still can't see how this is different from what the PEP says. "encoded in Unicode" is, of course, a weak statement, since Unicode is not an encoding (UTF-8 would be). With the PEP, people can write source code in different encodings, but any problems they get are their problems.
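For illustration, here is what such a declaration looks like under the PEP (a minimal sketch; the file contents are made up):

    # -*- coding: iso-8859-1 -*-
    # The cookie above tells the parser how to decode this file.
    title = u"Königsberg"   # Unicode literal containing a Latin-1 character
    print(title)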
> o There may be some audiences who are poorly served (Mr. Suzuki).
In phase two of the PEP, I don't think there will be large audiences that are poorly served. If someone wants to write Python source in a then-unsupported encoding, they can write "hooks" to support it; e.g. for importing modules, they just need to redefine __import__.
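A minimal sketch of such a hook (the helper name import_with_encoding is hypothetical; a real hook would wrap __import__ itself):

    import sys
    import types

    def import_with_encoding(name, path, encoding):
        # Load a module whose source is in an encoding the
        # interpreter does not handle natively: read the raw bytes,
        # decode them via the codec library, then compile and run.
        f = open(path, 'rb')
        try:
            source = f.read().decode(encoding)
        finally:
            f.close()
        module = types.ModuleType(name)
        code = compile(source, path, 'exec')
        exec(code, module.__dict__)
        sys.modules[name] = module
        return module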
> o I think it will definitely tend to encourage use of national/platform encodings rather than UTF-8 in source. It may be hard to get this sun to set.
It is traditional Python policy not to take sides in political debates. If this sun does not set, what is the problem?
> o I think it makes it hard to implement helper tools (e.g. python-mode).
Harder than with those hooks? That's hard to believe. I assume you primarily care about editors here. Editors either support multiple encodings or they don't. If they don't, it is best to write your source code in the encoding your editor supports and declare that encoding for Python. If they do support different encodings, they may already recognize the declared encoding correctly; if not, you may need to add an additional declaration. Off-hand, I can't think of an editor where this would be necessary.
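Recognizing the declaration is little work for a tool author; PEP 263 gives the regular expression to use. A minimal sketch (the helper name is made up; today's Python ships this logic as tokenize.detect_encoding):

    import re

    # Emacs-style coding cookie, per PEP 263, searched for in the
    # first or second line of the file.
    coding_re = re.compile(r'coding[:=]\s*([-\w.]+)')

    def find_declared_encoding(path):
        f = open(path, 'rb')
        try:
            lines = [f.readline(), f.readline()]
        finally:
            f.close()
        for line in lines:
            m = coding_re.search(line.decode('ascii', 'replace'))
            if m:
                return m.group(1)
        return None   # no declaration: the PEP's default applies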
> Guido> I think even Mr. Suzuki isn't thinking of using UTF-16 in
> Guido> his Unicode literals. He currently sets UTF-16 as the
> Guido> default encoding for data that he presumably reads from a
> Guido> file.
>
> Well, I'm not a native Japanese. But I have often edited English strings that occur in swaths of unrecognizable octets that would be Japanese if I had the terminal encoding set correctly. I have also cut and pasted encoded Japanese into "binary" buffers.
>
> And how is he going to use regexps or formatting sugar without literal UTF-16 strings?
In stage 1 of the implementation, he can use either UTF-8 or EUC-JP in Unicode literals. In stage 2, he can also use Shift_JIS and iso-2022-jp.
> Right, as long as by "work" you mean "it's formally undefined but 8-bit clean stuff just passes through." The problem is that people often do unclean things, like typing ALT 1 8 5 to insert a 0xB9 octet, which the editor assumes is intended to be 'š' in a Latin-2 locale. However, if that file (which the user knows contains no Latin-2 at all) is read in a Latin-2 locale and translated to Unicode, the byte value changes (in fact, it's no longer a byte value). What's a parser to do?
I'm not sure I follow this example. If you put byte 185 into a Python source file and declare the file as Latin-2, what does the locale have to do with it? PEP 263 never mentions using the locale for anything.
> This can be made safe by not decoding the contents of ordinary string literals, but that requires the parser to do the lexing itself; you can't delegate it to a general-purpose codec.
Why is that? If the declared encoding of the file is Latin-2, the parser will convert the file into Unicode, parse it, then re-encode the byte string literals back into Latin-2.
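A sketch of that round trip (the source line is made up; 'iso-8859-2' is Latin-2, in which byte 0xB9 is 'š'):

    # Decode/parse/re-encode round trip for an 8-bit string literal
    # in a file declared as Latin-2.
    raw = b'name = "Brno\xb9"\n'          # raw file bytes
    text = raw.decode('iso-8859-2')       # step 1: whole file to Unicode
    # ... the parser lexes `text` and finds the literal 'Brno\u0161' ...
    literal = text[8:-2]                  # characters between the quotes
    value = literal.encode('iso-8859-2')  # step 2: re-encode the literal
    assert value == b'Brno\xb9'           # the original bytes come back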
> Bingo. And files which until that point embedded arbitrary binary (i.e., bytes not representing characters) quite possibly stop working.
Breakage won't be silent, though. People will get a warning in phase 1, so they will know to declare an encoding.
If they have truly binary data in their source files (which I believe is rare), they are advised to change those bytes to \x escapes.
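Concretely (a made-up literal): instead of embedding the raw byte in the file, spell it as an escape, which is pure ASCII and survives any declared source encoding:

    # The escape denotes byte 0xB9 regardless of how the source
    # file itself is encoded (an 8-bit string of the time; today
    # this would be a bytes literal).
    s = "abc\xb9def"
    assert s == "abc" + chr(0xB9) + "def"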
> This probably mostly works, based on Mule experience. But it requires the parser to have carnal knowledge of coding systems. Isn't it preferable to insist on UTF-8 here, since that simply changes the representation from one or two bytes back to a constant-width one, without changing values?
It is no extra effort to support arbitrary encodings, compared to supporting UTF-8 only. The parser just calls into the codec library, and either gets an error or a Unicode string.
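The parser's side of the bargain is essentially one call (a sketch; the function name is made up, and the encoding name would come from the PEP 263 declaration):

    def decode_source(raw_bytes, declared_encoding):
        # The parser neither knows nor cares which encoding this is:
        # the codec machinery either returns a Unicode string or
        # raises (LookupError for an unknown codec, a decoding error
        # for bytes invalid in the declared encoding).
        try:
            return raw_bytes.decode(declared_encoding)
        except (LookupError, UnicodeDecodeError) as exc:
            raise SyntaxError("bad source encoding: %s" % exc)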
> Also, you'd have to prohibit encodings using ISO 2022 control sequences, as there are always many legal ways to encode the same text (there is no requirement that a mode-switching sequence actually be followed by any text before switching to a different mode), and there's no way to distinguish them except to record the original input.
That is indeed a problem - those byte strings would have different values at run-time. I expect that most users will accept the problem, since the strings still have their original "meaning".
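To make the effect concrete, a sketch (assuming the codec accepts an immediate mode switch back, which ISO 2022 permits): two different byte sequences decode to the same text, so re-encoding cannot restore the original bytes.

    # Two legal ISO-2022-JP spellings of the same text: `wasteful`
    # switches into JIS X 0208 and immediately back to ASCII;
    # `plain` never leaves ASCII.
    wasteful = b'\x1b$B\x1b(Bspam'
    plain = b'spam'
    assert wasteful != plain
    assert wasteful.decode('iso-2022-jp') == plain.decode('iso-2022-jp')
    # Re-encoding yields the canonical form: the redundant escape
    # sequences are gone, so the literal's run-time bytes differ
    # from what stood in the source file.
    assert 'spam'.encode('iso-2022-jp') == plain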
Regards,
Martin