[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

Guido van Rossum guido at python.org
Wed Feb 15 00:14:07 CET 2006

On 2/14/06, M.-A. Lemburg <mal at egenix.com> wrote:

Guido van Rossum wrote:
> As Phillip guessed, I was indeed thinking about introducing bytes()
> sooner than that, perhaps even in 2.5 (though I don't want anything
> rushed).

Hmm, that is probably going to be too early. As the thread shows there are lots of things to take into account, esp. since if you plan to introduce bytes() in 2.x, the upgrade path to 3.x would have to be carefully planned. Otherwise, we end up introducing a feature which is meant to prepare for 3.x and then we end up causing breakage when the move is finally implemented.

You make a good point. Someone probably needs to write up a new PEP summarizing this discussion (or rather, consolidating the agreement that is slowly emerging, where there is agreement, and summarizing the key open questions).

> Even in Py3k though, the encoding issue stands -- what if the file
> encoding is Unicode? Then using Latin-1 to encode bytes by default
> might not be what the user expected. Or what if the file encoding is
> something totally different? (Cyrillic, Greek, Japanese, Klingon.)
> Anything default but ASCII isn't going to work as expected. ASCII
> isn't going to work as expected either, but it will complain loudly
> (by throwing a UnicodeError) whenever you try it, rather than causing
> subtle bugs later.

I think there's a misunderstanding here: in Py3k, all "string" literals will be converted from the source code encoding to Unicode. There are no ambiguities - a Klingon character will still map to the same ordinal used to create the byte content regardless of whether the source file is encoded in UTF-8, UTF-16 or some Klingon charset (are there any?).

OK, so a string (literal or otherwise) containing a Klingon character won't be acceptable to the bytes() constructor in 3.0. It shouldn't be in 2.x either then.
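The claim that a character maps to the same ordinal whatever the source encoding can be checked in present-day Python 3. This is only a sketch: it simulates "source files" by encoding and decoding one character with str.encode() and bytes.decode(), and it uses a Cyrillic letter as a stand-in for the Klingon example. It does not model the bytes() constructor being proposed in this thread.

```python
# One character, stored on disk in three different source encodings.
# (Assumption: a Cyrillic letter stands in for the "Klingon" example.)
char = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE

from_utf8 = char.encode("utf-8").decode("utf-8")
from_utf16 = char.encode("utf-16").decode("utf-16")
from_koi8 = char.encode("koi8_r").decode("koi8_r")

# After decoding, all three are the same Unicode string with the same ordinal.
same_text = from_utf8 == from_utf16 == from_koi8
ordinal = ord(from_utf8)

# The ordinal is above 255, so it cannot be stored as a single byte value.
fits_in_one_byte = ordinal < 256
```

Because the ordinal does not fit in one byte, there is no unambiguous byte to produce for it without choosing an encoding.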

I still think that someone who types a file in Latin-1 and enters non-ASCII Latin-1 characters in a string literal and then passes it to the bytes() constructor might expect to get bytes encoded in Latin-1, and someone who types a file in UTF-8 and enters non-ASCII Unicode characters might expect to get UTF-8-encoded bytes. Since they can't both get what they want, we should disallow both, and only allow ASCII.
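A small sketch of why the two expectations cannot both be met, written for present-day Python 3 and using str.encode() with an explicit codec rather than the bytes() constructor discussed here. The sample word is an assumption chosen for illustration.

```python
text = "caf\u00e9"  # "café": one non-ASCII character that Latin-1 can represent

# What the author of a Latin-1 source file might expect to get.
latin1_bytes = text.encode("latin-1")

# What the author of a UTF-8 source file might expect to get.
utf8_bytes = text.encode("utf-8")

# The ASCII-only rule: refuse loudly instead of guessing.
try:
    text.encode("ascii")
    ascii_rejected = False
except UnicodeEncodeError:
    ascii_rejected = True
```

The same string yields b"caf\xe9" under Latin-1 and b"caf\xc3\xa9" under UTF-8, while the ASCII codec raises UnicodeEncodeError (a subclass of UnicodeError).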

Furthermore, by restricting to ASCII you'd also rule out hex escapes which seem to be the natural choice for presenting binary data in literals - the Unicode representation would then only be an implementation detail of the way Python treats "string" literals and a user would certainly expect to find e.g. \x88 in the bytes object if she writes bytes('\x88').

I guess we'll just have to disappoint her. Too bad for the person who wrote bytes("\x12\x34\x56\x78\x9a\xbc\xde\xf0") -- they'll have to write bytes([0x12,0x34,0x56,0x78,0x9a,0xbc,0xde,0xf0]). Not so bad IMO and certainly easier than a mixture of hex and ASCII like '\xabc\xdef'.
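For illustration, here is how the list-of-integers spelling behaves in present-day Python 3, alongside the mixed hex-and-ASCII literal that is easy to misread. The bytes.fromhex() call is part of today's Python 3 and is not something proposed in this message; it is shown only as a cross-check.

```python
# The list-of-integers spelling: one element per byte, nothing to misread.
data = bytes([0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0])

# Cross-check using a method from today's Python 3 (not from this thread).
same_data = bytes.fromhex("123456789abcdef0")

# The mixed literal: \x consumes exactly two hex digits, so the "c" and
# "f" that follow are ordinary ASCII letters, not part of the escapes.
mixed = "\xabc\xdef"
code_points = [hex(ord(ch)) for ch in mixed]
```

The mixed literal has four characters (0xab, "c", 0xde, "f"), which is not obvious at a glance.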

But maybe you have something different in mind... I'm talking about ways to create bytes() in Py3k using "string" literals.

I'm not sure that's going to be common practice except for ASCII characters used in network protocols.

>> While we're at it: I'd suggest that we remove the auto-conversion
>> from bytes to Unicode in Py3k and the default encoding along with
>> it.
>
> I'm not sure which auto-conversion you're talking about, since there
> is no bytes type yet. If you're talking about the auto-conversion from
> str to unicode: the bytes type should not be assumed to have any
> properties that the current str type has, and that includes
> auto-conversion.

I was talking about the automatic conversion of 8-bit strings to Unicode - which was a key feature to make the introduction of Unicode less painful, but will no longer be necessary in Py3k.

OK. The bytes type certainly won't have this property.
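As a sketch of what "no auto-conversion" looks like in practice, the snippet below uses the bytes and str types as they exist in present-day Python 3. Whether the type discussed in this thread would behave identically is an assumption.

```python
raw = b"abc"    # binary data
text = "abc"    # Unicode text

# Mixing the two does not trigger any implicit decoding.
try:
    raw + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The conversion has to be requested explicitly, naming the encoding.
explicit = raw.decode("ascii") + text
```

Here the concatenation fails with TypeError, and only the explicit decode() produces a combined string.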

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
