[Python-3000] Heaptypes

Guido van Rossum guido at python.org
Thu Jul 19 20:32:14 CEST 2007

On 7/19/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:

>> __reduce__ currently does (O(s#)) with (obtype, obbytes, obsize).
>> Now, s# creates a Unicode object, and the pickling fails to round-trip
>> correctly.
>
> I thought that before your patch a bytes object roundtripped correctly
> with all three protocols. Or maybe it got broken when s# was changed?

It did, and it got. s# used to return a str8, which then was pickled byte-for-byte. When s# started to return Unicode strings, bytes of 128 and above got widened to Py_UNICODE (which is what PyUnicode_FromString currently does), so b'\xFF' became bytes('\uFFFF').

Ouch!!! This turns out to be a bug in PyUnicode_FromStringAndSize() due to signed characters. It can even cause a segfault:

Python 3.0x (py3k-struni, Jul 18 2007, 11:01:59)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b"\x80".__reduce__()
Segmentation fault

Fixed by applying Py_CHARMASK() to all occurrences of *u in that function. Committed revision 56460.
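
The widening bug described above can be modelled in a few lines of Python. This is only a sketch, not the CPython source: it assumes that char is signed on the platform and that Py_UNICODE is 16 bits wide (consistent with the '\uFFFF' seen in this thread). The helper names widen_unmasked and widen_masked are made up for illustration.

```python
import struct


def widen_unmasked(byte_value: int) -> int:
    """Model a signed C char being stored into a 16-bit unsigned code unit."""
    # Reinterpret the byte as a signed char, as a compiler with signed char would.
    (as_signed_char,) = struct.unpack("b", bytes([byte_value]))
    # Sign extension followed by truncation to 16 bits (assumed Py_UNICODE width).
    return as_signed_char & 0xFFFF


def widen_masked(byte_value: int) -> int:
    """Model the fix: reduce the value to 0..255 before widening."""
    (as_signed_char,) = struct.unpack("b", bytes([byte_value]))
    # Masking with 0xFF is the effect attributed to Py_CHARMASK() in this message.
    return as_signed_char & 0xFF


for value in (0x41, 0x80, 0xFF):
    print(hex(value), hex(widen_unmasked(value)), hex(widen_masked(value)))
```

Under these assumptions, ASCII bytes are unaffected, while 0xFF widens to 0xFFFF without the mask and stays 0xFF with it.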

That got pickled and unpickled; then bytes('\uFFFF') is b'\xef\xbf\xbf' (because it applies the default encoding to the unicode argument), and it failed to roundtrip to b'\xFF'.

It's actually not possible to generate b'\xFF' using a unicode string argument, as the default encoding will never return s'\xFF' (as that's not valid UTF-8).
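
The point about UTF-8 can be checked in a current Python 3 interpreter, which postdates this thread; the variable names below are illustrative only. The sketch shows that U+FFFF encodes to three bytes, that U+00FF encodes to two bytes rather than to a lone 0xFF, and that a lone 0xFF byte is rejected by the UTF-8 decoder.

```python
# U+FFFF under UTF-8 is the three-byte sequence seen in the failed round trip.
widened = "\uffff".encode("utf-8")

# Even U+00FF does not produce a lone 0xFF byte under UTF-8.
latin_small_y = "\xff".encode("utf-8")

# A lone 0xFF byte is not valid UTF-8, so decoding it fails.
try:
    b"\xff".decode("utf-8")
    lone_ff_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_ff_is_valid_utf8 = False

print(widened, latin_small_y, lone_ff_is_valid_utf8)
```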

But you can do it using bytes('\xff', 'latin-1'). I think that's a reasonable thing for bytes.__reduce__() to return.
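
The reason Latin-1 works here is that it maps code points 0 through 255 one-to-one onto byte values 0 through 255. A quick check in a current Python 3 interpreter (variable names are illustrative):

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# Latin-1 decoding maps each byte to the code point with the same number.
as_text = all_bytes.decode("latin-1")

# Encoding back with the same codec recovers the original bytes exactly.
round_tripped = bytes(as_text, "latin-1")

print(round_tripped == all_bytes)
print(bytes("\xff", "latin-1"))
```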

> An additional requirement might be that if bytes are introduced in
> 2.6, a pickle containing bytes written by 3.0 should be readable by
> 2.6.

Sure: whatever we decide now needs to be applied to 2.6 also.

Right.

>> If __reduce__ returns a Unicode object, what encoding should be assumed?
>> (which then needs to be symmetric with bytes())
>>
>> If __reduce__ returns a str8 object, you will have to keep str8 (or
>> else you cannot pickle bytes).
>
> When __reduce__ returns a string at all, that means it's the name of a
> global. I guess that should be encoded using UTF-8, so that as long as
> the name is ASCII, 2.x can unpickle it. But I'm not sure if that's
> what you were asking.

No.

py> b'foo'.__reduce__()
(<type 'bytes'>, ('foo',))
py> b'\xff'.__reduce__()
(<type 'bytes'>, ('\uffff',))

It returns one string each time, as the first element of a one-element tuple (that is then passed to the bytes() constructor on unpickling).

I see. It returns a tuple containing a string. I was confused. Sorry. (But the \uffff is due to the bug above.)

> Anyway, one reason this is such a mess is clearly that the pickle
> protocol has no independent spec -- it's grown organically in code.
> Reverse-engineering the intent of the code is a pain.

That's also true, but I don't see it as much of a problem here. If it had a spec, that spec would have said that b'S', b'T' and b'U' have a str payload. That spec would break if str8 goes away, and the spec would be changed to explain how these codes act in 2.x and 3.x. It would not talk at all about the bytes type, or about the fact that its __reduce__ might return different things in 2.x and 3.x (unless bytes gets a primitive code for pickle).

How about the following? It's not perfect, but it's the best I can think of that doesn't break any pickles.

In 3.0, when an S, T or U pickle code is encountered, the returned value is a Unicode string decoded from the bytes using Latin-1. This means that all S, T or U pickle codes return Unicode objects. In those cases where this was really meant to transfer binary data, the application running under 3.0 can fix this by calling bytes(X, 'latin-1'). If it was meant to be UTF-8-encoded text, the app can call str(Y, 'utf-8') after that.
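
The recovery steps in this proposal can be sketched with the pickle module of a current Python 3, whose loads() accepts an encoding argument for 8-bit string opcodes. That argument postdates this thread, so this illustrates the idea and is not a record of what was implemented at the time. The two protocol-0 pickles are written by hand for the example.

```python
import pickle

# Hand-written protocol-0 pickle: an S opcode carrying the 8-bit string 'ab\xff'.
legacy_pickle = b"S'ab\\xff'\n."

# Decode the payload as Latin-1, so every byte becomes one code point.
as_text = pickle.loads(legacy_pickle, encoding="latin-1")

# If the payload was really binary data, recover it with bytes(X, 'latin-1').
as_binary = bytes(as_text, "latin-1")

# If the payload was UTF-8 text, go back to bytes and then decode as UTF-8.
utf8_pickle = b"S'caf\\xc3\\xa9'\n."
misdecoded = pickle.loads(utf8_pickle, encoding="latin-1")
repaired = str(bytes(misdecoded, "latin-1"), "utf-8")

print(repr(as_text), as_binary, repr(repaired))
```

Because Latin-1 never fails and never loses information, the application can decide after unpickling whether a given value was binary data or encoded text.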

But 3.0 should only generate the S, T or U pickle codes for str8 values (as long as that type exists) or for str values containing only 7-bit ASCII bytes; for all else it should use the unicode pickle codes.

For bytes, I propose that b"ab\xff".__reduce__() return (bytes, ("ab\xff", "latin-1")).
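
A minimal sketch of that proposal, written against a current Python 3: proposed_reduce is a hypothetical stand-in for the suggested bytes.__reduce__() behaviour, not the actual method. The pair it returns is a callable plus its arguments, and calling one with the other is what unpickling would do.

```python
def proposed_reduce(value: bytes) -> tuple:
    """Hypothetical helper returning (callable, args) in the proposed shape."""
    # Represent the bytes as text whose code points equal the byte values.
    return (bytes, (value.decode("latin-1"), "latin-1"))


# Unpickling a reduce value amounts to calling the callable with the arguments.
constructor, args = proposed_reduce(b"ab\xff")
rebuilt = constructor(*args)

# The same holds for every possible byte value.
every_byte = bytes(range(256))
constructor_all, args_all = proposed_reduce(every_byte)
rebuilt_all = constructor_all(*args_all)

print(args, rebuilt, rebuilt_all == every_byte)
```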

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
