pickle4 midterm report

Abstract

===============================================================================

As described in my proposal, my work up to the midterm involved improving the
Python version of pickle with the suggestions in PEP 3154 [1]. The progress so
far has been tagged in the pickle4 repository at [2], under the tag
pickle4-midterm.

Features

===============================================================================

Here is a list of the improvements that were made:

  1. pickling of very large bytes and strings
     - not yet tested; this needs a machine with a lot of memory or a smarter
       way to test, such as a NullStream that simply drops everything written
       to it except for the first few bytes, together with an object that
       behaves like a very large bytes object but doesn't actually store the
       data (see the sketch after this list).

  2. more space-efficient pickling of small strings and bytes.
     Tested by: test_v4_efficient_unicode, test_v4_efficient_binbytes

  3. native pickling of sets and frozensets, with support for
     self-referential sets and frozensets.
     Tested by: test_v4_sets, test_v4_set_refs, test_v4_sets_opcodes,
     test_v4_cycles

  4. removed the BINPUT opcode (the pickler and unpickler now agree to use
     the same indexing technique and agree on which opcodes memoize
     something).
     Tested by pretty much all of the tests, because unpickling would fail
     if the strategy were not consistent between the pickler and the
     unpickler.

  5. better error checking in v4, with more consistent exceptions (typically
     only UnpicklingError, EOFError and MemoryError should be raised by the
     unpickler under sane conditions).
     Tested by: test_v4_exceptions

  6. removed GLOBAL in v4 in favor of binary equivalents, with the addition
     of a BINGLOBAL_COMMON opcode (which has a list of commonly-used module
     names).
     Tested by: test_v4_binglobal_common, test_v4_binglobal_big,
     test_v4_nested_classes

  7. pickling of nested classes, functions in classes and bound methods.
     Tested by: test_v4_nested_classes, test_v4_pickle_bound_methods

  8. pickling/unpickling calls of __new__ with keyword args (see the sketch
     after the list of new opcodes below).
     Tested by: test_v4_new_kwargs

  9. when pickling to a stream where data can't be "taken back" once it has
     been written, the BAIL_OUT opcode has been added for situations in which
     the pickler fails "late" and wants to inform an unpickler reading from
     the same stream that all the data it has unpickled so far is garbage and
     should be abandoned.
     Tested by: test_v4_bail_out
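
A possible way to exercise item 1 without a machine with lots of memory is
the NullStream idea mentioned above. Below is a minimal sketch of such a
helper; the name NullWriter and its interface are illustrative only, not
part of the repository:

    class NullWriter:
        """Write-only stream: counts the bytes written to it and keeps only
        the first `keep` bytes for inspection, so that writing out a
        multi-gigabyte pickle doesn't actually consume memory."""

        def __init__(self, keep=64):
            self.keep = keep          # how many leading bytes to retain
            self.head = bytearray()   # retained prefix (opcodes, length words)
            self.size = 0             # total number of bytes written

        def write(self, data):
            if len(self.head) < self.keep:
                self.head += data[:self.keep - len(self.head)]
            self.size += len(data)
            return len(data)

A test could then pass a NullWriter to Pickler, dump an object that pretends
to be a huge bytes instance, and check only the retained prefix (opcode plus
8-byte length) and the total number of bytes written.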

New opcodes: SEMISHORT_BINUNICODE, SHORT_BINUNICODE, BINUNICODE64,

SEMISHORT_BINBYTES, BINBYTES64, BINGLOBAL, BINGLOBAL_BIG, BINGLOBAL_COMMON,

EMPTY_SET, EMPTY_FROZENSET, UPDATE_SET, FROZENSET, BIND, NEWOBJ_KW,

BAIL_OUT
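
Item 8 builds on the __getnewargs_ex__ hook proposed in PEP 3154, which lets
a class hand the pickler an (args, kwargs) pair to be passed to __new__ when
unpickling (emitted here via the new NEWOBJ_KW opcode). The following is a
minimal sketch of the kind of class this enables; the Point class is only an
illustration, not part of the test suite:

    import pickle

    class Point:
        # __new__ takes keyword-only arguments, so the positional-only
        # NEWOBJ path cannot recreate instances of this class.
        def __new__(cls, *, x=0, y=0):
            self = super().__new__(cls)
            self.x, self.y = x, y
            return self

        def __getnewargs_ex__(self):
            # (args, kwargs) to pass to __new__ on unpickling
            return (), {'x': self.x, 'y': self.y}

    p = pickle.loads(pickle.dumps(Point(x=1, y=2), 4))
    assert (p.x, p.y) == (1, 2)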

Example: small strings and no binput

===============================================================================

>>> l=[1, 2, 3, b'abc', 'abc']

>>> dis(dumps(l,4))

0: \x80 PROTO 4

2: ] EMPTY_LIST

3: ( MARK

4: K BININT1 1

6: K BININT1 2

8: K BININT1 3

10: C SHORT_BINBYTES 'abc'

15: \x8e SHORT_BINUNICODE 'abc'

20: e APPENDS (MARK at 3)

21: . STOP

highest protocol among opcodes = 4

>>> len(dumps(l,4)), len(dumps(l,3))

(22, 31)

Example: sets and frozensets

===============================================================================

>>> fs=frozenset([1,2,3]); s=set([1,2,fs,3])

>>> len(dumps(s,4)), len(dumps(s,3))

(20, 75)

>>> dis(dumps(s,4))

0: \x80 PROTO 4

2: \x94 EMPTY_SET

3: ( MARK

4: K BININT1 1

6: K BININT1 2

8: K BININT1 3

10: ( MARK

11: K BININT1 1

13: K BININT1 2

15: K BININT1 3

17: \x97 FROZENSET (MARK at 10)

18: \x96 UPDATE_SET (MARK at 3)

19: . STOP

highest protocol among opcodes = 4

>>> dis(dumps(s,3))

0: \x80 PROTO 3

2: c GLOBAL 'builtins set'

16: q BINPUT 0

18: ] EMPTY_LIST

19: q BINPUT 1

21: ( MARK

22: K BININT1 1

24: K BININT1 2

26: K BININT1 3

28: c GLOBAL 'builtins frozenset'

48: q BINPUT 2

50: ] EMPTY_LIST

51: q BINPUT 3

53: ( MARK

54: K BININT1 1

56: K BININT1 2

58: K BININT1 3

60: e APPENDS (MARK at 53)

61: \x85 TUPLE1

62: q BINPUT 4

64: R REDUCE

65: q BINPUT 5

67: e APPENDS (MARK at 21)

68: \x85 TUPLE1

69: q BINPUT 6

71: R REDUCE

72: q BINPUT 7

74: . STOP

highest protocol among opcodes = 2

>>> a=A(); s=set([1,2,a]); a.s=s

>>> dumps(s,4) # self-referential sets

b'\x80\x04\x94(K\x01K\x02\x93\x00\x01A)\x81}\x8e\x01sh\x00sb\x96.'

>>> dis(dumps(s,4))

0: \x80 PROTO 4

2: \x94 EMPTY_SET

3: ( MARK

4: K BININT1 1

6: K BININT1 2

8: \x93 BINGLOBAL_COMMON '0 A'

12: ) EMPTY_TUPLE

13: \x81 NEWOBJ

14: } EMPTY_DICT

15: \x8e SHORT_BINUNICODE 's'

18: h BINGET 0

20: s SETITEM

21: b BUILD

22: \x96 UPDATE_SET (MARK at 3)

23: . STOP

highest protocol among opcodes = 4

>>> a=A(); fs=frozenset([1,2,a]); a.fs=fs;

>>> dis(dumps(fs,4)) # self-referential frozensets

0: \x80 PROTO 4

2: ( MARK

3: K BININT1 1

5: K BININT1 2

7: \x93 BINGLOBAL_COMMON '0 A'

11: ) EMPTY_TUPLE

12: \x81 NEWOBJ

13: } EMPTY_DICT

14: \x8e SHORT_BINUNICODE 'fs'

18: ( MARK

19: K BININT1 1

21: K BININT1 2

23: h BINGET 1

25: \x97 FROZENSET (MARK at 18)

26: s SETITEM

27: b BUILD

28: 1 POP_MARK (MARK at 2)

29: h BINGET 4

31: . STOP

highest protocol among opcodes = 4

>>>

Example: Nested globals

===============================================================================

>>> class A:
...     class B:
...         def f(self):
...             return self.x
...

>>> dis(dumps(A.B.f,4))

0: \x80 PROTO 4

2: \x93 BINGLOBAL_COMMON '0 A.B.f'

10: . STOP

>>> b=A.B(); b.x=42

>>> loads(dumps(A.B.f,4))(b)

42

>>> dis(dumps(b.f,4))

0: \x80 PROTO 4

2: \x93 BINGLOBAL_COMMON '0 A.B'

8: ) EMPTY_TUPLE

9: \x81 NEWOBJ

10: } EMPTY_DICT

11: \x8e SHORT_BINUNICODE 'x'

14: K BININT1 42

16: s SETITEM

17: b BUILD

18: \x93 BINGLOBAL_COMMON '0 A.B.f'

26: \x98 BIND

27: . STOP

highest protocol among opcodes = 4

>>> loads(dumps(b.f,4))()

42

Work on the second part

===============================================================================

* Write patches for the discovered bugs and create an issue on the bug
  tracker.
* Implement in _pickle (the C version) the opcodes that have already been
  implemented for the Python version, until all the tests pass.
* Finally, documentation and testing.

Conclusion

===============================================================================

I am on track with the work. Because the code is pretty well tested and

pickletools has been kept up to date accordingly, true test-driven development

can now be achieved for _pickle.

Here is a list of some other ideas that could be implemented before pickle4 is

released:

* Unicode compression of strings for more efficient pickling/unpickling
  (possibly with an option to enable/disable this feature so the user can
  choose a space-time tradeoff)

* Zigzag encoding of signed integers (see the sketch after this list)

* Implementation of SecurePickler/SecureUnpickler classes, which would allow
  safe unpickling of data coming from untrusted streams (such as the network).
  Arbitrary code execution is currently easy to achieve:

>>> from test.pickletester import AbstractPickleTests

>>> apt=AbstractPickleTests()

>>> evil=apt._make_pickle_str('{PROTO} \x04'

... '{BINGLOBAL} \x08builtins\x04eval'

... '{SHORT_BINUNICODE} \x0e print("evil!")'

... '{TUPLE1} {REDUCE} {STOP}')

>>> loads(evil)

evil!

A more secure unpickler can currently be created by deriving from Unpickler
and overriding the find_class method, thus limiting which globals can be
looked up and which objects can be created (a minimal sketch of this approach
follows the list below). But this has several disadvantages and limitations.
A SecurePickler would have features such as the following:

o Only a Python implementation, because security is more important than
  speed here, and it's easier to screw up in C than in Python.

o Either never raise an exception in unpickling (such as by providing a

default value to the unpickler in case unpickling fails), or raise

only a known, small set of exceptions by using exception chaining

(hiding exceptions is bad).

o Disable obsolete opcodes (opcodes which have been replaced by newer

ones) to decrease the chance of finding security weaknesses.

o Have security parameters such as memo_maxsize, stack_maxsize,
  recurse_maxdepth and allow_globals, limiting the resources that the
  unpickler can use:
      memo_maxsize     => don't memoize more objects than this value indicates
      stack_maxsize    => bail out if the unpickler's stack goes above this
                          size
      recurse_maxdepth => don't recurse above this depth when unpickling
      allow_globals    => whether a limited subset of globals should be
                          allowed

o Have a few "profiles", where each profile is a dictionary with sane
  values for the security parameters, so the developer doesn't have to
  worry too much about what would be a good value for memo_maxsize. E.g.:
      basic_datatypes_only = {'memo_maxsize'     : 100000,
                              'stack_maxsize'    : 100000,
                              'recurse_maxdepth' : 10,
                              'allow_globals'    : False}

o Take an allowed_globals security parameter, forbidding the use of any
  globals not listed in it.

o Maybe group certain opcodes (such as SHORT_BINBYTES, SEMISHORT_BINBYTES,
  BINBYTES and BINBYTES64 in a bytes group) so the developer can choose
  which groups of opcodes to allow.

Clearly, SecurePickler would only work with pickle>=4.
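
For reference, the find_class-override workaround mentioned above looks
roughly like the sketch below (the whitelist and the helper names are
illustrative only, not part of the repository):

    import builtins
    import io
    import pickle

    SAFE_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}

    class RestrictedUnpickler(pickle.Unpickler):
        def find_class(self, module, name):
            # Refuse every global except a small whitelist of builtins.
            if module == 'builtins' and name in SAFE_BUILTINS:
                return getattr(builtins, name)
            raise pickle.UnpicklingError(
                "global '%s.%s' is forbidden" % (module, name))

    def restricted_loads(data):
        return RestrictedUnpickler(io.BytesIO(data)).load()

This blocks the eval-based payload shown earlier, but it cannot, for example,
bound the memory or stack space the unpickler uses, which is what parameters
such as memo_maxsize and stack_maxsize aim to address.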
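
The zigzag encoding mentioned in the ideas list maps signed integers to
unsigned ones so that values of small magnitude get short encodings. A
minimal sketch of the standard mapping, for illustration only (not taken
from the repository):

    def zigzag_encode(n):
        # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
        return (n << 1) if n >= 0 else ((-n << 1) - 1)

    def zigzag_decode(z):
        return (z >> 1) if (z & 1) == 0 else -((z + 1) >> 1)

    assert all(zigzag_decode(zigzag_encode(n)) == n
               for n in range(-1000, 1000))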

I doubt I can also implement the ideas above within the given timeline, but I
believe they should be considered before releasing pickle4, because adding
them later won't be possible without creating incompatibilities (with the
possible exception of SecureUnpickler, though problems could arise there too
if implementing it requires changes in Unpickler to allow a greater degree of
control).

[1] http://www.python.org/dev/peps/pep-3154/

[2] https://bitbucket.org/mstefanro/pickle4/