[Python-Dev] Encoding variable-length integers/counts in pickle (original) (raw)

MRAB python at mrabarnett.plus.com
Mon Jul 9 21:53:47 EDT 2018


In the regex module I use an encoding scheme when pickling pattern objects which is based on the way MIDI encodes variable-length integers, and I think it might have a use in a pickle protocol.

In the basic format, an integer is split up into 7-bit chunks, each chunk put into a byte, and the most-significant bit of the byte used to signal whether the value continues into the following byte.

And integer must be encoded into the minimum number of bytes, so an encoded sequence of bytes would never start with 0x80.

MIDI deviates from the basic idea in order to reduce the number of bytes, so as sequence of bytes in MIDI can start with x080; this is fine for MIDI because it doesn't need to represent negative integers.

The format I'm suggesting uses an initial 0x80 as a way of letting it encode negative integers.

Here are a couple of Python functions that summarise the encoding and decoding (minus obvious optimisations for simplicity):

def encode_varint(value: int) -> List[int]: negative = value < 0 encoded = []

 if negative:
     final = -1
 else:
     final = 0

 while True:
     encoded.append(0x80 | (value & 0x7F))
     value >>= 7

     if value == final:
         break

 if negative:
     encoded.append(0x80)

 encoded.reverse()

 encoded[-1] &= 0x7F

 return encoded

def decode_varint(encoded: Iterable[int]) -> int: byte = next(encoded)

 if byte == 0x80:
     value = -1
     byte = next(encoded)
 else:
     value = 0

 value = (value << 7) | (byte & 0x7F)

 while (byte & 0x80) != 0:
     byte = next(encoded)
     value = (value << 7) | (byte & 0x7F)

 return value

The advantage of encoding integers in this way is that there's no limit to their size, so there's no need to add a new protocol to support larger counts.

They can also make pickles smaller.

Example:

 # Pickle (None, )
 0: \x80 PROTO      4
 2: \x95 FRAME      4
11: N    NONE
12: \x85 TUPLE1
13: \x94 MEMOIZE    (as 0)
14: .    STOP

Here, FRAME takes an argument of 8 bytes. If you replaced FRAME with a version that accepted a variable-length count, you could reduce that argument to 1 byte.

You also wouldn't need to have different fixed-length versions of an opcode, e.g. BINUNICODE and SHORT_BINUNICODE.

Whether you do anything with this is entirely up to the core devs, I just thought someone might find it useful.



More information about the Python-Dev mailing list