(original) (raw)

On Jan 18, 2012 12:55 PM, Martin v. L�wis <martin@v.loewis.de> wrote:
>
> Am 18.01.2012 17:01, schrieb PJ Eby:
> > On Tue, Jan 17, 2012 at 7:58 PM, "Martin v. L�wis" <martin@v.loewis.de
> > <mailto:martin@v.loewis.de>> wrote:
> >
> > � � Am 17.01.2012 22:26, schrieb Antoine Pitrou:
> > � � > Only 2 bits are used in ob\_sstate, meaning 30 are left. These 30 bits
> > � � > could cache a "hash perturbation" computed from the string and the
> > � � > random bits:
> > � � >
> > � � > - hash() would use ob\_shash
> > � � > - dict\_lookup() would use ((ob\_shash \* 1000003) ^ (ob\_sstate & \~3))
> > � � >
> > � � > This way, you cache almost all computations, adding only a computation
> > � � > and a couple logical ops when looking up a string in a dict.
> >
> > � � That's a good idea. For Unicode, it might be best to add another slot
> > � � into the object, even though this increases the object size.
> >
> >
> > Wouldn't that break the ABI in 2.x?
>
> I was thinking about adding the field at the end, so I thought it
> shouldn't. However, if somebody inherits from PyUnicodeObject, it still
> might - so my new proposal is to add the extra hash into the str block,
> either at str\[-1\], or after the terminating 0\. This would cause an
> average increase of four bytes of the storage (0 bytes in 50% of the
> cases, 8 bytes because of padding in the other 50%).
>
> What do you think?

So far it sounds like the very best solution of all, as far as backward compatibility is concerned.� If the extra bits are only used when two strings have a matching hash value, the only doctests that could be affected are ones testing for this issue.� ;-)