Message 85933 - Python tracker (original) (raw)

On Apr 11, 2009, at 8:15 AM, Martin v. Löwis
<report@bugs.python.org>@psf.upfronthosting.co.za
@psf.upfronthosting.co.za> wrote:

Martin v. Löwis <martin@v.loewis.de> added the comment:

By the way, defaultdict is NOT like setdefault--it is like get(). Missing entries do no get set.

Why do you say that?

missing(...) missing(key) # Called by getitem for missing key;
pseudo-code: if self.default_factory is None: raise KeyError((key,)) self[key] = value = self.default_factory() return value

In all cases of setdefault that I know of, replacing this with a defaultdict would be appropriate. The only case where it wouldn't work is if the default value depends on the key. The missing key reports the default object, but doesn't set that key
or value object in the dict. So you cannot then call up that object and do
something that depends on the key. So the default value may not depend on the key but still need to be different for each key due to later changes.

For example, to represent a graph structure, it is helpful to use a
dict-of-dicts. The first time a node is looked up you want to add it as a key in the
graph dict with an empty "nbr dict" as the value. the "nbr dict" is keyed by
neighbor to the edge weight/object. So the default object is always a fresh nbr dict...
but what gets added to that nbr dict depends on which node it is. I can come up
with examples for lists too... But as you pointed out before, using a list or
dict in setdefault requires creating the object before the call anyway... I'm beginning to question whether setdefault should ever be used... Still comes back
to--why have it there inefficiently. Besides I'm still interested in being able to do this sort of thing at least through the C-API...

Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:

from my perspective creating an internal SetItem adds another function handling the data structure just as setdefault would Incorrect comparison. Your in-lining manipulated the ep structure directly (not a good thing). In contrast, adding an alternative _PyDict_SetItemWithHash uses the insertdict() function, fully
isolating itself of from the table implementation. The whole purpose of having insertdict() and lookdict() is to isolate the data structure internals from all externally visible functions.

Good... I am understanding better what you mean by "handling the data
structure". I agree that lookdict, insertdict and resizedict should be the three
that change the data structure. But the way those 3 are currently written, you can't
change an entry without looking it up. insertdict_clean almost does it, but it
assumes there aren't any dummy entries.

The double call to a very simple user defined hash adds .07 to
call that takes on .05 with the double call to builtin object hash. So, we could save half of the the .07 just by elimating the double call to hash. With more complex hash functions the overall speedup is
huge (essentially cutting the total work almost in half).

This is a good example because it shows how the hash can matter.

I think there are similar examples where the lookup is most of the
time of dict access. I don't know enough about what causes collisions and how dicts are
optimized to quickly come up with such an example. But such an example must
exist or Tim Peters wouldn't have spent so much effort optimizing lookup.... besides his comments at the top of lookup suggest that we do exactly
what I am suggesting to do. Get the entry from lookup and then "the caller can (if it wishes) add the <key, value> pair to the
returned PyDictEntry*.

Presently there is no way to do this with C-API except to write my
own data structure manipulation. Wouldn't it be better to encapsulate this in the C-API
in a standard way instead of having everybody writing their own C-extension doing
setdefault the right way? (ok..ok.. "everybody" here refers to probably one person (me)... but
others have thought of it. I've even seen the double lookup as a justification for never
using setdefault)

I'll look for an example that has lots of collisions. Maybe that
would help justify a lookup-less insertdict. Dan