[Python-Dev] When do sets shrink?

Noam Raphael noamraph at gmail.com
Sun Jan 1 00:08:26 CET 2006


On 12/31/05, Raymond Hettinger <raymond.hettinger at verizon.net> wrote:

> [Noam] > For example, iteration over a set which once had 1,000,000 members and now has 2 can take 1,000,000 operations every time you traverse all the (2) elements.

> Do you find that to be a common or plausible use case?

I don't have a concrete example at the moment, but a set which is repeatedly filled with elements and then emptied by pop operations doesn't seem that far-fetched to me.
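To make the scenario concrete, here is a rough interactive sketch of the pattern I mean (time_iteration is just a helper made up for the example, and the timing method is plain time.time(), so the exact numbers will depend on the machine):

>>> import time
>>> s = set(xrange(1000000))       # the set's hash table grows to hold a million entries
>>> while len(s) > 2:              # pop almost everything back out
...     s.pop()
...
>>> def time_iteration(s):
...     start = time.time()
...     for x in s:                # only 2 elements are left...
...         pass
...     return time.time() - start
...
>>> time_iteration(s)              # ...but the whole stale table still gets scanned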

> Was Guido's suggestion of s=set(s) unworkable for some reason? dicts and sets emphasize fast lookups over fast iteration -- apps requiring many iterations over a collection may be better off converting to a list (which has no dummy entries or empty gaps between entries).

It's workable, but I think that most Python programmers haven't read that specific message; they simply expect operations that should be cheap to actually be cheap. Converting to a list won't help the use case above, and anyway it's another workaround I wouldn't expect anyone to reach for - there's no reason that iteration over a two-element set should take a long time.

(I'm speaking from my own point of view, which I believe is a common one. I don't expect programs I write in Python to be super-fast - if that were the requirement, I would write them in C. I do expect them to take a reasonable amount of time, and in the case of iteration over a set, that means a time proportional to the number of elements in the set.)
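For completeness, this is roughly what the suggested workaround looks like in the same scenario (reusing the time_iteration helper from the sketch above); s = set(s) builds a fresh set whose table is sized for what it currently holds:

>>> s = set(xrange(1000000))
>>> while len(s) > 2:
...     s.pop()
...
>>> s = set(s)                     # rebuild: the new table fits the 2 survivors
>>> time_iteration(s)              # now proportional to len(s), not to the old size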

> Would the case be improved by incurring the time cost of 999,998 tests for possible resizing (one for each pop) and some non-trivial number of resize operations along the way (each requiring a full-iteration over the then current size)?

I believe it would. It seems to me that those 999,998 tests take not much more than a machine clock cycle each, which comes to about one millisecond on today's computers. The resize operations will take a few more milliseconds. None of this really matters, since everything else the program does will probably take much longer. I just ran this code:

>>> s = set()
>>> for j in xrange(1000000):
...     s.add(j)
...
>>> while s:
...     tmp = s.pop()
...

It took 2.4 seconds to finish, and that's okay - I'm just saying that a few extra clock ticks per operation usually don't matter when the overall complexity stays the same, whereas a change in the order of complexity can matter much more.

> Even if this unique case could be improved, what is the impact on common cases? Would a downsizing scheme risk thrashing with the over-allocation scheme in cases with mixed adds and pops?

I think there shouldn't be any additional damage beyond those clock ticks. The simple method I took from "Introduction to Algorithms" works no matter what sequence of adds and pops you have.
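Roughly, the textbook rule is: grow when the table gets too full, and shrink by half only when it falls below a quarter full. Here is a toy sketch of that rule (the class, its names, and the exact 2/3 and 1/4 thresholds are only for illustration - this is not how setobject.c is written, and a real patch would of course resize the actual hash table in place):

# Toy model of a CLRS-style downsizing rule, not CPython's implementation.
# The nominal "table size" is tracked separately and a shrink is simulated
# by rebuilding the underlying set.

class ShrinkingSet(object):
    MIN_TABLE = 8                             # never shrink below this size

    def __init__(self):
        self.table = self.MIN_TABLE           # nominal number of hash slots
        self.items = set()

    def add(self, item):
        self.items.add(item)
        while len(self.items) * 3 >= self.table * 2:    # 2/3 full or more: grow
            self.table *= 2

    def pop(self):
        item = self.items.pop()
        if (self.table > self.MIN_TABLE and
                len(self.items) * 4 < self.table):      # under 1/4 full: shrink
            self.table //= 2
            self.items = set(self.items)      # simulate the O(table) rebuild
        return item

    def __len__(self):
        return len(self.items)

    def __iter__(self):
        return iter(self.items)

The reason this can't thrash against the over-allocation scheme is that a resize in either direction leaves the fill ratio well away from both thresholds, so a linear number of cheap adds or pops has to happen before the next resize is possible; that keeps the resizes amortized O(1) no matter how adds and pops are interleaved.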

> Is there any new information/research beyond what has been obvious from the moment the dict resizing scheme was born?

I wanted to say that there isn't any new information, and yet I don't think I have to assume that everything in current Python is the best it can be. All I did was find another reason why a downsizing scheme might be good, and post it to ask whether people have thought about it. If you have a document listing all the design decisions that went into the dict implementation, please send it to me and I won't ask about things that were already considered.

But the answer is, yes. I believe that the current dict resizing scheme was born before the iterator protocol was introduced, and that may be a reason why the current scheme doesn't try to minimize the number of empty hash table entries.

Noam


