Issue 629637: Add a sample selection method to random.py (original) (raw)
Created on 2002-10-28 02:03 by rhettinger, last changed 2022-04-10 16:05 by admin. This issue is now closed.
Messages (33)
Author: Raymond Hettinger (rhettinger) *
Date: 2002-10-28 02:03
random.randset(n, k) returns a k length list of unique integers in the range [0,n).
Improves on a Cookbook submission by using the parameters to select between a shuffle algorithm and a dictionary algorithm.
I want to add this to the library because it is a simple, robust solution to a general selection problem and because it isn't obvious that two different algorithms are needed to balance speed/space trade-offs.
If approved, will add docs and a news item.
Author: Raymond Hettinger (rhettinger) *
Date: 2002-10-28 12:54
Logged In: YES user_id=80475
Added full patch with news item and docs.
Author: Raymond Hettinger (rhettinger) *
Date: 2002-10-29 04:42
Logged In: YES user_id=80475
Renamed to random.sample(n,k) to show that it is used for sampling without replacement.
Author: Raymond Hettinger (rhettinger) *
Date: 2002-10-29 04:42
Logged In: YES user_id=80475
Renamed to random.sample(n,k) to show that it is used for sampling without replacement.
Author: Raymond Hettinger (rhettinger) *
Date: 2002-10-31 07:29
Logged In: YES user_id=80475
Added new version with local variable optimization and with the dictionary results returned in selection order.
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-05 06:33
Logged In: YES user_id=80475
Martin, do you have time to give this patch a second review?
Author: Martin v. Löwis (loewis) *
Date: 2002-11-05 08:27
Logged In: YES user_id=21627
Can you explain why this needs to be in the standard library? I.e. what typical application would use it?
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-05 15:36
Logged In: YES user_id=80475
Like shuffle() and choose(), random sampling without replacement is one of the core principal use cases for random numbers.
Acceptance testing often requires a fixed number of non- overlapping samples i.e. Selecting 60 transactions out of a 1000 and finding zero errors yields a 95% confidence that the population has less than a 5% error rate.
Some simulations also need groups of non-overlapping samples i.e. a lottery result of six unique numbers selected from a range of 1 to 57. An electronic raffle picks consecutive winners without allowing previous winners to be reselected.
While sampling with replacement is trivial to implement with a list comprehension, sampling without replacement has a number of implementation nuances that makes it worthwhile to have a robust solution already implemented in the random library.
Author: Martin v. Löwis (loewis) *
Date: 2002-11-05 17:34
Logged In: YES user_id=21627
Thanks for the explanation. On to the implementation: How did you arrive at the factor of 6 between a dictionary and a list?
The documentation should mention the random optional argument.
Author: Tim Peters (tim.peters) *
Date: 2002-11-05 18:31
Logged In: YES user_id=31435
I agree this is useful, but would rather see Python grow libraries for combinatorial objects. There are many things beyond this that are also useful, For example, the examples you gave here were of selections from collections that aren't range(n), and it would be more useful to more people to have a way to choose k elements from an arbitrary n-element collection directly (like a collection of transactions, or a set of cards, whatever).
Note that I posted a module to Python-Dev not long ago that implements such stuff (CombGen.py), along with other useful functions on combinations.
Note that when k > n/2, "the usual trick" isn't to shuffle a list,
but to generate a complement selection. For example, if you
want a random sample of 9999 out of 10000, it's a lot more
efficient to pick the single element that's not in the result.
See CombGen for code to do this.
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-05 21:38
Logged In: YES user_id=80475
Thanks for the quick follow-ups.
The switchover ratio of six came from counting pointers and longs. Shuffling uses an n length list at one pointer for each element. The dictionary approach has k elements with a hash code, a key pointer, and a value pointer for a total of three multiplied by 1.5 and rounding up to five (because dict loading is kept under 2/3) and one pointer for the 'inorder' return list for a total of six. Also, I liked six to minimize resampling in the dictionary approach (keeping it under 20%).
As requested, I'll add the random argument to the documentation.
Originally, I was going to have sample() select from an arbitrary collection (like choose() does) but, in the end, preferred the current approach of choosing integers. This approach allows sample(1000000,60) without building a giant list first. Also, converting from indices to elements is trivial: [colorlist[i] for i in random.sample(len (colorlist),5)].
I avoided the n/2 complement selection technique because of use case rarity and to allow the sample itself to be in random order (oxymoron?). If you guys think it's necessary, I'll add a complement selection branch followed by a call to random.shuffle(). Still, as it stands, the code is robust, uses space no larger than a k sized dictionary, and runs with no more than 1.2*k calls to random().
I don't know why CombGen.py never made it to Tools/scripts. Even if it does, I think a random sampling function belongs in the random module where people can find it -- it is a very common use of random numbers.
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-07 23:30
Logged In: YES user_id=80475
Assigned to GvR for pronouncement on a) whether he agrees that a sampling function is useful. b) whether to implement it as is or with sequence arguments c) whether to leave it in random or put in another module.
The current form returns a list of integers that can be used directly or as indices into a sequence. The advantages are flexibility in use and the ability to pick a hundred elements out of ten million without building a long list first. The approach is essentially a uniquified list of calls to randrange(). Tim prefers an approach that parallels random.choice() where the call looks like: random.sample([a,b,c,d,e], 2) # picks 2 of the 5 objects
I think the function belongs in the random module since it is a primary use of random numbers (just like shuffle() and choose()). Tim prefers to have a separate library module that has a whole grab bag of combinatorics.
Author: Guido van Rossum (gvanrossum) *
Date: 2002-11-08 00:55
Logged In: YES user_id=6380
I'm not even looking at this, I'm delegating this to Tim. He knows infinitely more about random and permutations than I do, and he's actually used this stuff.
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-08 06:36
Logged In: YES user_id=80475
I'm re-learning to hate the patch process. This was a straight-forward, thoroughy tested, useful patch. Getting it accepted wasn't supposed to be hard.
What is the next step -- Take it as is, convert the n argument to choice() style population list, or withdraw the patch?
Author: Martin v. Löwis (loewis) *
Date: 2002-11-08 06:52
Logged In: YES user_id=21627
Well, I agree that the patch is correct in the sense of doing what it says it does. What I cannot judge is whether the feature is useful; it looks like bloat to me.
I could be convinced if you find a user of this function (or the Cookbook recipe) who says I use it for this and that, and I would prefer to see it in the library for that reason, instead of copying it from the Cookbook.
I have the feeling that anybody who would use such a function would also use ten other "standard" functions which are not included in the library at the moment. So that person would not be helped with getting the single function; he would need an entirely new library of such things.
So I would propose that you withdraw the patch.
Author: Tim Peters (tim.peters) *
Date: 2002-11-08 08:59
Logged In: YES user_id=31435
Well, you're in murky waters because it's a "new feature" patch rather than a bugfix, and wasn't vetted on Python- Dev or c.l.py or via PEP first, nor is it a function in wide use already, neither one that people have asked for in the "small feature requests" PEP. It appeared out of the blue, and "unsolicited"/undiscussed new features are usually hard sells. The alternative is boundless bloat.
Python went for years without random.shuffle(), and that got added because (a) at any given moment, you were likely to find a c.l.py discussion about someone's incorrect Python code for shuffling; and, (b) how to shuffle was a very popular FAQ on the Tutor list. So the demand, and the difficulty of rolling your own, were compellingly clear at the time.
In contrast, people asking how to get a random k- combination are almost conspicuous by absence, which makes the "very common use" claim hard to buy when viewed against the Python community as a whole.. The handful (an exaggeration -- I only wish there were 5 ) who egged me into writing CombGen.py at the time wanted much more than just that, and CombGen tried to meet all expressed desires at the time. I have to agree with Martin that people who would use this also want a lot of related stuff (I'm one of them).
Some of the design decisions here remain unclear. Where CombGen went out of its way to guarantee that combinations are always delivered in "ascending" order, you seem to want to guarantee that they appear in a random order. Why? Especially since you view these as index vectors, ascending order gives the best shot at locality of reference when the user does the indirect indexing bit.
People who intend to use the result as a random starting point into the lexicographic or Gray code ordering of k- combinations also need ascending order.
CombGen never went into the std library because I never made an attempt to put it there: CombGen never attracted a signficant audience, and I'm not keen to push things into the library that, as far as I can tell, only a few people use.
Since that's the std I hold myself to, it's also the std I'm inclined to hold others to.
In the absence of being able to point to potential users from c.l.py threads, let me ask why you wrote it. Did you have an actual app that needed this function (and if so, what was it), or was it more of an interesting programming exercise?
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-08 12:37
Logged In: YES user_id=80475
I use the routine for transaction testing in audit work.
The random order is useful so that subslices of the result are also valid random samples. I run a sample of 60, test the first 25, if an error is found, the sample expands to 60, and if more errors are found, the transaction set does not pass the audit.
The cookbook poster also needed the routine in his work and wanted it badly enough to make an excrutiating tranlation from old Fortran code from a textbook. To save bungled re-inventions of the wheel, I crafted a cleaner solution than either my quick and dirty or his translated version.
Author: Guido van Rossum (gvanrossum) *
Date: 2002-11-08 13:19
Logged In: YES user_id=6380
Still, the question remains, why are all these functions so disconnected in their interface. Why does shuffle() take an optional random() function as argument? Why doesn't sample() take a list from which it returns a sample? Why isn't sample() a generator? Etc. These aren't necessarily good questions, but without trying to use these functions, I can't tell. The APIs look pretty random. Maybe the random() module is destined to be a random collection of useful statistical hacks? It already looks like that to me now. If that's the case, I'm not against adding some more, but I wish that Raymond would look at Tim's code and suggestions (e.g. complement selection for k > n/2). It does seem to me that a random sample falls in the same category as Tim's "generate all samples" code though, so arguably Raymond's sample() would belong in random.py even if CombGen.py were in the standard library. Also consider that many uses of random() are inspired by education -- for some reason, teachers like to teach programming using the random() function and its derivatives to write simple games (number guessing), visual effects (brownian motion) and more. random.sample() might well fit in that category. Another potential use category could be simple applied statistics, like Raymond's transaction testing. It seems that such things fill some kind of need (otherwise there wouldn't be two cookbook recipes for it).
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-08 14:01
Logged In: YES user_id=80475
FWIW, I did try out the complement selection method for k>n/2 but found that it improved performance in some cases and worsened it in others. More importantly, it interfered with the goal of returning the selections in random order. Select 10 raffle winners, give a grand prize, 2 second prizes, 3 third prizes, and 4 fourth prizes -- the results must be in random order so that the grand prize is not biased by a non-random ordering.
If everyone prefers sample(sequence, k) to sample(n,k), I will be happy to change it.
If Tim wants to send me some code to study, that's cool. I always learn something from reading his code.
Author: Guido van Rossum (gvanrossum) *
Date: 2002-11-08 17:06
Logged In: YES user_id=6380
Tim's code is at
http://mail.python.org/pipermail/python-dev/2002-August/028399.html
If you really need the selection in random order, wouldn't it make more sense to apply shuffle() to the resulting list? (Applying sort() to the list if you don't want it randomized seems backwards.)
I do find returing a list of indices less intuitive than a list of elements.
Author: Tim Peters (tim.peters) *
Date: 2002-11-08 18:20
Logged In: YES user_id=31435
Guido, you may recall that you used combgen in the Mankato project (to generate random, non-overlapping 5(?)- word "fingerprints" from email msgs). There are certainly valid uses for this stuff, and good algorithms aren't easy.
combgen resolved the range(n) vs sequence "dilemma" by providing both, where the former was primarily for speed freaks, and the latter was implemented via has-a of the former. Both are useful, and the former is essential in some cases (e.g., picking 3 out of a billion -- as Raymond says, you can't well materialize an explicit list of a billion elements first). So as a basic building block, range(n) is more useful. OTOH, users often don't see how to build what they want out of basic blocks.
About random vs sorted, Raymond provided a plausible use case. Nobody brought that up when I was doing combgen, but it's another thing different apps may want done differently. Purely from an efficiency view, it's quicker not to guarantee ascending order (combgen sorts under the covers), so in that way Raymond's range(n) gimmick is even more of a speed-freak basic building block than combgen's CombGenBasic class.
It's always a puzzle figuring out where things belong.
combgen didn't start life doing random combinations -- it
started because merely computing the number of k-
combinations (of n things) is a frequent question (how
many poker hands are there? bridge hands?), and an
efficient algorithm for computing that isn't obvious either.
Start from there, and it's soon apparent that there are
many algorithms involving combinations, so much so that if
you're working in this area, a class capturing the concept is
very useful.
Ideally, Python would have a package for combinatorial objects, and modules therein would tackle combinations, permutations, partitions, and possibly basic graph algorithms. combgen was meant to be a start at that, but it ended there too.
So that's a mild dilemma: if we put one of these in, a small but probably growing user base will want "more of the same", and random.py isn't even arguably the right place to put any of the rest.
As to how straightforward even this is, I expect this is the only patch in Python history to have 10 versions attached .
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-08 19:23
Logged In: YES user_id=80475
As requested, revised patch to accept a population sequence instead of an index range.
Now that xrange() is fixed (a separate issue), this patch will also serve to choose from large integer sequences without building the whole sequence first: sample(xrange (10000000), 60).
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-08 19:43
Logged In: YES user_id=80475
P.S. The code continues to use the index list internally. This leaves the original pool unmolested and allows the use of xrange(n) as an argument.
By not using the population elements as dictionary keys, no assumptions need to be made about the uniqueness of the population list. A weighted population is valid: sample('red red red blue blue'.split(), 3)
Author: Tim Peters (tim.peters) *
Date: 2002-11-08 20:05
Logged In: YES user_id=31435
I'd rather you went back to the original scheme -- as a "speed-freak basic building block", sticking to implicit range(n) was clear, and nobody who wants that behavior is going to guess that passing xrange(n) might work in the new scheme.
If random order is a promise of this method, than that must be documented. As is, the docs are silent about order, so any order meets the spec. If it's important that it be random, then the docs have to constrain implementations; if it's not important, you can't use it as an argument .
The return type isn't documented and should be, esp. if you want to stick to the new scheme. That it always returns a list will be surprising (if I pass, e.g., a string, I expect a string of length k to come back; or if a tuple, a tuple of length k, etc. -- this became clear from combgen's users, and is another reason sticking to the basic building block function is better -- we put this in, and next thing is a feature request to return a sequence of the same type as the input).
Comments about use case subtleties, and algorithm obscurities, belong in the docs and in code comments more than in patch comments.
You surely don't want to hear this next one , but the patch appears to be missing test cases.
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-08 22:21
Logged In: YES user_id=80475
Done. Added revised patches for sample(population,k) and for sample(n,k). Take your pick.
FYI, to interpret the generator test, the expected standard deviation for a uniform distribution is sqrt(((n**2)-1) / 12).
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-09 03:02
Logged In: YES user_id=80475
Neatened-up the patch for random.sample(population,k). Sped the tests, eliminated the final map, and clarified the docs. Using xrange(n) as an argument is shown in both the docs and docstring so that people won't have to be clever or original.
I think this one is ready for prime time and would be a happy fellow if it got blessed.
Author: Guido van Rossum (gvanrossum) *
Date: 2002-11-09 03:57
Logged In: YES user_id=6380
But I thought Tim recommends sample(n, k)?
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-09 05:49
Logged In: YES user_id=80475
Tim, which do you prefer?
Rand6.diff is on the lauch pad, ready to go.
random.sample(population,k) is now as lean and mean as
sample(n,k); the xrange() idiom is thoroughly documented
and tested; and the sample(population,k) approach is now
my favorite.
Still, rand5.diff is also ready to go. It documents how to convert from indices to elements.
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-10 02:48
Logged In: YES user_id=80475
Can I commit rand6.diff and be done with this one?
At one time, sample(n,k) looked better because the code was simpler, faster, and the use of xrange(n) in sample (population,k) wasn't obvious.
As of rand6.diff, sample(population,k) is equally fast and simple. The use of xrange(n) is thoroughly documented and has no performance penalty.
It's now faster and easier to express sample(n,k) in terms of sample(population,k) than vice-versa. Also, sample (population,k) has the friendlier interface. So, it is the one I recommend.
Author: Guido van Rossum (gvanrossum) *
Date: 2002-11-11 14:39
Logged In: YES user_id=6380
Tim, are you still hesitant? I think this is fine.
Author: Tim Peters (tim.peters) *
Date: 2002-11-11 15:37
Logged In: YES user_id=31435
Sorry, I have a lot on my plate, and this one overshot its budget by an hour already. I'll get back to it later today.
Author: Tim Peters (tim.peters) *
Date: 2002-11-11 21:35
Logged In: YES user_id=31435
Accepted, and back to Raymond <whew!>. rand6,diff it is.
Comments:
I doubt the xrange trick will work in Python 3, but time enough for that in coming years.
It's awfully obscure why "return pool[-k:]" can't do a wrong thing when k is 0.
Vertical space isn't at a premium here -- no need to squash if and if-controlled code onto the same line.
The test code will never be run (nobody runs random.py). That's not your fault. We should think about making a real test out of random.py's quarter-attempt at testing itself.
It looks to be a pleasant new facility. Good job!
Author: Raymond Hettinger (rhettinger) *
Date: 2002-11-12 17:52
Logged In: YES user_id=80475
Committed as random.py 1.36 and librandom.tex 1.31.
Thanks Tim, Martin, and GvR for the reviews.
Tim' comments:
If xrange disappears in Py3, objects defining getitem will live on.
When k is zero, the block with "return pool[:k]" is never executed; the dictionary block runs instead.
Expanded single line ifs into multi-line
The tests in random.py get run only when we are maintaining the module. When I put in the Mersenne Twister, will make separate unittests that get run all the time and that test the module thoroughly.