Issue 4615: de-duping function in itertools (original) (raw)

Created on 2008-12-10 02:47 by thomaspinckney3, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg77477 - (view) Author: Tom Pinckney (thomaspinckney3) * Date: 2008-12-10 02:47
Any interest in an itertools de-duping function? I find I have to write this over and over for different projects: def deduped(iter,key=None): keys = set() for x in iter: if key: k = key(x) else: k = x if k in keys: continue keys.add(k) yield(x)
msg77481 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-12-10 03:06
Am curious about your use cases. In what contexts are you needing to eliminate duplicates but retain the original ordering? Am wondering if the proposals for an ordered dictionary will meet your needs? Also, am looking to see what typifies the balance of needs your use cases -- an iterator so large that you wouldn't want to loop over it twice but a set of unique items that doesn't grow especially large (see the code random.sample for a similar situation). Another thought is that the proposal for a Bag class (multiset) may solve the problem more generally.
msg77482 - (view) Author: Tom Pinckney (thomaspinckney3) * Date: 2008-12-10 03:47
My latest need for something like this was something like this: src1 = db_query(query_1) src2 = db_query(query_2) results = deduped(src1 + src2, key=lambda x: x.field2) Basically, I wanted data from src1 if it existed and otherwise from src2 , while preserving the order of src1 (I didn't care about order of src2). A previous example was reading from a file and wanting to de-dupe lines based on a field in each line. Again order mattered to me since I wanted to process the non-duped lines in the file in order. A final example was generating a bunch of error messages from a variety of sources and then wanting to make sure there were no duplicate errors. Instead of: errors = set(errors) I find this much clearer: errors = deduped(errors) In reality all of these examples probably do not need to be written as a generator. The lists being de-duped are probably not so huge in practice as to preclude instantiating a new list (given the reality of multi-gig RAM machines etc). It just seemed particularly clear to write this using a yield. An ordered dictionary would probably work for me too. I don't think a Bag would given it's lack of ordering. I do find it very simple to just be able to apply deduped() to any existing sequence/iterator and not have to be more verbose about explicitly iterating and filling in an ordered dictionary somehow.
msg78015 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-12-18 08:38
My inclination is to not include this as a basic C coded itertool because it holds potentially all of the data in memory (generally, not a characteristic of an itertool) and because I don't see it as a basic building block (itertools are intended to be elemental, composable components of an iterator algebra). Also, the pure python equivalent of dedup() is both easy to write and runs efficiently (it gains little from being recoded in C). Instead, I'm think of adding two recipes to the itertools docs: def unique_everseen(iterable, key=None): # unique_everseen('AAAABBBCCDAABBB') --> A B C D # unique_everseen('ABBCcAD', str.lower) --> A B C D seen = set() seen_add = seen.add if key is None: for elem in iterable: if elem not in seen: seen_add(elem) yield elem else: for elem in iterable: k = key(elem) if k not in seen: seen_add(k) yield elem def unique_lastseen(iterable, key=None): # unique_lastseen('AAAABBBCCDAABBB') --> A B C D A B # unique_lastseen('ABBCcAD', str.lower) --> A B C A D return imap(next, imap(itemgetter(1), groupby(iterable, key)))
msg78029 - (view) Author: David W. Lambert (LambertDW) Date: 2008-12-18 15:44
(but of course with imap in version 2.7 and with something else in version 3.x)
msg78878 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-01-02 21:21
Added recipes to the itertools docs: r68177 .
History
Date User Action Args
2022-04-11 14:56:42 admin set github: 48865
2009-01-02 21:21:32 rhettinger set status: open -> closedresolution: acceptedmessages: +
2008-12-18 15:44:00 LambertDW set nosy: + LambertDWmessages: +
2008-12-18 08:38:20 rhettinger set messages: +
2008-12-10 03:47:14 thomaspinckney3 set messages: +
2008-12-10 03:06:15 rhettinger set messages: + versions: + Python 3.1, Python 2.7, - Python 2.6
2008-12-10 03:00:13 benjamin.peterson set assignee: rhettingernosy: + rhettinger
2008-12-10 02:47:59 thomaspinckney3 create