[Python-ideas] csv.DictReader could handle headers more intelligently.

J. Cliff Dyer jcd at sdf.lonestar.org
Wed Jan 23 18:37:01 CET 2013


On Wed, 2013-01-23 at 18:08 +0100, Amaury Forgeot d'Arc wrote:

Hi,

2013/1/23 J. Cliff Dyer <jcd at sdf.lonestar.org>

> On Tue, 2013-01-22 at 17:51 -0800, alex23 wrote:
> > I don't think we should start adding support for every malformed type
> > of csv file that exists. It's easy enough to remove the unnecessary
> > lines yourself before passing them to DictReader:
> >
> >     from csv import DictReader
> >
> >     with open('malformed.csv','rb') as csvfile:
> >         csvlines = list(l for l in csvfile if l.strip())
> >         csvreader = DictReader(csvlines)
> >
> > Personally, if I was dealing with this as often as you are, I'd
> > probably make a custom context manager instead. The problem lies in
> > the files themselves, not in csv's response to them.
>
> With all due respect, while you make a good point that we don't want to
> start special casing every malformed type of CSV, there is absolutely
> something wrong with DictReader's response to files that have duplicate
> headers. It throws away data silently.

That's how Python dictionaries work, by design:
d = {'a': 1, 'a': 2} "silently" discards the first value.

> If you (and others on this list) aren't in favor of trying to find the
> right header row (which I can understand: "In the face of ambiguity,
> refuse the temptation to guess."), maybe a better solution would be to
> raise a (suppressible) exception if the headers aren't uniquely named.
> ("Errors should never pass silently. Unless explicitly silenced.")

What about a subclass then:

    class CarefulDictReader(csv.DictReader):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            fieldnames = self.fieldnames
            if len(fieldnames) != len(set(fieldnames)):
                raise ValueError("Duplicate field names", fieldnames)

-- Amaury Forgeot d'Arc
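
To make the data loss under discussion concrete, here is a minimal sketch (Python 3; the sample data and the variable names are made up for illustration) of what DictReader does with a duplicated header:

```python
import csv
import io

# Hypothetical sample data: the header "name" appears twice.
sample = io.StringIO("name,name,age\nAlice,Smith,30\n")

reader = csv.DictReader(sample)
row = next(reader)

# The header list still contains both copies of "name" ...
print(reader.fieldnames)  # ['name', 'name', 'age']

# ... but the row dict keeps only the last value for that key.
# "Alice" is dropped and no warning or exception is raised.
print(dict(row))  # {'name': 'Smith', 'age': '30'}
```

The full header list remains available on `fieldnames`, so a duplicate check can be done there before any rows are consumed.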

Whether this belongs in a subclass or in the existing class is worth discussing. Obviously, the change could be made in a subclass; currently, that's what I do. The question at issue is whether it should be made in the original. My position is that something should change in the standard library, whether that means modifying the code to handle edge cases more robustly or updating the documentation to advise programmers on how to handle files that aren't perfectly formed.

This might include documenting that self.reader is an available attribute (where the programmer could iterate to find the header row they're looking for, if needed, and then assign it to self.fieldnames).
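
As a rough sketch of that approach (Python 3): the example below walks the underlying `reader` attribute until it finds the header row, then assigns that row to `fieldnames`. It relies on `reader` being present on the DictReader instance, as it is in CPython's implementation. The sample data and the rule used to recognize the header (a row containing "name") are assumptions for illustration only.

```python
import csv
import io

# Hypothetical input: a title line and a blank line precede the real header.
sample = io.StringIO(
    "Quarterly report\n"
    "\n"
    "name,age\n"
    "Alice,30\n"
    "Bob,25\n"
)

dict_reader = csv.DictReader(sample)

# Advance the underlying csv.reader until a row looks like the header.
# The "contains 'name'" test is an assumption that fits this sample only.
for candidate in dict_reader.reader:
    if "name" in candidate:
        dict_reader.fieldnames = candidate
        break

# Iteration continues from the row after the header.
rows = [dict(r) for r in dict_reader]
print(rows)
```

Because the header is assigned before the first row is requested, DictReader does not try to read a header of its own and treats the next line as data.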

I do like the idea of assigning the fieldnames variable and then raising the ValueError, so if the user silences the exception, they still have access to the field names found. However, I think the behavior should be overridden on the fieldnames property, so as not to change the semantics of the DictReader.
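
One possible shape for that, as a sketch rather than a proposed patch (Python 3): the check moves from `__init__` into an overridden `fieldnames` property, so constructing the reader still does not touch the file. The `_duplicates_checked` flag is a hypothetical helper attribute, and the sample data is made up.

```python
import csv
import io


class CarefulDictReader(csv.DictReader):
    """Hypothetical subclass: reports duplicate headers on first access."""

    @property
    def fieldnames(self):
        # The parent getter reads and stores the header row if needed.
        names = super().fieldnames
        if names is not None and not getattr(self, "_duplicates_checked", False):
            # Mark as checked before raising, so that a caller who
            # silences the error can keep using the stored names.
            self._duplicates_checked = True
            if len(names) != len(set(names)):
                raise ValueError("Duplicate field names", names)
        return names

    @fieldnames.setter
    def fieldnames(self, value):
        # Explicitly assigned names get checked again on next access.
        self._duplicates_checked = False
        csv.DictReader.fieldnames.fset(self, value)


sample = io.StringIO("name,name,age\nAlice,Smith,30\n")
reader = CarefulDictReader(sample)

caught = None
try:
    reader.fieldnames
except ValueError as exc:
    caught = exc  # exc.args[1] carries the offending header list

# After silencing the exception, the names and the rows are still reachable.
names = reader.fieldnames
row = next(reader)
```

The exception is raised once per header assignment. A caller who chooses to ignore it gets the stock DictReader behavior from then on, including the last-value-wins handling of the duplicated column.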


