Hi,
2013/1/23 J. Cliff Dyer <jcd@sdf.lonestar.org>
> On Tue, 2013-01-22 at 17:51 -0800, alex23 wrote:
>> I don't think we should start adding support for every malformed type
>> of csv file that exists. It's easy enough to remove the unnecessary
>> lines yourself before passing them to DictReader:
>>
>>     from csv import DictReader
>>
>>     with open('malformed.csv', 'rb') as csvfile:
>>         csvlines = list(l for l in csvfile if l.strip())
>>         csvreader = DictReader(csvlines)
>>
>> Personally, if I was dealing with this as often as you are, I'd
>> probably make a custom context manager instead. The problem lies in
>> the files themselves, not in csv's response to them.
>> _______________________________________________
>> Python-ideas mailing list
>> Python-ideas@python.org
>> http://mail.python.org/mailman/listinfo/python-ideas
>
> With all due respect, while you make a good point that we don't want to
> start special casing every malformed type of CSV, there is absolutely
> something wrong with DictReader's response to files that have duplicate
> headers. It throws away data silently.
That's how Python dictionaries work, by design:
    d = {'a': 1, 'a': 2}

"silently" discards the first value.
> If you (and others on this list) aren't in favor of trying to find the
> right header row (which I can understand: "In the face of ambiguity,
> refuse the temptation to guess."), maybe a better solution would be to
> raise a (suppressible) exception if the headers aren't uniquely named.
> ("Errors should never pass silently. Unless explicitly silenced.")
What about a subclass then:

    class CarefulDictReader(csv.DictReader):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            fieldnames = self.fieldnames
            if len(fieldnames) != len(set(fieldnames)):
                raise ValueError("Duplicate field names", fieldnames)
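For what it's worth, a quick check of that subclass (on Python 3, which the
zero-argument super() requires; accessing self.fieldnames is what forces the
header row to be read, and the sample input here is made up):

```python
import csv
import io

class CarefulDictReader(csv.DictReader):
    """DictReader that refuses duplicate header names instead of
    silently letting later columns overwrite earlier ones."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        fieldnames = self.fieldnames  # reading this parses the header row
        if len(fieldnames) != len(set(fieldnames)):
            raise ValueError("Duplicate field names", fieldnames)

# Duplicate headers are rejected up front:
try:
    CarefulDictReader(io.StringIO("a,b,a\r\n1,2,3\r\n"))
except ValueError as e:
    print(e.args)  # ('Duplicate field names', ['a', 'b', 'a'])

# Well-formed input still works normally:
reader = CarefulDictReader(io.StringIO("a,b\r\n1,2\r\n"))
print(list(reader))
```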
Amaury Forgeot d'Arc