Issue 1818: Add named tuple reader to CSV module
Created on 2008-01-13 22:27 by rhettinger, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Messages (42)
Author: Raymond Hettinger (rhettinger) *
Date: 2008-01-13 22:27
Here's a proof-of-concept patch. If approved, will change from generator form to match the other readers and will add a test suite.
The idea corresponds to what is currently done by the dict reader but returns a space and time efficient named tuple instead of a dict. Field order is preserved and named attribute access is supported.
A writer is not needed because named tuples can be fed into the existing writer just like regular tuples.
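For readers who want to see the shape of the idea, here is a minimal sketch in generator form. It is not the attached patch: the function name `namedtuple_reader`, the `typename` parameter, and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import namedtuple


def namedtuple_reader(lines, typename="Row", **fmtparams):
    """Yield one named tuple per CSV row; field names come from the first row."""
    reader = csv.reader(lines, **fmtparams)
    try:
        header = next(reader)
    except StopIteration:
        return  # empty input: nothing to yield
    row_type = namedtuple(typename, header)
    for row in reader:
        yield row_type._make(row)


# Illustrative data only.
sample = ["title,latitude,longitude", "Walgreens,28.339727,-81.596367"]
rows = list(namedtuple_reader(sample))

# Named tuples are ordinary tuples, so the stock writer accepts them unchanged.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
```

Field order follows the header row, and values are reachable as `rows[0].title` as well as by index.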
Author: Raymond Hettinger (rhettinger) *
Date: 2008-01-22 19:25
Barry, any thoughts on this?
Author: Skip Montanaro (skip.montanaro) *
Date: 2008-01-22 20:12
I'd personally be kind of surprised if Barry had any thoughts on this. Is there any reason this couldn't be pushed down into the C code and replace the normal tuple output completely? In the absence of any fieldnames you could just dream some up, like "field001", "field002", etc.
Skip
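A rough pure-Python sketch of the "dream some up" idea follows. The helper names `synthesized_names` and `headerless_reader` are hypothetical, and the sketch assumes every row has the same width as the first one; it says nothing about how this would be done in the C code.

```python
import csv
from collections import namedtuple


def synthesized_names(width):
    """Return placeholder names field001, field002, ... for `width` columns."""
    return [f"field{i:03d}" for i in range(1, width + 1)]


def headerless_reader(lines, **fmtparams):
    """Yield named tuples for a CSV source that has no header row."""
    row_type = None
    for row in csv.reader(lines, **fmtparams):
        if row_type is None:
            # The first data row fixes the number of columns.
            row_type = namedtuple("Row", synthesized_names(len(row)))
        yield row_type._make(row)


rows = list(headerless_reader(["a,b,c", "d,e,f"]))
```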
Author: Jervis Whitley (jdwhitley)
Date: 2009-02-09 09:24
An implementation of a namedtuple reader and writer.
Created a writer for the case where user would like to specify desired field names and default values on missing field names.
e.g.
mywriter = NamedTupleWriter(f, fieldnames=['f1', 'f2', 'f3'], restval='missing')
Nt = namedtuple('LessFields', 'f1 f3')
nt = Nt(f1='one', f3=2)
mywriter.writerow(nt)  # writes one,missing,2
Any thoughts on the case where a defined fieldname has a leading underscore? Should there be a flag to silently ignore it?
e.g.
if self.ignore_underscores:
    fieldname = fieldname.lstrip('_')
Leading underscores may be present in a CSV file that has not been seen in advance. Spaces and other non-alphanumeric characters also pose a problem that does not affect the DictReader class.
Cheers,
Author: Raymond Hettinger (rhettinger) *
Date: 2009-02-09 16:53
Consider providing a hook to a function that converts non-conforming field names (ones with a leading underscore, leading digit, non-letter, keyword, or duplicate name).
class NamedTupleReader:
    def __init__(self, f, fieldnames=None, restkey=None, restval=None,
                 dialect="excel", fieldnamer=None, *args, **kwds):
        ...
I'm going to either post a recipe to do the renaming or provide a static method for the same purpose. It might work like this:
>>> renamer(['abc', 'def', '1', '_hidden', 'abc', 'p', 'abc'])
['abc', 'x_def', 'x_1', 'x_hidden', 'x_abc', 'p', 'x1_abc']
Author: Raymond Hettinger (rhettinger) *
Date: 2009-02-10 01:25
In r69480, named tuples gained the ability to automatically rename invalid fieldnames.
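As a quick illustration of the renaming that `collections.namedtuple` performs with `rename=True`, the snippet below reuses the field list from the `renamer` example. Note that the shipped behaviour replaces each invalid name with an underscore followed by its position, which differs from the `x_` prefixes sketched in the earlier proposal. The type name `Row` is arbitrary.

```python
from collections import namedtuple

raw_names = ["abc", "def", "1", "_hidden", "abc", "p", "abc"]

# Keywords, non-identifiers, leading underscores and duplicates are each
# replaced by "_<index>"; valid, unique names are kept as they are.
Row = namedtuple("Row", raw_names, rename=True)
print(Row._fields)
# → ('abc', '_1', '_2', '_3', '_4', 'p', '_6')
```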
Author: Jervis Whitley (jdwhitley)
Date: 2009-02-10 11:08
Updated NamedTupleReader to give a rename=False keyword argument. rename is passed directly to the namedtuple factory function to enable automatic handling of invalid fieldnames.
Two new tests for the rename keyword.
Cheers,
Author: Rob Renaud (rrenaud)
Date: 2009-02-26 07:38
I am totally new to Python dev. I reinvented a NamedTupleReader tonight, only to find out that it was created a year ago. My primary motivation is that DictReader reads headers nicely, but DictWriter totally sucks at handling them.
Consider doing some filtering on a csv file, like so.
sample_data = [
    'title,latitude,longitude',
    'OHO Ofner & Hammecke Reinigungsgesellschaft mbH,48.128265,11.610848',
    'Kitchen Kaboodle,45.544241,-122.715728',
    'Walgreens,28.339727,-81.596367',
    'Gurnigel Pass,46.731944,7.447778'
]

def filter_with_dict_reader_writer():
    accepted_rows = []
    for row in csv.DictReader(sample_data):
        if float(row['latitude']) > 0.0 and float(row['longitude']) > 0.0:
            accepted_rows.append(row)

    field_names = csv.reader(sample_data).next()
    output_writer = csv.DictWriter(open('accepted_by_dict.csv', 'w'), field_names)
    output_writer.writerow(dict(zip(field_names, field_names)))
    output_writer.writerows(accepted_rows)
You have to work so hard to maintain the headers when you write the file with DictWriter. I understand this is a limitation of dicts throwing away the order information. But namedtuples don't have that problem.
NamedTupleReader and NamedTupleWriter should be inverses. This means that NamedTupleWriter needs to write headers. This should produce the same output as the dict writer example, but it's much cleaner.

def filter_with_named_tuple_reader_writer():
    accepted_rows = []
    for row in csv.NamedTupleReader(sample_data):
        if float(row.latitude) > 0.0 and float(row.longitude) > 0.0:
            accepted_rows.append(row)

    output_writer = csv.NamedTupleWriter(
        open('accepted_by_named_tuple.csv', 'w'))
    output_writer.writerows(accepted_rows)
I patched on top of the existing NamedTupleWriter patch adding support for writing headers. I don't know if that's bad style/etiquette, etc.
Author: Rob Renaud (rrenaud)
Date: 2009-02-26 07:59
My previous patch could write the header twice. But I am not sure how the writer should handle the fieldnames parameter on one hand, and the namedtuple._fields on the other.
Author: Raymond Hettinger (rhettinger) *
Date: 2009-02-26 08:01
The two latest patches (ntreader4.diff and named_tuple_write_header.patch) seem like they are going in the right direction and are getting close.
Barry or Skip, is this something you want in your module?
Author: Skip Montanaro (skip.montanaro) *
Date: 2009-02-26 15:44
Raymond> Barry or Skip, is this something you want in your module?
Sorry, I haven't really looked at this ticket other than to notice its presence. I wrote the DictReader/DictWriter functions way back when, so I'm pretty comfortable using them. I haven't felt the need for any other reader or writer which manipulates file headers.
Skip
Author: Barry A. Warsaw (barry) *
Date: 2009-02-26 15:47
I think it would be useful to have.
Author: Skip Montanaro (skip.montanaro) *
Date: 2009-02-26 19:02
Hrm... I replied twice by email. Only one comment appears to have survived the long trip. Here's my second reply:
Rob> NamedTupleReader and NamedTupleWriter should be inverses. This
Rob> means that NamedTupleWriter needs to write headers. This should
Rob> produce identical output as the dict writer example, but it's much
Rob> cleaner.
You're assuming that one instance of these classes will read or write an entire file. What if you want to append lines to an existing CSV file or pick up reading a file with a new reader which has already been partially processed?
Author: Skip Montanaro (skip.montanaro) *
Date: 2009-02-26 19:04
Let me be more explicit. I don't know how it implements it, but I think you really need to give the user the option of specifying the field names and not reading/writing headers. It can't be implicit as I interpreted Rob's earlier comment:
> NamedTupleReader and NamedTupleWriter should be inverses.
> This means that NamedTupleWriter needs to write headers.
Skip
Author: Jervis Whitley (jdwhitley)
Date: 2009-02-26 21:02
Skip> Let me be more explicit. I don't know how it implements it, but I think
Skip> you really need to give the user the option of specifying the field
Skip> names and not reading/writing headers. It can't be implicit as I
Skip> interpreted Rob's earlier comment:
rrenaud> NamedTupleReader and NamedTupleWriter should be inverses.
rrenaud> This means that NamedTupleWriter needs to write headers.
I agree with Skip, we mustn't have a 'wroteheader' flag internal to the NamedTupleWriter.
Currently to write a 'header' row with a csv.writer you could (for example) pass a tuple of header names to writerow. NamedTupleWriter is no different, you would have a namedtuple of header names instead of a tuple of header names.
I would not like to see another flag added to the initialisation process to enable the writing of a header row as the 'first' (or any) row written to a file. We could add a function 'writeheader' that would write the contents of 'fieldnames' as a row, but I don't like the idea.
Cheers,
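To make the explicit-header approach concrete, here is a small sketch that uses only the stock `csv.writer`. The `Point` type and its values are made up for illustration, and `_fields` (a plain tuple of the field names) stands in for a namedtuple of header names.

```python
import csv
import io
from collections import namedtuple

Point = namedtuple("Point", "x y")
points = [Point(1, 2), Point(3, 4)]

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(Point._fields)  # header row, written only because the caller asks
writer.writerows(points)        # named tuples are written like regular tuples
```

Leaving out the `writerow(Point._fields)` call gives a headerless file, which also covers the append-to-an-existing-file case raised earlier in the thread.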
Author: Rob Renaud (rrenaud)
Date: 2009-02-26 22:18
I want to make sure I understand. Am I correct in believing that Skip thinks writing headers should be optional, while Jervis believes we should leave the burden to the NamedTupleWriter client?
I agree that we should not unconditionally write headers, but I think that we should write headers by default, much like we read them by default.
I believe the implicit header writing is very elegant, and the only reason that the DictWriter object doesn't write headers is the impedance mismatch between dicts and CSV. Namedtuples have the field order information, so the impedance mismatch is gone and we should no longer be hindered. Implicitly reading but not explicitly writing headers just seems wrong.
It also seems wrong to require the construction of "header" namedtuple objects. It's much less natural than dicts holding identity mappings.
>>> Point._make(Point._fields)
Point(x='x', y='y')
That just looks weird and non-obvious to me. That Point instance doesn't really fit in my mind as something that should be a Point.
Author: Skip Montanaro (skip.montanaro) *
Date: 2009-02-27 00:55
Rob> I agree that we should not unconditionally write headers, but I
Rob> think that we should write headers by default, much like we read
Rob> them by default.
I don't think you should write them by default. I've worked with lots of CSV files which have no headers. I can imagine people wanting to write CSV files with multiple headers. It should be optional and explicit.
Skip
Author: Skip Montanaro (skip.montanaro) *
Date: 2009-02-27 01:00
More concretely, I don't think this is so onerous:
names = ["col1", "col2", "color"]
writer = csv.DictWriter(open("f.csv", "wb"), fieldnames=names, ...)
writer.writerow(dict(zip(names, names)))
...
or
f = open("f.csv", "rb")
names = csv.reader(f).next()
reader = csv.DictReader(f, fieldnames=names, ...)
...
Skip
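The recipe above is written for Python 2 (binary file modes and `reader.next()`). A rough Python 3 rendering is sketched below, assuming in-memory `io.StringIO` buffers in place of `f.csv` and an invented data row. It also uses `DictWriter.writeheader()`, which later Python versions provide and which writes the same row as `dict(zip(names, names))`.

```python
import csv
import io

names = ["col1", "col2", "color"]

# Writing: the header is still an explicit, optional call.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=names, lineterminator="\n")
writer.writeheader()
writer.writerow({"col1": "1", "col2": "2", "color": "red"})  # made-up row

# Reading: consume the header row by hand, then pass it as fieldnames.
src = io.StringIO(out.getvalue())
header = next(csv.reader(src))
reader = csv.DictReader(src, fieldnames=header)
rows = list(reader)
```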
Author: Rob Renaud (rrenaud)
Date: 2009-02-27 02:16
I did a search on Google code for the DictReader constructor. I analyzed the first 3 pages, the fieldnames parameter was used in 14 of 27 cases (discounting unittest code built into Python) and was not used in 13 of 27 cases. I suppose that means headered csv files are sufficiently rare that they shouldn't be created implicitly by default. I still don't like the lack of symmetry of supporting implicit header reads, but not implicit header writes.
Author: Raymond Hettinger (rhettinger) *
Date: 2009-02-27 02:43
> I don't think you should write them by default.
> I've worked with lots of CSV files which have no headers.
My experience has been the same as Skip's.
Author: Skip Montanaro (skip.montanaro) *
Date: 2009-02-27 04:45
Rob> I still don't like the lack of symmetry of supporting implicit
Rob> header reads, but not implicit header writes.
A header is nothing more than a row in the CSV file with special interpretation applied by the user. There is nothing implicit about it. If you know the first line is a header, use the recipe I posted. If not, supply your own fieldnames and treat the first row as data.
Skip
Author: Jervis Whitley (jdwhitley)
Date: 2009-03-08 04:34
Added a patch against py3k branch.
in csv.rst removed reference to reader.next() as a public method.
Author: Skip Montanaro (skip.montanaro) *
Date: 2009-03-08 04:40
Jervis> in csv.rst removed reference to reader.next() as a public method.
Because? I've not seen any discussion in this issue or in any other forums (most certainly not on the csv@python.org mailing list) which would suggest that csv.reader's next() method should no longer be a public method.
Skip
Author: Antoine Pitrou (pitrou) *
Date: 2009-03-08 13:33
I don't understand why NamedTupleReader requires the fieldnames array rather than the namedtuple class itself. If you could pass it the namedtuple class, users could choose whatever namedtuple subclass with whatever additional methods or behaviour suits them. It would make NamedTupleReader more flexible and more useful.
Author: Skip Montanaro (skip.montanaro) *
Date: 2009-03-08 19:13
I don't know how NamedTuple objects work, but in many situations you want the content of the CSV file to drive the output. I would think you would use a technique similar to my DictReader example to tell the NamedTupleReader the fieldnames. For that you need a fieldnames argument.
Author: Skip Montanaro (skip.montanaro) *
Date: 2009-03-08 19:21
I retract my previous comment. I don't use the DictReader the way it operates (fieldnames==None => first row is a header) and forgot about that behavior.
Author: Jervis Whitley (jdwhitley)
Date: 2009-03-08 22:53
Jervis> in csv.rst removed reference to reader.next() as a public method.
Skip> Because? I've not seen any discussion in this issue or in any other
Skip> forums (most certainly not on the csv@python.org mailing list) which
Skip> would suggest that csv.reader's next() method should no longer be a
Skip> public method.
I agree, this should be applied separately.
Author: Jervis Whitley (jdwhitley)
Date: 2009-03-08 23:13
Antoine> I don't understand why NamedTupleReader requires the fieldnames
Antoine> array rather than the namedtuple class itself. If you could pass
Antoine> it the namedtuple class, users could choose whatever namedtuple
Antoine> subclass with whatever additional methods or behaviour suits
Antoine> them. It would make NamedTupleReader more flexible and more
Antoine> useful.
The NamedTupleReader does take the namedtuple class as the fieldnames argument. It can be a namedtuple, a 'fieldnames' array or None. If a namedtuple is used as the fieldnames argument, returned rows are created using ._make from this namedtuple. Unless I have read your requirements incorrectly, this is the behaviour you describe.
Given the confusion, I accept that the documentation needs to be improved.
The NamedTupleReader and Writer were created to follow as closely as possible the behaviour (and signature) of the DictReader and DictWriter, with the exception of using namedtuples instead of dicts.
Author: Antoine Pitrou (pitrou) *
Date: 2009-03-08 23:21
Ok, I got misled by the documentation ("The contents of fieldnames are passed directly to be used as the namedtuple fieldnames"), and your implementation is a bit difficult to follow.
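The behaviour under discussion, passing in a namedtuple class and building rows with `._make`, can be sketched independently of the patch. The `Place` subclass, its method, and the `read_as` helper below are all hypothetical names chosen for illustration.

```python
import csv
from collections import namedtuple


class Place(namedtuple("Place", "title latitude longitude")):
    """A namedtuple subclass carrying extra behaviour."""
    __slots__ = ()

    def in_northern_hemisphere(self):
        return float(self.latitude) > 0.0


def read_as(row_type, lines, **fmtparams):
    """Build each CSV row with row_type._make instead of a generated class."""
    return map(row_type._make, csv.reader(lines, **fmtparams))


places = list(read_as(Place, ["Gurnigel Pass,46.731944,7.447778"]))
```

Because `_make` is a classmethod, the rows come back as `Place` instances, so any methods defined on the subclass are available on every row.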
Author: Jervis Whitley (jdwhitley)
Date: 2009-03-09 00:06
Updated version of docs for 2.7 and 3k.
Author: Éric Araujo (eric.araujo) *
Date: 2010-04-12 10:57
See also this python-ideas thread: http://mail.python.org/pipermail/python-ideas/2010-April/006991.html
Author: Skip Montanaro (skip.montanaro) *
Date: 2010-04-12 17:08
Type conversion is a whole 'nuther kettle of fish. This particular thread is long and complex enough that it shouldn't be made more complex.
Author: Mark Lawrence (BreamoreBoy) *
Date: 2010-07-17 19:18
I suggest that this is closed unless anyone shows an active interest in it.
Author: Mark Lawrence (BreamoreBoy) *
Date: 2010-07-25 08:38
Closing as no response to .
Author: Raymond Hettinger (rhettinger) *
Date: 2010-07-25 19:12
Re-opening because we ought to do something along these lines at some point. The DictReader and DictWriter are inadequate for preserving order and they are unnecessarily memory intensive (one dict per record).
FWIW, the non-conforming field name problem has already been solved by recent improvements to collections.namedtuple using rename=True.
Author: Raymond Hettinger (rhettinger) *
Date: 2010-09-02 00:33
Unassigning, this needs fresh thought and a fresh patch from someone who can devote a little deep thinking on how to solve this problem cleanly. In the meantime, it is no problem to simply cast the CSV tuples into named tuples.
Author: Daniel Lenski (dlenski) *
Date: 2015-02-10 21:41
Here's the class I have been using for reading namedtuples from CSV files:
from collections import namedtuple
from itertools import imap
import csv
class CsvNamedTupleReader(object):
    __slots__ = ('_r', 'row', 'fieldnames')
    def __init__(self, *args, **kwargs):
        self._r = csv.reader(*args, **kwargs)
        self.row = namedtuple("row", self._r.next())
        self.fieldnames = self.row._fields
    def __iter__(self):
        #FIXME: how about this? return imap(self.row._make, self._r[:len(self.fieldnames)]
        return imap(self.row._make, self._r)
    dialect = property(lambda self: self._r.dialect)
    line_num = property(lambda self: self._r.line_num)
This class wraps csv.reader since it doesn't seem to be possible to inherit from it. It uses itertools.imap to iterate over the rows output by csv.reader and convert them to the namedtuple class.
One thing that needs fixing (marked with FIXME above) is what to do in the case of a row which has more fields than the header row. The simplest solution is simply to truncate such a row, but perhaps more options are needed, similar to those offered by DictReader.
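One possible way to handle the FIXME, sketched for Python 3 (so `next()` and a generator replace `.next()` and `imap`). The `restval` keyword is borrowed by analogy with DictReader and is an assumption of this sketch, as is the choice to skip blank lines; long rows are truncated and short rows are padded.

```python
import csv
from collections import namedtuple


class CsvNamedTupleReader:
    __slots__ = ("_r", "row", "fieldnames", "restval")

    def __init__(self, *args, restval=None, **kwargs):
        self._r = csv.reader(*args, **kwargs)
        self.row = namedtuple("row", next(self._r))  # first row is the header
        self.fieldnames = self.row._fields
        self.restval = restval  # filler for missing trailing fields

    def __iter__(self):
        width = len(self.fieldnames)
        for fields in self._r:
            if not fields:
                continue  # csv.reader yields [] for a blank line
            fields = fields[:width]  # truncate rows longer than the header
            fields += [self.restval] * (width - len(fields))  # pad short rows
            yield self.row._make(fields)

    dialect = property(lambda self: self._r.dialect)
    line_num = property(lambda self: self._r.line_num)


rows = list(CsvNamedTupleReader(["a,b", "1,2,3", "4", "", "5,6"], restval=""))
```

Truncation silently drops data, so a stricter variant might raise an error or collect the surplus fields instead, in the spirit of DictReader's restkey.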
Author: Ilia Kurenkov (copper-head) *
Date: 2015-04-20 04:07
As my contribution during the sprints at PyCon 2015, I've tweaked Jervis's patch a little and updated the tests/docs to work with Python 3.5.
My only real change was placing the basic reader object inside a generator expression that filters out empty lines. Being partial to functional programming I find this removes some of the code clutter in next(), letting that method focus on turning rows into tuples.
Hopefully this will rekindle the discussion!
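The filtering step described above might look roughly like the following. This is a sketch of the technique rather than the code in 1818_py35.diff, and the sample lines are invented.

```python
import csv

lines = ["x,y", "", "1,2", "", "3,4"]

# csv.reader yields an empty list for a blank line, so a truthiness test
# is enough to drop those rows before they reach the tuple-building code.
non_empty = (row for row in csv.reader(lines) if row)
rows = list(non_empty)
```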
Author: Raymond Hettinger (rhettinger) *
Date: 2015-04-20 04:12
Skip or Barry, do you want to look at this?
Author: Ilia Kurenkov (copper-head) *
Date: 2015-05-11 02:00
Friendly reminder that this exists.
I know everyone's busy and this is marked as low-priority, but I'm gonna keep bumping this till we add a solution :)
Author: Skip Montanaro (skip.montanaro) *
Date: 2015-05-11 13:56
I looked at this six years ago. I still haven't found a situation where I pined for a NamedTupleReader. That said, I have no objection to committing it if others, more well-versed in current Python code and NamedTuples than I, give it a pass. Note that I added a couple comments to the csv.py diff, but nobody either updated the code or explained why I was out in the weeds in my comments.
Author: Skip Montanaro (skip.montanaro) *
Date: 2018-01-29 22:17
FWIW, I relinquished my check-in privileges quite a while ago. This should almost certainly no longer be assigned to me.
S
History
Date                 User            Action  Args
2022-04-11 14:56:29  admin           set     github: 46143
2020-12-22 02:45:54  rhettinger      set     status: open -> closed; stage: patch review -> resolved
2018-01-29 22:17:50  skip.montanaro  set     messages: +
2018-01-29 20:57:29  rhettinger      set     priority: low -> normal; versions: + Python 3.8, - Python 3.5
2015-05-29 06:50:46  ced             set     nosy: + ced
2015-05-11 13:56:54  skip.montanaro  set     messages: +
2015-05-11 02:00:23  copper-head     set     messages: +
2015-04-20 04:12:03  rhettinger      set     versions: - Python 3.3; nosy: + skip.montanaro; messages: +; assignee: skip.montanaro; stage: needs patch -> patch review
2015-04-20 04:07:57  copper-head     set     files: + 1818_py35.diff; versions: + Python 3.5; nosy: + copper-head; messages: +
2015-02-10 21:41:01  dlenski         set     nosy: + dlenski; messages: +
2014-02-03 19:05:18  BreamoreBoy     set     nosy: - BreamoreBoy
2012-12-14 03:53:54  asvetlov        set     nosy: + asvetlov
2012-09-06 12:06:51  ainur0160       set     nosy: - ainur0160
2012-09-06 11:48:56  ainur0160       set     nosy: + ainur0160
2010-09-02 00:33:39  rhettinger      set     priority: normal -> low; versions: - Python 3.2; messages: +; assignee: rhettinger -> (no value); stage: patch review -> needs patch
2010-07-25 19:12:12  rhettinger      set     status: closed -> open; assignee: barry -> rhettinger; messages: +
2010-07-25 08:38:11  BreamoreBoy     set     status: pending -> closed; messages: +
2010-07-17 19:18:15  BreamoreBoy     set     status: open -> pending; versions: + Python 3.2, Python 3.3, - Python 3.1, Python 2.7; nosy: + BreamoreBoy; messages: +
2010-05-20 20:38:50  skip.montanaro  set     nosy: - skip.montanaro
2010-04-12 17:08:27  skip.montanaro  set     messages: +
2010-04-12 10:57:47  eric.araujo     set     nosy: + eric.araujo; messages: +
2009-03-09 00:08:01  jdwhitley       set     files: - ntreader5_py3_1.diff
2009-03-09 00:07:46  jdwhitley       set     files: + ntreader6_py27.diff
2009-03-09 00:07:04  jdwhitley       set     files: + ntreader6_py3.diff; messages: +
2009-03-08 23:21:31  pitrou          set     messages: +
2009-03-08 23:13:33  jdwhitley       set     messages: +
2009-03-08 22:53:06  jdwhitley       set     files: + ntreader5_py3_1.diff; messages: +
2009-03-08 19:21:00  skip.montanaro  set     messages: +
2009-03-08 19:19:42  skip.montanaro  set     messages: -
2009-03-08 19:18:15  skip.montanaro  set     messages: +
2009-03-08 19:13:24  skip.montanaro  set     messages: +
2009-03-08 13:33:50  pitrou          set     nosy: + pitrou; messages: +
2009-03-08 04:40:59  skip.montanaro  set     messages: +
2009-03-08 04:34:07  jdwhitley       set     files: + ntreader4_py3_1.diff; messages: +
2009-02-27 04:45:49  skip.montanaro  set     messages: +
2009-02-27 02:43:22  rhettinger      set     messages: +
2009-02-27 02:16:55  rrenaud         set     messages: +
2009-02-27 01:00:23  skip.montanaro  set     messages: +
2009-02-27 00:55:40  skip.montanaro  set     messages: +
2009-02-26 22:18:40  rrenaud         set     messages: +
2009-02-26 21:02:58  jdwhitley       set     messages: +
2009-02-26 19:04:27  skip.montanaro  set     messages: +
2009-02-26 19:02:03  skip.montanaro  set     messages: +
2009-02-26 15:47:25  barry           set     messages: +
2009-02-26 15:44:54  skip.montanaro  set     messages: +
2009-02-26 08:01:10  rhettinger      set     type: enhancement; stage: patch review; messages: +; versions: + Python 3.1, Python 2.7, - Python 2.6
2009-02-26 07:59:24  rrenaud         set     files: - named_tuple_write_header.patch
2009-02-26 07:59:15  rrenaud         set     files: + named_tuple_write_header2.patch; messages: +
2009-02-26 07:38:36  rrenaud         set     files: + named_tuple_write_header.patch; nosy: + rrenaud; messages: +
2009-02-10 11:08:09  jdwhitley       set     files: + ntreader4.diff; messages: +
2009-02-10 01:25:16  rhettinger      set     messages: +
2009-02-09 16:54:00  rhettinger      set     messages: +
2009-02-09 09:25:00  jdwhitley       set     files: + ntreader3.diff; nosy: + jdwhitley; messages: +; keywords: + patch
2008-01-22 20:12:41  skip.montanaro  set     nosy: + skip.montanaro; messages: +
2008-01-22 19:25:56  rhettinger      set     messages: +
2008-01-13 22:27:14  rhettinger      create