Issue 1818: Add named tuple reader to CSV module

Created on 2008-01-13 22:27 by rhettinger, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (42)

msg59866 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2008-01-13 22:27

Here's a proof-of-concept patch. If approved, will change from generator form to match the other readers and will add a test suite.

The idea corresponds to what is currently done by the dict reader but returns a space and time efficient named tuple instead of a dict. Field order is preserved and named attribute access is supported.

A writer is not needed because named tuples can be fed into the existing writer just like regular tuples.
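
A minimal sketch of the generator form described above, written for Python 3. The names `namedtuple_reader` and `Row` are illustrative assumptions and are not taken from the attached patch.

```python
import csv
import io
from collections import namedtuple


def namedtuple_reader(iterable, typename="Row", **kwds):
    """Yield one named tuple per CSV row; field names come from the first row."""
    rows = csv.reader(iterable, **kwds)
    try:
        header = next(rows)
    except StopIteration:
        return  # empty input: nothing to yield
    Row = namedtuple(typename, header)
    for row in rows:
        yield Row._make(row)


data = io.StringIO("name,lat,lon\nWalgreens,28.339727,-81.596367\n")
records = list(namedtuple_reader(data))
print(records[0].name, records[0].lat)

# Named tuples go straight into the existing writer, like regular tuples.
out = io.StringIO()
csv.writer(out).writerows(records)
```

Field order follows the header row, and a row whose length differs from the header raises `TypeError` in this sketch.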

msg61523 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2008-01-22 19:25

Barry, any thoughts on this?

msg61532 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2008-01-22 20:12

I'd personally be kind of surprised if Barry had any thoughts on this. Is there any reason this couldn't be pushed down into the C code and replace the normal tuple output completely? In the absence of any fieldnames you could just dream some up, like "field001", "field002", etc.
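
A rough Python-level sketch of the "dream some up" idea, assuming the width of the first row fixes the number of fields. The helper names `default_fieldnames` and `rows_with_default_names` are hypothetical; nothing like this exists in the csv module.

```python
import csv
import io
from collections import namedtuple


def default_fieldnames(width):
    """Return ['field001', 'field002', ...] for the given number of columns."""
    return ["field%03d" % i for i in range(1, width + 1)]


def rows_with_default_names(iterable):
    """Yield named tuples whose field names are generated from the first row's width."""
    Row = None
    for row in csv.reader(iterable):
        if Row is None:
            Row = namedtuple("Row", default_fieldnames(len(row)))
        yield Row._make(row)


first = next(rows_with_default_names(io.StringIO("a,b,c\n")))
print(first.field001, first.field003)
```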

Skip

msg81453 - (view)

Author: Jervis Whitley (jdwhitley)

Date: 2009-02-09 09:24

An implementation of a namedtuple reader and writer.

Created a writer for the case where user would like to specify desired field names and default values on missing field names.

e.g. mywriter = NamedTupleWriter(f, fieldnames=['f1', 'f2', 'f3'], restval='missing')

Nt = namedtuple('LessFields', 'f1 f3')
nt = Nt(f1='one', f3=2)

mywriter.writerow(nt) # writes one,missing,2

Any thoughts on the case where a defined fieldname has a leading underscore? Should there be a flag to silently ignore it?

e.g.
if self.ignore_underscores:
    fieldname = fieldname.lstrip('_')

Leading underscores may be present in an unsighted csv file, additionally, spaces and other non alpha numeric characters pose a problem that does not affect the DictReader class.

Cheers,

msg81464 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2009-02-09 16:53

Consider providing a hook to a function that converts non-conforming field names (ones with a leading underscore, leading digit, non-letter, keyword, or duplicate name).

class NamedTupleReader:
    def __init__(self, f, fieldnames=None, restkey=None, restval=None,
                 dialect="excel", fieldnamer=None, *args, **kwds):
        . . .

I'm going to either post a recipe to do the renaming or provide a static method for the same purpose. It might work like this:

>>> renamer(['abc', 'def', '1', '_hidden', 'abc', 'p', 'abc'])
['abc', 'x_def', 'x_1', 'x_hidden', 'x_abc', 'p', 'x1_abc']

msg81518 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2009-02-10 01:25

In r69480, named tuples gained the ability to automatically rename invalid fieldnames.
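
For illustration, this is how the `rename` option of `collections.namedtuple` behaves on the same list of problem names. Invalid, keyword, and duplicate names are replaced with positional names of the form `_N`, which differs from the `x_` scheme in the renamer sketch earlier in the thread. The type name `Row` is only an example.

```python
from collections import namedtuple

header = ['abc', 'def', '1', '_hidden', 'abc', 'p', 'abc']

# rename=True replaces keywords, non-identifiers, names with a leading
# underscore, and duplicates with an underscore plus the field's position.
Row = namedtuple('Row', header, rename=True)
print(Row._fields)
# ('abc', '_1', '_2', '_3', '_4', 'p', '_6')
```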

msg81537 - (view)

Author: Jervis Whitley (jdwhitley)

Date: 2009-02-10 11:08

Updated NamedTupleReader to give a rename=False keyword argument. rename is passed directly to the namedtuple factory function to enable automatic handling of invalid fieldnames.

Two new tests for the rename keyword.

Cheers,

msg82744 - (view)

Author: Rob Renaud (rrenaud)

Date: 2009-02-26 07:38

I am totally new to Python dev. I reinvented a NamedTupleReader tonight, only to find out that it was created a year ago. My primary motivation is that DictReader reads headers nicely, but DictWriter totally sucks at handling them.

Consider doing some filtering on a csv file, like so.

sample_data = [
    'title,latitude,longitude',
    'OHO Ofner & Hammecke Reinigungsgesellschaft mbH,48.128265,11.610848',
    'Kitchen Kaboodle,45.544241,-122.715728',
    'Walgreens,28.339727,-81.596367',
    'Gurnigel Pass,46.731944,7.447778'
]

def filter_with_dict_reader_writer():
    accepted_rows = []
    for row in csv.DictReader(sample_data):
        if float(row['latitude']) > 0.0 and float(row['longitude']) > 0.0:
            accepted_rows.append(row)

    field_names = csv.reader(sample_data).next()
    output_writer = csv.DictWriter(open('accepted_by_dict.csv', 'w'), field_names)
    output_writer.writerow(dict(zip(field_names, field_names)))
    output_writer.writerows(accepted_rows)

You have to work so hard to maintain the headers when you write the file with DictWriter. I understand this is a limitation of dicts throwing away the order information. But namedtuples don't have that problem.

NamedTupleReader and NamedTupleWriter should be inverses. This means that NamedTupleWriter needs to write headers. This should produce identical output as the dict writer example, but it's much cleaner.

def filter_with_named_tuple_reader_writer():
    accepted_rows = []
    for row in csv.NamedTupleReader(sample_data):
        if float(row.latitude) > 0.0 and float(row.longitude) > 0.0:
            accepted_rows.append(row)

    output_writer = csv.NamedTupleWriter(
        open('accepted_by_named_tuple.csv', 'w'))
    output_writer.writerows(accepted_rows)

I patched on top of the existing NamedTupleWriter patch adding support for writing headers. I don't know if that's bad style/etiquette, etc.

msg82745 - (view)

Author: Rob Renaud (rrenaud)

Date: 2009-02-26 07:59

My previous patch could write the header twice. But I am not sure how the writer should handle the fieldnames parameter on one hand, and the namedtuple._fields on the other.

msg82746 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2009-02-26 08:01

The two latest patches (ntreader4.diff and named_tuple_write_header.patch) seem like they are going in the right direction and are getting close.

Barry or Skip, is this something you want in your module?

msg82764 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2009-02-26 15:44

Raymond> Barry or Skip, is this something you want in your module?

Sorry, I haven't really looked at this ticket other than to notice its presence. I wrote the DictReader/DictWriter functions way back when, so I'm pretty comfortable using them. I haven't felt the need for any other reader or writer which manipulates file headers.

Skip

msg82765 - (view)

Author: Barry A. Warsaw (barry) * (Python committer)

Date: 2009-02-26 15:47

I think it would be useful to have.

msg82770 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2009-02-26 19:02

Hrm... I replied twice by email. Only one comment appears to have survived the long trip. Here's my second reply:

Rob> NamedTupleReader and NamedTupleWriter should be inverses.  This
Rob> means that NamedTupleWriter needs to write headers.  This should
Rob> produce identical output as the dict writer example, but it's much
Rob> cleaner.

You're assuming that one instance of these classes will read or write an entire file. What if you want to append lines to an existing CSV file or pick up reading a file with a new reader which has already been partially processed?

msg82771 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2009-02-26 19:04

Let me be more explicit. I don't know how it implements it, but I think you really need to give the user the option of specifying the field names and not reading/writing headers. It can't be implicit as I interpreted Rob's earlier comment:

> NamedTupleReader and NamedTupleWriter should be inverses.
> This means that NamedTupleWriter needs to write headers.

Skip

msg82778 - (view)

Author: Jervis Whitley (jdwhitley)

Date: 2009-02-26 21:02

Skip> Let me be more explicit. I don't know how it implements it, but I think
Skip> you really need to give the user the option of specifying the field
Skip> names and not reading/writing headers. It can't be implicit as I
Skip> interpreted Rob's earlier comment:

rrenaud> NamedTupleReader and NamedTupleWriter should be inverses.
rrenaud> This means that NamedTupleWriter needs to write headers.

I agree with Skip, we mustn't have a 'wroteheader' flag internal to the NamedTupleWriter.

Currently to write a 'header' row with a csv.writer you could (for example) pass a tuple of header names to writerow. NamedTupleWriter is no different, you would have a namedtuple of header names instead of a tuple of header names.

I would not like to see another flag added to the initialisation process to enable the writing of a header row as the 'first' (or any) row written to a file. We could add a function 'writeheader' that would write the contents of 'fieldnames' as a row, but I don't like the idea.
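
As a sketch of the explicit approach, the plain `csv.writer` can already emit a header from a named tuple class by writing its `_fields`, with no flag and no dedicated writer class. The `Point` type and the sample values are made up for the example.

```python
import csv
import io
from collections import namedtuple

Point = namedtuple("Point", "x y")
points = [Point(1, 2), Point(3, 4)]

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(Point._fields)  # explicit header row, written only when wanted
writer.writerows(points)        # named tuples are written like regular tuples
print(out.getvalue())
```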

Cheers,

msg82780 - (view)

Author: Rob Renaud (rrenaud)

Date: 2009-02-26 22:18

I want to make sure I understand. Am I correct in believing that Skip thinks writing headers should be optional, while Jervis believes we should leave the burden to the NamedTupleWriter client?

I agree that we should not unconditionally write headers, but I think that we should write headers by default, much like we read them by default.

I believe the implicit header writing is very elegant, and the only reason that the DictWriter object doesn't write headers is the impedance mismatch between dicts and CSV. Namedtuples have the field order information, so the impedance mismatch is gone and we should no longer be hindered. Implicitly reading but not explicitly writing headers just seems wrong.

It also seems wrong to require the construction of "header" namedtuple objects. It's much less natural than dicts holding identity mappings.

>>> Point._make(Point._fields)
Point(x='x', y='y')

To me, that just looks weird and non-obvious. That Point instance doesn't really fit in my mind as something that should be a Point.

msg82798 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2009-02-27 00:55

Rob> I agree that we should not unconditionally write headers, but I
Rob> think that we should write headers by default, much like we read
Rob> them by default.

I don't think you should write them by default. I've worked with lots of CSV files which have no headers. I can imagine people wanting to write CSV files with multiple headers. It should be optional and explicit.

Skip

msg82799 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2009-02-27 01:00

More concretely, I don't think this is so onerous:

names = ["col1", "col2", "color"]
writer = csv.DictWriter(open("f.csv", "wb"), fieldnames=names, ...)
writer.writerow(dict(zip(names, names)))
...

or

f = open("f.csv", "rb")
names = csv.reader(f).next()
reader = csv.DictReader(f, fieldnames=names, ...)
...
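
For readers on current Python, a sketch of the same two recipes in Python 3. It uses `DictWriter.writeheader()`, which was added in Python 2.7 and 3.2, and the built-in `next()` in place of the `.next()` method. An in-memory `io.StringIO` buffer stands in for the file, and the sample row is invented for the example.

```python
import csv
import io

names = ["col1", "col2", "color"]

# Writing: the header is still an explicit, optional step.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=names)
writer.writeheader()
writer.writerow({"col1": "1", "col2": "2", "color": "red"})

# Reading: consume the header row yourself, then pass it as fieldnames.
src = io.StringIO(out.getvalue())
names_read = next(csv.reader(src))
reader = csv.DictReader(src, fieldnames=names_read)
rows = list(reader)
print(rows)
```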

Skip

msg82812 - (view)

Author: Rob Renaud (rrenaud)

Date: 2009-02-27 02:16

I did a search on Google code for the DictReader constructor. I analyzed the first 3 pages, the fieldnames parameter was used in 14 of 27 cases (discounting unittest code built into Python) and was not used in 13 of 27 cases. I suppose that means headered csv files are sufficiently rare that they shouldn't be created implicitly by default. I still don't like the lack of symmetry of supporting implicit header reads, but not implicit header writes.

msg82814 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2009-02-27 02:43

> I don't think you should write them by default.
> I've worked with lots of CSV files which have no headers.

My experience has been the same as Skip's.

msg82819 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2009-02-27 04:45

Rob> I still don't like the lack of symmetry of supporting implicit
Rob> header reads, but not implicit header writes.

A header is nothing more than a row in the CSV file with special interpretation applied by the user. There is nothing implicit about it. If you know the first line is a header, use the recipe I posted. If not, supply your own fieldnames and treat the first row as data.

Skip

msg83298 - (view)

Author: Jervis Whitley (jdwhitley)

Date: 2009-03-08 04:34

Added a patch against py3k branch.

in csv.rst removed reference to reader.next() as a public method.

msg83299 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2009-03-08 04:40

Jervis> in csv.rst removed reference to reader.next() as a public method.

Because? I've not seen any discussion in this issue or in any other forums (most certainly not on the csv@python.org mailing list) which would suggest that csv.reader's next() method should no longer be a public method.

Skip

msg83310 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-03-08 13:33

I don't understand why NamedTupleReader requires the fieldnames array rather than the namedtuple class itself. If you could pass it the namedtuple class, users could choose whatever namedtuple subclass with whatever additional methods or behaviour suits them. It would make NamedTupleReader more flexible and more useful.

msg83318 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2009-03-08 19:13

I don't know how NamedTuple objects work, but in many situations you want the content of the CSV file to drive the output. I would think you would use a technique similar to my DictReader example to tell the NamedTupleReader the fieldnames. For that you need a fieldnames argument.

msg83321 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2009-03-08 19:21

I retract my previous comment. I don't use the DictReader the way it operates (fieldnames==None => first row is a header) and forgot about that behavior.

msg83332 - (view)

Author: Jervis Whitley (jdwhitley)

Date: 2009-03-08 22:53

Jervis> in csv.rst removed reference to reader.next() as a public method.

Skip> Because? I've not seen any discussion in this issue or in any
Skip> other forums (most certainly not on the csv@python.org mailing
Skip> list) which would suggest that csv.reader's next() method should
Skip> no longer be a public method.

I agree, this should be applied separately.

msg83333 - (view)

Author: Jervis Whitley (jdwhitley)

Date: 2009-03-08 23:13

Antoine> I don't understand why NamedTupleReader requires the
Antoine> fieldnames array rather than the namedtuple class itself.
Antoine> If you could pass it the namedtuple class, users could choose
Antoine> whatever namedtuple subclass with whatever additional methods
Antoine> or behaviour suits them. It would make NamedTupleReader more
Antoine> flexible and more useful.

The NamedTupleReader does take the namedtuple class as the fieldnames argument. It can be a namedtuple, a 'fieldnames' array or None. If a namedtuple is used as the fieldnames argument, returned rows are created using ._make from this namedtuple. Unless I have read your requirements incorrectly, this is the behaviour you describe.
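
To illustrate the behaviour being discussed, here is a small sketch in which the caller supplies a namedtuple subclass and rows are built with its `_make` classmethod. The `Place` class and the `read_as` helper are hypothetical and do not reproduce the patch's implementation.

```python
import csv
import io
from collections import namedtuple


class Place(namedtuple("Place", "title latitude longitude")):
    """A namedtuple subclass carrying extra behaviour chosen by the user."""
    __slots__ = ()

    @property
    def northern(self):
        return float(self.latitude) > 0.0


def read_as(row_type, iterable, skip_header=True):
    """Return an iterator of row_type instances built with row_type._make."""
    rows = csv.reader(iterable)
    if skip_header:
        next(rows, None)  # discard the header row if there is one
    return map(row_type._make, rows)


data = io.StringIO("title,latitude,longitude\nGurnigel Pass,46.731944,7.447778\n")
places = list(read_as(Place, data))
print(places[0].title, places[0].northern)
```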

Given the confusion, I accept that the documentation needs to be improved.

The NamedTupleReader and Writer were created to follow as closely as possible the behaviour (and signature) of the DictReader and DictWriter, with the exception of using namedtuples instead of dicts.

msg83334 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-03-08 23:21

Ok, I got misled by the documentation ("The contents of fieldnames are passed directly to be used as the namedtuple fieldnames"), and your implementation is a bit difficult to follow.

msg83340 - (view)

Author: Jervis Whitley (jdwhitley)

Date: 2009-03-09 00:06

Updated version of docs for 2.7 and 3k.

msg102936 - (view)

Author: Éric Araujo (eric.araujo) * (Python committer)

Date: 2010-04-12 10:57

See also this python-ideas thread: http://mail.python.org/pipermail/python-ideas/2010-April/006991.html

msg102959 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2010-04-12 17:08

Type conversion is a whole 'nuther kettle of fish. This particular thread is long and complex enough that it shouldn't be made more complex.

msg110598 - (view)

Author: Mark Lawrence (BreamoreBoy) *

Date: 2010-07-17 19:18

I suggest that this is closed unless anyone shows an active interest in it.

msg111523 - (view)

Author: Mark Lawrence (BreamoreBoy) *

Date: 2010-07-25 08:38

Closing as no response to .

msg111552 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2010-07-25 19:12

Re-opening because we ought to do something along these lines at some point. The DictReader and DictWriter are inadequate for preserving order and they are unnecessarily memory intensive (one dict per record).

FWIW, the non-conforming field name problem has already been solved by recent improvements to collections.namedtuple using rename=True.

msg115348 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2010-09-02 00:33

Unassigning, this needs fresh thought and a fresh patch from someone who can devote a little deep thinking on how to solve this problem cleanly. In the meantime, it is no problem to simply cast the CSV tuples into named tuples.

msg235710 - (view)

Author: Daniel Lenski (dlenski) *

Date: 2015-02-10 21:41

Here's the class I have been using for reading namedtuples from CSV files:

from collections import namedtuple
from itertools import imap
import csv

class CsvNamedTupleReader(object):
    __slots__ = ('_r', 'row', 'fieldnames')
    def __init__(self, *args, **kwargs):
        self._r = csv.reader(*args, **kwargs)
        self.row = namedtuple("row", self._r.next())
        self.fieldnames = self.row._fields

    def __iter__(self):
        #FIXME: how about this? return imap(self.row._make, self._r[:len(self.fieldnames)]
        return imap(self.row._make, self._r)

    dialect = property(lambda self: self._r.dialect)
    line_num = property(lambda self: self._r.line_num)

This class wraps csv.reader since it doesn't seem to be possible to inherit from it. It uses itertools.imap to iterate over the rows output by csv.reader and convert them to the namedtuple class.

One thing that needs fixing (marked with FIXME above) is what to do in the case of a row which has more fields than the header row. The simplest solution is simply to truncate such a row, but perhaps more options are needed, similar to those offered by DictReader.
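
One possible shape for the truncation option, sketched as a Python 3 port of the class above (`next()` replaces `.next()`, and a generator replaces `itertools.imap`). Truncating is only the simplest of the choices mentioned; rows shorter than the header still raise `TypeError` here.

```python
import csv
import io
from collections import namedtuple


class CsvNamedTupleReader:
    __slots__ = ('_r', 'row', 'fieldnames')

    def __init__(self, *args, **kwargs):
        self._r = csv.reader(*args, **kwargs)
        self.row = namedtuple("row", next(self._r))
        self.fieldnames = self.row._fields

    def __iter__(self):
        width = len(self.fieldnames)
        for fields in self._r:
            # Drop any fields beyond the width of the header row.
            yield self.row._make(fields[:width])

    dialect = property(lambda self: self._r.dialect)
    line_num = property(lambda self: self._r.line_num)


reader = CsvNamedTupleReader(io.StringIO("a,b\n1,2,3\n4,5\n"))
rows = list(reader)
print(rows)
```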

msg241599 - (view)

Author: Ilia Kurenkov (copper-head) *

Date: 2015-04-20 04:07

As my contribution during the sprints at PyCon 2015, I've tweaked Jervis's patch a little and updated the tests/docs to work with Python 3.5.

My only real change was placing the basic reader object inside a generator expression that filters out empty lines. Being partial to functional programming I find this removes some of the code clutter in next(), letting that method focus on turning rows into tuples.
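
A rough sketch of that arrangement, assuming a reader class that takes its field names from the first non-empty row. The class name `NonEmptyRowReader` is hypothetical and the code is not copied from 1818_py35.diff.

```python
import csv
import io
from collections import namedtuple


class NonEmptyRowReader:
    def __init__(self, f, **kwds):
        # csv.reader yields [] for blank lines; filter them out once, here.
        self._rows = (row for row in csv.reader(f, **kwds) if row)
        # The first non-empty row supplies the field names.
        self._Row = namedtuple("Row", next(self._rows))

    def __iter__(self):
        return self

    def __next__(self):
        # With blank lines already filtered, this only converts rows.
        return self._Row._make(next(self._rows))


records = list(NonEmptyRowReader(io.StringIO("a,b\n\n1,2\n\n3,4\n")))
print(records)
```

In this sketch an input with no non-empty rows makes the constructor raise `StopIteration`; a real implementation would need to decide how to handle that case.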

Hopefully this will rekindle the discussion!

msg241601 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2015-04-20 04:12

Skip or Barry, do you want to look at this?

msg242879 - (view)

Author: Ilia Kurenkov (copper-head) *

Date: 2015-05-11 02:00

Friendly reminder that this exists.

I know everyone's busy and this is marked as low-priority, but I'm gonna keep bumping this till we add a solution :)

msg242893 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2015-05-11 13:56

I looked at this six years ago. I still haven't found a situation where I pined for a NamedTupleReader. That said, I have no objection to committing it if others, more well-versed in current Python code and NamedTuples than I, give it a pass. Note that I added a couple comments to the csv.py diff, but nobody either updated the code or explained why I was out in the weeds in my comments.

msg311182 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2018-01-29 22:17

FWIW, I relinquished my check-in privileges quite a while ago. This should almost certainly no longer be assigned to me.

S

History

Date

User

Action

Args

2022-04-11 14:56:29

admin

set

github: 46143

2020-12-22 02:45:54

rhettinger

set

status: open -> closed
stage: patch review -> resolved

2018-01-29 22:17:50

skip.montanaro

set

messages: +

2018-01-29 20:57:29

rhettinger

set

priority: low -> normal
versions: + Python 3.8, - Python 3.5

2015-05-29 06:50:46

ced

set

nosy: + ced

2015-05-11 13:56:54

skip.montanaro

set

messages: +

2015-05-11 02:00:23

copper-head

set

messages: +

2015-04-20 04:12:03

rhettinger

set

versions: - Python 3.3
nosy: + skip.montanaro

messages: +

assignee: skip.montanaro
stage: needs patch -> patch review

2015-04-20 04:07:57

copper-head

set

files: + 1818_py35.diff
versions: + Python 3.5
nosy: + copper-head

messages: +

2015-02-10 21:41:01

dlenski

set

nosy: + dlenski
messages: +

2014-02-03 19:05:18

BreamoreBoy

set

nosy: - BreamoreBoy

2012-12-14 03:53:54

asvetlov

set

nosy: + asvetlov

2012-09-06 12:06:51

ainur0160

set

nosy: - ainur0160

2012-09-06 11:48:56

ainur0160

set

nosy: + ainur0160

2010-09-02 00:33:39

rhettinger

set

priority: normal -> low
versions: - Python 3.2
messages: +

assignee: rhettinger -> (no value)
stage: patch review -> needs patch

2010-07-25 19:12:12

rhettinger

set

status: closed -> open
assignee: barry -> rhettinger
messages: +

2010-07-25 08:38:11

BreamoreBoy

set

status: pending -> closed

messages: +

2010-07-17 19:18:15

BreamoreBoy

set

status: open -> pending
versions: + Python 3.2, Python 3.3, - Python 3.1, Python 2.7
nosy: + BreamoreBoy

messages: +

2010-05-20 20:38:50

skip.montanaro

set

nosy: - skip.montanaro

2010-04-12 17:08:27

skip.montanaro

set

messages: +

2010-04-12 10:57:47

eric.araujo

set

nosy: + eric.araujo
messages: +

2009-03-09 00:08:01

jdwhitley

set

files: - ntreader5_py3_1.diff

2009-03-09 00:07:46

jdwhitley

set

files: + ntreader6_py27.diff

2009-03-09 00:07:04

jdwhitley

set

files: + ntreader6_py3.diff

messages: +

2009-03-08 23:21:31

pitrou

set

messages: +

2009-03-08 23:13:33

jdwhitley

set

messages: +

2009-03-08 22:53:06

jdwhitley

set

files: + ntreader5_py3_1.diff

messages: +

2009-03-08 19:21:00

skip.montanaro

set

messages: +

2009-03-08 19:19:42

skip.montanaro

set

messages: -

2009-03-08 19:18:15

skip.montanaro

set

messages: +

2009-03-08 19:13:24

skip.montanaro

set

messages: +

2009-03-08 13:33:50

pitrou

set

nosy: + pitrou
messages: +

2009-03-08 04:40:59

skip.montanaro

set

messages: +

2009-03-08 04:34:07

jdwhitley

set

files: + ntreader4_py3_1.diff
messages: +

2009-02-27 04:45:49

skip.montanaro

set

messages: +

2009-02-27 02:43:22

rhettinger

set

messages: +

2009-02-27 02:16:55

rrenaud

set

messages: +

2009-02-27 01:00:23

skip.montanaro

set

messages: +

2009-02-27 00:55:40

skip.montanaro

set

messages: +

2009-02-26 22:18:40

rrenaud

set

messages: +

2009-02-26 21:02:58

jdwhitley

set

messages: +

2009-02-26 19:04:27

skip.montanaro

set

messages: +

2009-02-26 19:02:03

skip.montanaro

set

messages: +

2009-02-26 15:47:25

barry

set

messages: +

2009-02-26 15:44:54

skip.montanaro

set

messages: +

2009-02-26 08:01:10

rhettinger

set

type: enhancement
stage: patch review
messages: +
versions: + Python 3.1, Python 2.7, - Python 2.6

2009-02-26 07:59:24

rrenaud

set

files: - named_tuple_write_header.patch

2009-02-26 07:59:15

rrenaud

set

files: + named_tuple_write_header2.patch
messages: +

2009-02-26 07:38:36

rrenaud

set

files: + named_tuple_write_header.patch
nosy: + rrenaud
messages: +

2009-02-10 11:08:09

jdwhitley

set

files: + ntreader4.diff
messages: +

2009-02-10 01:25:16

rhettinger

set

messages: +

2009-02-09 16:54:00

rhettinger

set

messages: +

2009-02-09 09:25:00

jdwhitley

set

files: + ntreader3.diff
nosy: + jdwhitley
messages: +
keywords: + patch

2008-01-22 20:12:41

skip.montanaro

set

nosy: + skip.montanaro
messages: +

2008-01-22 19:25:56

rhettinger

set

messages: +

2008-01-13 22:27:14

rhettinger

create