Issue 23041: csv needs more quoting rules (original) (raw)

Created on 2014-12-12 16:36 by samwyse, last changed 2022-04-11 14:58 by admin.

Messages (26)

msg232560 - (view)

Author: Samwyse (samwyse) *

Date: 2014-12-12 16:36

The csv module currently implements four quoting rules for dialects: QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC and QUOTE_NONE. These rules treat values of None the same as an empty string, i.e. by outputting two consecutive quotes. I propose the addition of two new rules, QUOTE_NOTNULL and QUOTE_STRINGS. The former behaves like QUOTE_ALL while the later behaves like QUOTE_NONNUMERIC, except that in both cases values of None are output as an empty field. Examples follow.

Current behavior (which will remain unchanged)

csv.register_dialect('quote_all', quoting=csv.QUOTE_ALL) csv.writer(sys.stdout, dialect='quote_all').writerow(['foo', None, 42]) "foo","","42"

csv.register_dialect('quote_nonnumeric', quoting=csv.QUOTE_NONNUMERIC) csv.writer(sys.stdout, dialect='quote_nonnumeric').writerow(['foo', None, 42]) "foo","",42

Proposed behavior

csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL) csv.writer(sys.stdout, dialect='quote_notnull').writerow(['foo', None, 42]) "foo",,"42"

csv.register_dialect('quote_strings', quoting=csv.QUOTE_STRINGS) csv.writer(sys.stdout, dialect='quote_strings').writerow(['foo', None, 42]) "foo",,42

msg232561 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2014-12-12 16:52

As an enhancement, this could be added only to 3.5. The proposal sounds reasonable to me.

msg232563 - (view)

Author: Samwyse (samwyse) *

Date: 2014-12-12 17:27

David: That's not a problem for me.

Sorry I can't provide real patches, but I'm not in a position to compile (much less test) the C implementation of _csv. I've looked at the code online and below are the changes that I think need to be made. My use cases don't require special handing when reading empty fields, so the only changes I've made are to the code for writers. I did verify that the reader code mostly only checks for QUOTE_NOTNULL when parsing. This means that completely empty fields will continue to load as zero-length strings, not None. I won't stand in the way of anyone wanting to "fix" that for these new rules.

typedef enum { QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC, QUOTE_NONE, QUOTE_STRINGS, QUOTE_NOTNULL } QuoteStyle;

static StyleDesc quote_styles[] = { { QUOTE_MINIMAL, "QUOTE_MINIMAL" }, { QUOTE_ALL, "QUOTE_ALL" }, { QUOTE_NONNUMERIC, "QUOTE_NONNUMERIC" }, { QUOTE_NONE, "QUOTE_NONE" }, { QUOTE_STRINGS, "QUOTE_STRINGS" }, { QUOTE_NOTNULL, "QUOTE_NOTNULL" }, { 0 } };

    switch (dialect->quoting) {
    case QUOTE_NONNUMERIC:
        quoted = !PyNumber_Check(field);
        break;
    case QUOTE_ALL:
        quoted = 1;
        break;
    case QUOTE_STRINGS:
        quoted = PyString_Check(field);
        break;
    case QUOTE_NOTNULL:
        quoted = field != Py_None;
        break;
    default:
        quoted = 0;
        break;
    }

" csv.QUOTE_MINIMAL means only when required, for example, when a\n" " field contains either the quotechar or the delimiter\n" " csv.QUOTE_ALL means that quotes are always placed around fields.\n" " csv.QUOTE_NONNUMERIC means that quotes are always placed around\n" " fields which do not parse as integers or floating point\n" " numbers.\n" " csv.QUOTE_STRINGS means that quotes are always placed around\n" " fields which are strings. Note that the Python value None\n" " is not a string.\n" " csv.QUOTE_NOTNULL means that quotes are only placed around fields\n" " that are not the Python value None.\n"

msg232596 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2014-12-13 06:46

Samwyse, are these suggestions just based on ideas of what could be done or have you encountered real-world CSV data exchanges that couldn't be handled by the CSV module?

msg232607 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2014-12-13 13:05

It doesn't look like a difficult change, but is it really needed? I guess my reaction is the same as Raymond's. Are there real-world uses where the current set of quoting styles isn't sufficient?

msg232630 - (view)

Author: Chaitanya agrawal (krypten) *

Date: 2014-12-14 05:14

Used function PyUnicode_Check instead of PyString_Check

msg232676 - (view)

Author: Samwyse (samwyse) *

Date: 2014-12-15 19:40

Yes, it's based on a real-world need. I work for a Fortune 500 company and we have an internal tool that exports CSV files using what I've described as the QUOTE_NOTNULL rules. I need to create similar files for re-importation. Right now, I have to post-process the output of my Python program to get it right. I added in the QUOTE_STRINGS rule for completeness. I think these two new rules would be useful for anyone wanting to create sparse CSV files.

msg232677 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2014-12-15 19:52

If I understand correctly, your software needs to distinguish between

wrote ["foo", "", 42, None] with quote_all in effect

"foo","","42",""

and

wrote ["foo", None, 42, ""] with quote_nonnull in effect

"foo",,"42",""

so you in effect want to transmit some type information through a CSV file?

msg232681 - (view)

Author: Samwyse (samwyse) *

Date: 2014-12-15 21:17

Skip, I don't have any visibility into how the Java program I'm feeding data into works, I'm just trying to replicate the csv files that it exports as accurately as possible. It has several other quirks, but I can replicate all of them using Dialects; this is the only "feature" I can't. The files I'm looking at have quoted strings and numbers, but there aren't any quoted empty strings. I'm using a DictWriter to create similar csv files, where missing keys are treated as values of None, so I'd like those printed without quotes. If we also want to print empty strings without quotes, that wouldn't impact me at all.

Besides my selfish needs, this could be useful to anyone wanting to reduce the save of csv files that have lots of empty fields, but wants to quote all non-empty values. This may be an empty set, I don't know.

msg261141 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2016-03-02 23:06

Thanks for the update berker.peksag. I'm still not convinced that the csv module should be modified just so one user (sorry samwyse) can match the input format of someone's Java program. It seems a bit like trying to make the csv module type-sensitive. What happens when someone finds a csv file containing timestamps in a format other than the datetime.datetime object will produce by default? Why is None special as an object where bool(obj) is False?

I think the better course here is to either:

If someone else thinks this is something which belongs in Python's csv module, feel free to reopen and assign it to yourself.

msg261146 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-03-03 05:37

The csv module is already type-sensitive (with QUOTE_NONNUMERIC). I agree, that we shouldn't modify the csv module just for one user and one program.

If a standard CVS library in Java (or other popular laguages) differentiates between empty string and null value when read from CSV, it would be a serious argument to support this in Python. Quick search don't find this.

msg261147 - (view)

Author: Berker Peksag (berker.peksag) * (Python committer)

Date: 2016-03-03 07:09

I was thinking adding a more flexible API like:

...
spamwriter = csv.writer(csvfile, quoting_callable=lambda field: field is not None)
...

But that would require too much change in the csv module (or at least its implementation wouldn't be trivial).

I agree that subclassing DictWriter is a much better way to achieve this.

msg341460 - (view)

Author: (tegdev)

Date: 2019-05-05 17:19

The correct handling of None values belongs to the csv module.

There is a use case to migrate a DB2 database to PostgreSQL. DB2 has a command line tool "db2 export ..." which produces csv-files. A row ['Hello', null, 'world'] is exported to "Hello,,"world".

I would like to read in these exports with python and put it to PostgreSQL.

But with the csv library I can't read it in correctly. The input is converted to: ['Hello', '', 'world'] It should read as: ['Hello', None, 'world']

It is pretty easy to write a correct CSV reader with ANTLR but it's terribly slow.

And last but not least: if someone writes a list the reading should the identity. Thats not True for the csv libraray.

Example:

import csv hello_out_lst = ['Hello', None, 'world'] with open('hello.csv', 'w') as ofh: writer = csv.writer(ofh, delimiter=',') writer.writerow(hello_out_lst)

with open('hello.csv', 'r') as ifh: reader = csv.reader(ifh, delimiter=',') for row in reader: hello_in_lst = row

is_equal = hello_out_lst == hello_in_lst print(f'{hello_out_lst} is equal {hello_in_lst} ? {is_equal}')

The result is: ['Hello', None, 'world'] is equal ['Hello', '', 'world'] ? False

msg358461 - (view)

Author: Yoong Hor Meng (yoonghm)

Date: 2019-12-16 02:20

There is a real requirement for csv to handle an empty field vs a empty string """". csv.QUOTE_NOTNULL could be useful.

msg396621 - (view)

Author: Alexandre Morignot (erdnaxeli)

Date: 2021-06-28 09:23

I have another use case. I want to import data into PostgreSQL using the COPY FROM command. This command can read a CSV input and it needs to distinguish empty string from null values.

Could we reconsider this issue and the proposed solution?

msg396641 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2021-06-28 15:57

Okay, I'll reopen this, at least for the discussion of QUOTE_NONNULL. @erdnaxeli please given an example of how PostgreSQL distinguishes between the empty string and None cases. Is it a quoted empty string vs an empty field? If so, modifying @samwyse's original example, is this what you are after?

csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL) csv.writer(sys.stdout, dialect='quote_notnull').writerow(['', None, 42]) "",,"42"

? Can you modify the original two patches to restrict to QUOTE_NONNULL?

msg396642 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2021-06-28 15:58

Missed tweaking a couple settings.

msg396643 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2021-06-28 16:00

Ugh... s/QUOTE_NONNULL/QUOTE_NOTNULL/

Not, Non, None... Perl would treat them all the same, right?

msg401607 - (view)

Author: Miha Šetina (msetina)

Date: 2021-09-10 18:28

The MS Synapse has a CSV support in its COPY command that would benefit from the proposed csv.QUOTE_NOTNULL as it can be used when preparing data for import. As they reference the same RFC the CSV modul should support, both the proposed modifications would extend the RFC functionality with support for NULL/None values in the data. The csv.QUOTE_STRINGS would enable a smaller file size. Our case is that we are moving data into Synapse and have to resort to risk prone string replace functionality to provide the MS Synapse COPY a proper support for empty string and NULL

msg401608 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2021-09-10 18:35

Update version - too late for anything older than 3.11.

msg405951 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2021-11-08 14:02

Note to @samwyse and @krypten: I updated the patches and created a pull request on GitHub, but I have no way of knowing if at least krypten has signed a CLA for Python. Since you're the author of the original patches, we need to verify that you have.

msg406013 - (view)

Author: Skip Montanaro (skip.montanaro) * (Python triager)

Date: 2021-11-09 12:06

Further question... All the discussion has been on the writer side of the csv module. Is there any reason that using QUOTE_STRINGS or QUOTE_NOTNULL should have an effect when reading? For example, should this line on input

"",,1,'a'

produce this list

["", None, "1", "a"]

with QUOTE_NOTNULL in effect, and

["", "", 1, "a"]

or

["", None, 1, "a"]

with QUOTE_STRINGS in effect?

msg406015 - (view)

Author: Miha Šetina (msetina)

Date: 2021-11-09 12:33

I would say that it should follow a rule: both cases values of None are output as an empty field

so empty filed should map to None with both QUOTE_NOTNULL and QUOTE_STRINGS

that would make: ["", None, 1, "a"]

for QUOTE_STRINGS in effect. This would also make it possible to have circular usage. Export with one quoting rule should be able to read with the same rule and produce same data.

msg406017 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2021-11-09 12:51

The quoting style affects not only the CSV formatting, but also the CSV parsing. How QUOTE_NOTNULL and QUOTE_STRINGS will affect parsing?

msg412463 - (view)

Author: Miha Šetina (msetina)

Date: 2022-02-03 20:08

Is this still on track for python 3.11?

msg413867 - (view)

Author: Samwyse (samwyse) *

Date: 2022-02-24 00:07

I just signed the contributor agreement. (Thought I had done that last year but I don’t see any emails. Is there any place to check?)

I agree that round-tripping should Bebe possible for any value of quoting.

Hopefully this will finally get done before its eighth birthday.

History

Date

User

Action

Args

2022-04-11 14:58:11

admin

set

github: 67230

2022-02-24 00:07:17

samwyse

set

messages: +

2022-02-03 20:08:22

msetina

set

messages: +

2021-11-09 12:51:14

serhiy.storchaka

set

messages: +

2021-11-09 12:33:11

msetina

set

messages: +

2021-11-09 12:06:46

skip.montanaro

set

messages: +

2021-11-08 14:02:34

skip.montanaro

set

messages: +

2021-11-08 13:57:11

skip.montanaro

set

stage: needs patch -> patch review
pull_requests: + <pull%5Frequest27722>

2021-09-10 18:35:17

skip.montanaro

set

messages: +
versions: - Python 3.6, Python 3.7, Python 3.8, Python 3.9, Python 3.10

2021-09-10 18:28:11

msetina

set

nosy: + msetina

messages: +
versions: + Python 3.6, Python 3.7, Python 3.8, Python 3.9, Python 3.10

2021-06-28 16:00:31

skip.montanaro

set

messages: +

2021-06-28 15:58:26

skip.montanaro

set

resolution: rejected ->
stage: resolved -> needs patch
messages: +
versions: + Python 3.11, - Python 3.8

2021-06-28 15:57:52

skip.montanaro

set

status: closed -> open

messages: +

2021-06-28 09:23:46

erdnaxeli

set

nosy: + erdnaxeli
messages: +

2019-12-16 02:20:00

yoonghm

set

nosy: + yoonghm

messages: +
versions: + Python 3.8, - Python 3.6

2019-05-05 17:19:38

tegdev

set

nosy: + tegdev
messages: +

2016-03-03 07:09:55

berker.peksag

set

status: open -> closed

messages: +
stage: patch review -> resolved

2016-03-03 05:37:28

serhiy.storchaka

set

nosy: + serhiy.storchaka
messages: +

2016-03-02 23:06:14

skip.montanaro

set

resolution: rejected
messages: +

2016-02-28 05:13:01

berker.peksag

set

nosy: + berker.peksag
stage: needs patch -> patch review

versions: + Python 3.6, - Python 3.5

2014-12-15 21:17:02

samwyse

set

messages: +

2014-12-15 19:52:44

skip.montanaro

set

messages: +

2014-12-15 19:40:10

samwyse

set

messages: +

2014-12-14 05:45:01

krypten

set

files: + issue23041_test.patch

2014-12-14 05:14:55

krypten

set

files: + issue23041.patch

nosy: + krypten
messages: +

keywords: + patch

2014-12-13 13:05:03

skip.montanaro

set

messages: +

2014-12-13 06:46:50

rhettinger

set

assignee: skip.montanaro

messages: +
nosy: + rhettinger, skip.montanaro

2014-12-12 17:27:41

samwyse

set

messages: +

2014-12-12 16:52:33

r.david.murray

set

versions: - Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.6
nosy: + r.david.murray

messages: +

keywords: + easy
stage: needs patch

2014-12-12 16:36:45

samwyse

create