[Python-Dev] CSV, bytes and encodings

skip at pobox.com skip at pobox.com
Wed Apr 1 12:37:38 CEST 2009


>> Having read through the ticket, it seems that a CSV file must be (and
>> 2.6 was) treated as a binary file, and part of the CSV module's job
>> is to convert that binary data to and from strings.

Antoine> IMO this interpretation is flawed.  In 2.6 there is no tangible
Antoine> difference between "binary" and "text" files, except for
Antoine> newline handling. Also, as a matter of fact, if you want the
Antoine> 2.x CSV module to read a file with Windows line endings, you
Antoine> have to open the file in "rU" mode (that is, the closest we
Antoine> have to a moral equivalent of the 3.x text files).

The problem is that fields in CSV files, at least those produced by Excel, can contain embedded newlines. You are welcome to decide that all CRLF pairs should be translated to LF, but that is not the decision the original authors (mostly Andrew McNamara) made. The contents of the fields were deemed to be separate from the newline convention, so the csv module needed to do its own newline processing, and thus required files to be opened in binary mode.
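To make the embedded-newline case concrete, here is a minimal sketch using the Python 3 csv and io modules (not the 2.x code under discussion). The sample data and the names raw, buf and rows are made up for illustration. Passing newline='' leaves line endings untouched, so the csv module alone decides which ones end a row:

```python
import csv
import io

# Hypothetical sample: CRLF row terminators, plus a quoted field that
# itself contains an embedded CRLF (as Excel can produce).
raw = 'id,comment\r\n1,"line one\r\nline two"\r\n'

# newline='' disables newline translation in the text layer, so the csv
# module sees the original line endings and does its own processing.
buf = io.StringIO(raw, newline='')
rows = list(csv.reader(buf))

# Two logical rows, even though the text spans three physical lines.
print(len(rows))
# The embedded CRLF is still part of the field value.
print(repr(rows[1][1]))
```

The row terminators are consumed by the reader, while the CRLF inside the quoted field comes back as data.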

This case arises rarely, but it does turn up every now and again. If you are comfortable with translating all CRLF pairs into LF, whether they are true end-of-line markers or embedded content, that's fine. (It certainly simplifies the implementation.) However, a) I would run it past the folks on csv at python.org first, and b) I would put a big fat note in the module docs about the transformation.

Antoine> Therefore, I don't think 2.x is of any guidance to us for what
Antoine> 3.x should do.

I suspect we will disagree on this. I believe the behavior of the 2.x version of the module is easily defensible and should be a useful guide to how the 3.x version of the module behaves.

>> The documentation says "If csvfile is a file object, it must be
>> opened with the 'b' flag on platforms where that makes a difference."

Antoine> The documentation is, IMO, wrong even in 2.x. Just yesterday I
Antoine> had to open a CSV file in 'rU' mode because it had Windows line
Antoine> endings and I'm under Linux....

See above. You almost certainly didn't have fields containing CRLF pairs or didn't care that while reading the file your data values were silently altered.
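A rough illustration of that silent alteration, again in Python 3 terms rather than 2.x 'rU' mode: wrapping the bytes in io.TextIOWrapper with newline=None (universal newlines, the same default open() uses in text mode) translates every CRLF to LF before the csv module sees it. The sample bytes and variable names are hypothetical.

```python
import csv
import io

# Hypothetical file contents: one quoted field with an embedded CRLF.
data = b'id,comment\r\n1,"line one\r\nline two"\r\n'

# newline=None turns on universal-newlines translation: every CRLF,
# embedded or not, becomes LF before csv.reader gets the text.
translated = io.TextIOWrapper(io.BytesIO(data), encoding='ascii', newline=None)
translated_rows = list(csv.reader(translated))

# newline='' passes the line endings through unchanged for comparison.
untouched = io.TextIOWrapper(io.BytesIO(data), encoding='ascii', newline='')
untouched_rows = list(csv.reader(untouched))

print(repr(translated_rows[1][1]))  # embedded CRLF has become a bare LF
print(repr(untouched_rows[1][1]))   # embedded CRLF preserved
```

No error is raised in either case, which is why the change is easy to miss unless you compare the field values.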

Skip


