(original) (raw)

On Sun, Jan 19, 2014 at 7:21 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:

> long as numpy.loadtxt is explicitly documented as only working with
\> latin-1 encoded files (it currently isn't), there's no problem.

Actually there is problem. If it explicitly specified the encoding as
latin-1 when opening the file then it could document the fact that it
works for latin-1 encoded files. However it actually uses the system
default encoding to read the file

which is a really bad default -- oh well. Also, I don't think it was a choice, at least not a well thought out one, but rather what fell out of tryin gto make it "just work" on py3.

and then converts the strings to
bytes with the as\_bytes function that is hard-coded to use latin-1:
https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28

So it only works if the system default encoding is latin-1 and the
file content is white-space and newline compatible with latin-1.
Regardless of whether the file itself is in utf-8 or latin-1 it will
only work if the system default encoding is latin-1\. I've never used a
system that had latin-1 as the default encoding (unless you count
cp1252 as latin-1).

even if it was a common default it would be a "bad idea". Fortunately (?), so it really is broken, we can fix it without being too constrained by backwards compatibility.

\> If it's supposed to work with other encodings (but the entire file is
\> still required to use a consistent encoding), then it just needs
\> encoding and errors arguments to fit the Python 3 text model (with
\> "latin-1" documented as the default encoding).

This is the right solution. Have an encoding argument, document the
fact that it will use the system default encoding if none is
specified, and re-encode using the same encoding to fit any dtype='S'
bytes column. This will then work for any encoding including the ones
that aren't ASCII-compatible (e.g. utf-16).

Exactly, except I dont think the system encoding as a default is a good choice. If there is a default MOST people will use it. And it will work for a lot of their test code. Then it will break if the code is passed to a system with a different default encoding, or a file comes from another source in a different encoding. This is very, very likely. Far more likely that files consistently being in the system encoding....

> default behaviour, since passing something like
\> codecs.getdecoder("utf-8") as a column converter should do the right
\> thing.

that seems to work at the moment, actually, if done with care.

That's just getting silly IMO. If the file uses mixed encodings then I
don't consider it a valid "text file" and see no reason for loadtxt to
support reading it.

agreed -- that's just getting crazy -- the only use-case I can image is to clean up a file that got moji-baked by some other process -- not really the use case for loadtxt and friends.

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax

Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@noaa.gov