[Python-Dev] PEP 461 updates
Chris Barker chris.barker at noaa.gov
Tue Jan 21 17:57:52 CET 2014
On Sun, Jan 19, 2014 at 7:21 AM, Oscar Benjamin <oscar.j.benjamin at gmail.com> wrote:
> > long as numpy.loadtxt is explicitly documented as only working with latin-1 encoded files (it currently isn't), there's no problem.
> Actually there is a problem. If it explicitly specified the encoding as latin-1 when opening the file then it could document the fact that it works for latin-1 encoded files. However it actually uses the system default encoding to read the file
which is a really bad default -- oh well. Also, I don't think it was a choice, at least not a well-thought-out one, but rather what fell out of trying to make it "just work" on py3.
> and then converts the strings to bytes with the asbytes function that is hard-coded to use latin-1: https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
> So it only works if the system default encoding is latin-1 and the file content is white-space and newline compatible with latin-1. Regardless of whether the file itself is in utf-8 or latin-1 it will only work if the system default encoding is latin-1. I've never used a system that had latin-1 as the default encoding (unless you count cp1252 as latin-1).
Even if it were a common default, it would be a "bad idea". Fortunately (?), since it really is broken, we can fix it without being too constrained by backwards compatibility.
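To make the failure mode above concrete, here is a minimal sketch of the decode-then-asbytes sequence. It does not call numpy; `decode_then_asbytes` is a hypothetical stand-in, and the "system default encoding" is simulated by passing it in as an argument.

```python
def decode_then_asbytes(raw: bytes, default_encoding: str) -> bytes:
    """Hypothetical stand-in for the sequence described above (not numpy code)."""
    # Step 1: the file is read as text using the (simulated) system default.
    text = raw.decode(default_encoding)
    # Step 2: the text is turned back into bytes with a hard-coded latin-1 encode.
    return text.encode("latin-1")


utf8_file = "café 1.0".encode("utf-8")
latin1_file = "café 1.0".encode("latin-1")

# With a latin-1 default, the bytes survive unchanged, whatever the file's
# real encoding is, because latin-1 maps every byte value to one code point.
assert decode_then_asbytes(utf8_file, "latin-1") == utf8_file
assert decode_then_asbytes(latin1_file, "latin-1") == latin1_file

# With a utf-8 default, a utf-8 file comes back as different bytes ...
assert decode_then_asbytes(utf8_file, "utf-8") != utf8_file

# ... a latin-1 file cannot be decoded at all ...
try:
    decode_then_asbytes(latin1_file, "utf-8")
except UnicodeDecodeError:
    print("latin-1 file, utf-8 default: decode error")

# ... and any character outside latin-1 cannot be re-encoded.
try:
    decode_then_asbytes("€ 1.0".encode("utf-8"), "utf-8")
except UnicodeEncodeError:
    print("utf-8 file with a euro sign, utf-8 default: encode error")
```

Only the latin-1 default gives a lossless round trip here, which matches the description quoted above.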
> > If it's supposed to work with other encodings (but the entire file is still required to use a consistent encoding), then it just needs encoding and errors arguments to fit the Python 3 text model (with "latin-1" documented as the default encoding).
> This is the right solution. Have an encoding argument, document the fact that it will use the system default encoding if none is specified, and re-encode using the same encoding to fit any dtype='S' bytes column. This will then work for any encoding including the ones that aren't ASCII-compatible (e.g. utf-16).
Exactly, except I don't think the system encoding is a good choice of default. If there is a default, MOST people will use it. And it will work for a lot of their test code. Then it will break if the code is run on a system with a different default encoding, or a file comes from another source in a different encoding. This is very, very likely. Far more likely than files consistently being in the system encoding....
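A rough sketch of the "decode with an explicit encoding, re-encode with the same encoding" idea, written without numpy. The function name `read_first_column` and its signature are made up for illustration. Making `encoding` a required argument is one way to avoid leaning on the system default; it is an assumption of this sketch, not something settled in the thread.

```python
import io


def read_first_column(stream, encoding, errors="strict"):
    """Hypothetical helper: return the first whitespace-separated field of
    each line as bytes, re-encoded with the encoding used to decode it."""
    # Decode the binary stream as text with the caller-supplied encoding.
    text = io.TextIOWrapper(stream, encoding=encoding, errors=errors)
    column = []
    for line in text:
        fields = line.split()  # splitting happens on decoded text, not on bytes
        if fields:
            # Re-encode with the same encoding, as a bytes column would need.
            column.append(fields[0].encode(encoding, errors))
    return column


sample = "naïve 1.0\nzoë 2.0\n"
for enc in ("utf-8", "latin-1", "utf-16-le"):
    column = read_first_column(io.BytesIO(sample.encode(enc)), enc)
    assert column == ["naïve".encode(enc), "zoë".encode(enc)]
```

Because the split is done on decoded text, this also handles utf-16-le, which is not ASCII-compatible. The sketch uses "utf-16-le" rather than plain "utf-16" because the latter prepends a byte-order mark on every encode call.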
> > default behaviour, since passing something like codecs.getdecoder("utf-8") as a column converter should do the right thing.
That seems to work at the moment, actually, if done with care.
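One detail that probably falls under "with care": the function returned by codecs.getdecoder does not return just the text. A small standard-library illustration follows; `utf8_converter` is a hypothetical wrapper name, and how loadtxt hands fields to converters is not shown here.

```python
import codecs

decode_utf8 = codecs.getdecoder("utf-8")

# The raw decoder returns a (decoded_text, bytes_consumed) tuple.
result = decode_utf8(b"caf\xc3\xa9")
assert result == ("café", 5)


def utf8_converter(field: bytes) -> str:
    """Hypothetical per-column converter: keep only the decoded text."""
    return decode_utf8(field)[0]


assert utf8_converter(b"caf\xc3\xa9") == "café"
```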
> That's just getting silly IMO. If the file uses mixed encodings then I don't consider it a valid "text file" and see no reason for loadtxt to support reading it.
Agreed -- that's just getting crazy -- the only use case I can imagine is to clean up a file that got mojibaked by some other process -- not really the use case for loadtxt and friends.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker at noaa.gov