[Python-Dev] PEP 461 updates
Oscar Benjamin oscar.j.benjamin at gmail.com
Sun Jan 19 16:21:25 CET 2014
On 19 January 2014 06:19, Nick Coghlan <ncoghlan at gmail.com> wrote:
> While I agree it's not relevant to the PEP 460/461 discussions, so long as numpy.loadtxt is explicitly documented as only working with latin-1 encoded files (it currently isn't), there's no problem.
Actually there is a problem. If numpy.loadtxt explicitly specified latin-1 as the encoding when opening the file, then it could be documented as working for latin-1 encoded files. However, it actually uses the system default encoding to read the file and then converts the strings to bytes with the as_bytes function, which is hard-coded to use latin-1: https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28

So it only works if the system default encoding is latin-1 and the file content is whitespace- and newline-compatible with latin-1. Regardless of whether the file itself is in utf-8 or latin-1, it will only work if the system default encoding is latin-1, and I've never used a system that had latin-1 as the default encoding (unless you count cp1252 as latin-1).
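To make the failure mode concrete, here's a minimal sketch (this as_bytes is a simplified copy of the numpy helper; the filename is made up):

    def as_bytes(s):
        # Simplified version of numpy's helper: latin-1 is hard-coded,
        # regardless of what encoding the text was read with.
        if isinstance(s, bytes):
            return s
        return s.encode('latin-1')

    # open() with no encoding argument uses the system default encoding.
    with open('data.txt') as f:
        for line in f:
            fields = [as_bytes(field) for field in line.split()]
            # A field like 'é' decodes fine from a utf-8 file, but as_bytes()
            # re-encodes it as latin-1, giving b'\xe9' rather than the
            # b'\xc3\xa9' that is actually in the file, and any character
            # outside latin-1 raises UnicodeEncodeError.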
> If it's supposed to work with other encodings (but the entire file is still required to use a consistent encoding), then it just needs encoding and errors arguments to fit the Python 3 text model (with "latin-1" documented as the default encoding).
This is the right solution. Add an encoding argument, document that the system default encoding is used when none is specified, and re-encode with the same encoding to fill any dtype='S' bytes column. This will then work for any encoding, including ones that aren't ASCII-compatible (e.g. utf-16).
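As a rough sketch of what I mean (read_table and its signature are invented for illustration, not the actual loadtxt API):

    import locale

    def read_table(path, encoding=None):
        # Hypothetical sketch: decode with the caller's encoding, falling
        # back to the system default as loadtxt effectively does now, and
        # re-encode bytes columns with the *same* codec so the round trip
        # is consistent.
        if encoding is None:
            encoding = locale.getpreferredencoding(False)
        rows = []
        with open(path, encoding=encoding) as f:
            for line in f:
                # Re-encode each field with the original codec to fill a
                # dtype='S' bytes column.
                rows.append([field.encode(encoding) for field in line.split()])
        return rows

(For a BOM-prefixed codec like utf-16 you'd re-encode the fields with the endian-specific variant, e.g. utf-16-le, but the principle is the same.)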
Then instead of having a compat module with an as_bytes helper to get rid of all the unicode strings on Python 3, you can have a compat module with an open_unicode helper that does the right thing on Python 2. The as_bytes function is just a way of fighting the Python 3 text model: "I don't care about mojibake, just do whatever it takes to shut up the interpreter and its error messages and make sure it works for ASCII data."
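A sketch of that helper (the name open_unicode is mine; io.open already provides the Python 3 text semantics on Python 2):

    import io

    def open_unicode(path, encoding=None, errors='strict'):
        # On Python 2, io.open returns a file object that yields unicode
        # strings, matching Python 3's open(); on Python 3, io.open is an
        # alias for the builtin open().
        return io.open(path, encoding=encoding, errors=errors)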
> If it is intended to allow S columns to contain text in arbitrary encodings, then that should also be supported by the current API with an adjustment to the default behaviour, since passing something like codecs.getdecoder("utf-8") as a column converter should do the right thing. However, if you're currently decoding S columns with latin-1 before passing the value to the converter, then you'll need to use a WSGI style decoding dance instead:
>
>     def fixencoding(text):
>         return text.encode("latin-1").decode("utf-8")  # For example
That's just getting silly IMO. If the file uses mixed encodings then I don't consider it a valid "text file" and see no reason for loadtxt to support reading it.
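For completeness, wiring up that dance would look something like this (assuming, as described above, that the converter receives a latin-1-decoded string; the filename is made up and I haven't checked that loadtxt is happy with dtype=object here):

    import numpy as np

    def fixencoding(text):
        # Undo the implicit latin-1 decode, then decode with the file's
        # real encoding (utf-8 here, for example).
        return text.encode("latin-1").decode("utf-8")

    # Apply the re-decoding dance to column 0 only.
    data = np.loadtxt("data.txt", dtype=object, converters={0: fixencoding})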
> That's more wasteful than just passing the raw bytes through for decoding, but is the simplest backwards compatible option if you're doing latin-1 decoding already.
>
> If different rows in the same column are allowed to have different encodings, then that's not a valid use of the operation (since the column converter has no access to the rest of the row to determine what encoding should be used for the decode operation).
Ditto.
Oscar