[Python-Dev] PEP 461 updates
Nick Coghlan ncoghlan at gmail.com
Sun Jan 19 07:19:00 CET 2014
- Previous message: [Python-Dev] PEP 461 updates
- Next message: [Python-Dev] PEP 461 updates
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 19 January 2014 00:39, Oscar Benjamin <oscar.j.benjamin at gmail.com> wrote:
If you want to draw a relevant lesson from that thread in this one then the lesson argues against PEP 461: adding back the bytes formatting methods helps people who refuse to understand text processing and continue implementing dirty hacks instead of doing it properly.
Yes, that's why it has taken so long to even consider bringing binary interpolation support back - one of our primary concerns in the early days of Python 3 was developers (including core developers!) carrying bad habits from Python 2 into Python 3 by continuing to treat binary data as text. Making interpolation a purely text-domain operation strongly helped enforce this distinction, as it generally required thinking about encoding issues in order to get data into the text domain (or hitting it with the "latin-1" hammer, in which case... sigh).
The reason PEP 460/461 came up is that we do acknowledge there is a legitimate use case for binary interpolation when dealing with binary formats that contain ASCII-compatible segments. Now that people have had a few years to get used to the Python 3 text model, lowering the barrier to migration from Python 2 and better handling that use case in Python 3 has finally tilted the scales in favour of providing the feature (assuming Guido is happy with PEP 461 after Ethan finishes the Rationale section).
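(For readers unfamiliar with the proposal: PEP 461 restores %-style interpolation for bytes objects, and the feature eventually landed in Python 3.5. A minimal sketch of the kind of ASCII-compatible binary format it targets - the protocol line here is just an illustrative example:)

```python
# Bytes %-interpolation as proposed by PEP 461 (available in Python 3.5+).
# %d renders numbers as ASCII digits; %s accepts bytes-like operands.
status = 200
reason = b"OK"
status_line = b"HTTP/1.1 %d %s\r\n" % (status, reason)
print(status_line)  # b'HTTP/1.1 200 OK\r\n'
```

The point is that the text segments of such a format can be built without ever leaving the binary domain, while any actual text payload still requires an explicit encode step.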
(Tangent)
While I agree it's not relevant to the PEP 460/461 discussions: so long as numpy.loadtxt is explicitly documented as only working with latin-1 encoded files (it currently isn't), there's no problem. If it's supposed to work with other encodings (with the entire file still required to use a single consistent encoding), then it just needs "encoding" and "errors" arguments to fit the Python 3 text model, with "latin-1" documented as the default encoding. If it's intended to allow S columns to contain text in arbitrary encodings, then the current API should already support that with an adjustment to the default behaviour, since passing something like codecs.getdecoder("utf-8") as a column converter should do the right thing. However, if you're currently decoding S columns with latin-1 before passing the value to the converter, then you'll need a WSGI-style decoding dance instead:
    def fix_encoding(text):
        return text.encode("latin-1").decode("utf-8")  # For example
That's more wasteful than just passing the raw bytes through for decoding, but it's the simplest backwards-compatible option if you're doing latin-1 decoding already.
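(To make the dance concrete - this round trip works because latin-1 maps every byte value 0-255 to a code point, so the initial mis-decode is lossless; the "café" sample string is just an illustration:)

```python
# UTF-8 bytes mis-decoded as latin-1 produce mojibake, but no information
# is lost, so re-encoding as latin-1 recovers the original bytes exactly.
raw = "café".encode("utf-8")      # b'caf\xc3\xa9'
mojibake = raw.decode("latin-1")  # 'caf\xc3\xa9' as two latin-1 chars

def fix_encoding(text):
    return text.encode("latin-1").decode("utf-8")

print(fix_encoding(mojibake))  # café
```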
If different rows in the same column are allowed to have different encodings, then that's not a valid use of the operation (since the column converter has no access to the rest of the row to determine what encoding should be used for the decode operation).
Cheers, Nick.
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia