[Python-Dev] Add transform() and untranform() methods

Nick Coghlan ncoghlan at gmail.com
Sat Nov 16 10:44:51 CET 2013


On 16 Nov 2013 10:47, "Victor Stinner" <victor.stinner at gmail.com> wrote:

2013/11/16 Nick Coghlan <ncoghlan at gmail.com>:

> To address Serhiy's security concerns with the compression codecs (which are
> technically independent of the question of restoring the aliases), I also
> plan to document how to systematically blacklist particular codecs in an
> application by setting attributes on the encodings module and/or appropriate
> entries in sys.modules.

It would be simpler and safer to blacklist bytes=>bytes and str=>str codecs from bytes.decode() and str.encode() directly. Marc-Andre Lemburg proposed adding new attributes to CodecInfo to specify input and output types.

Yes, but that type compatibility introspection is a change for 3.5 at the earliest (although I commented on http://bugs.python.org/issue19619 with two alternate suggestions that I think would be reasonable to implement for 3.4).

Everything codec related that I am doing at the moment is about improving the state of 3.4 and source compatible 2/3 code. Proposals for further 3.5+ only improvements are relevant only in the sense that I don't want to lock us out from future improvements (which is why my main aim is to clarify the status quo, with the only functional changes related to restoring feature parity with Python 2 for non-Unicode codecs).
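For reference, the sys.modules approach mentioned in the quoted text at the top of this message could look roughly like the sketch below. It assumes the blocking runs before the application first looks up the codecs in question, since successful lookups are cached by the codec registry. The `BLOCKED_CODEC_MODULES` tuple and the `is_available()` helper are hypothetical names used only for illustration.

```python
import codecs
import sys

# Hypothetical application policy: codec modules this program refuses to load.
BLOCKED_CODEC_MODULES = ("bz2_codec", "zlib_codec")

for name in BLOCKED_CODEC_MODULES:
    # A None entry in sys.modules makes any later import of that name fail
    # with ImportError, which the encodings search function treats as
    # "no such codec".
    sys.modules["encodings." + name] = None


def is_available(encoding):
    """Return True if codecs.lookup() can still find the named codec."""
    try:
        codecs.lookup(encoding)
    except LookupError:
        return False
    return True
```

With the entries in place, `is_available("zlib_codec")` should report that the codec can no longer be found, while ordinary text encodings such as `"utf-8"` are unaffected.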

> The only functional change I'd still like to make for 3.4 is to restore
> the shorthand aliases for the non-Unicode codecs (to ease the migration for
> folks coming from Python 2), but this thread has convinced me I likely need
> to write the PEP before doing that, and I still have to integrate
> ensurepip into pyvenv before the beta 1 deadline.
>
> So unless you and Victor are prepared to +1 the restoration of the codec
> aliases (closing issue 7475) in anticipation of that codecs infrastructure
> documentation PEP, the change to restore the aliases probably won't be in
> 3.4. (I might get the PEP written in time regardless, but I'm not betting
> on it at this point).

Using the StackOverflow search engine, I found some posts where people ask for a "hex" codec on Python 3. There are two answers: use the binascii module or use codecs.encode(). So even if codecs.encode() was never documented, it looks like it is used. So I now agree that documenting it would not make the situation worse.
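The two answers referred to above amount to something like the following minimal sketch. It uses the full codec name "hex_codec" rather than the shorthand "hex" alias, since the availability of the alias is exactly what is under discussion here.

```python
import binascii
import codecs

data = b"abc"

# Option 1: the binascii module.
hex_via_binascii = binascii.hexlify(data)

# Option 2: the codec machinery, addressed by the full codec name.
hex_via_codecs = codecs.encode(data, "hex_codec")

# Both directions are available through the codecs module functions.
round_trip = codecs.decode(hex_via_codecs, "hex_codec")
```

Both options produce the same bytes object, and `round_trip` is equal to the original `data`.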

Aye, that was my conclusion (hence my proposal on issue 7475 back in April).

Can I take that observation as a +1 for restoring the aliases as well? (That and more efficiently rejecting the non-Unicode codecs from str.encode, bytes.decode and bytearray.decode are the only aspects of this that are subject to the beta 1 deadline - we can be a bit more leisurely when it comes to working out the details of the docs updates.)

Adding transform()/untransform() methods to bytes and str is a non-trivial change and not everybody likes them. Anyway, it's too late for Python 3.4.

In my opinion, the best option is to add new inputtype/outputtype attributes to CodecInfo right now, and modify the codecs so "abc".encode("hex") raises a LookupError (instead of a tricky error message with some evil low-level hacks on the traceback and the exception, which was my initial concern in this mail thread). It also fixes the security vulnerability.
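The user-visible behaviour being asked for can be probed with a small helper like the one below. This is only a sketch: `try_str_encode()` is a hypothetical name, and the LookupError result assumes an interpreter that rejects non-text codecs in str.encode() (CPython 3.4 and later appear to do so; older 3.x releases raised a different exception).

```python
def try_str_encode(text, encoding):
    """Return the exception class raised by str.encode(), or None on success."""
    try:
        text.encode(encoding)
    except Exception as exc:
        return type(exc)
    return None


# A bytes-to-bytes codec is rejected by str.encode()...
hex_result = try_str_encode("abc", "hex_codec")

# ...while an ordinary text encoding works as usual.
utf8_result = try_str_encode("abc", "utf-8")
```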

The C level code for catching the input type errors only looks evil because:

However, the ugliness of that code is the reason I'm intrigued by the possibility of traceback annotations as a potentially cleaner solution than trying to seamlessly wrap exceptions with a new one that adds more context information. While I think the gain in codec debuggability is worth it in this case, my concern over the complexity and the current limitations are the reason I didn't make it a public C API.

To keep backward compatibility (even with custom codecs registered manually), if inputtype/outputtype is not defined, we should consider that the codec is a classical text encoding (encode str=>bytes, decode bytes=>str).

Without an already existing ByteSequence ABC, it isn't feasible to propose and implement this completely in the 3.4 time frame (since you would need such an ABC to express the input type accepted by our Unicode and binary codecs - the only one that wouldn't need it is rot_13, since that's str->str).

However, the output types could be expressed solely as concrete types, and that's all we need for the blacklist (since we could replace the current instance check on the result with a subclass check on the specified output type (if any) prior to decoding).
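A rough sketch of that up-front check, written in Python rather than C, might look as follows. The `decoded_type` attribute and the `decode_to_text()` helper are hypothetical names invented for this illustration; no such attribute exists on CodecInfo. Codecs that do not declare it are treated as classical text encodings, in line with the backward compatibility rule quoted above.

```python
import codecs


def decode_to_text(data, encoding):
    """Decode bytes to str, refusing codecs that declare a non-str result.

    `decoded_type` is a hypothetical CodecInfo attribute. Codecs that do
    not define it are assumed to be ordinary text encodings.
    """
    info = codecs.lookup(encoding)
    declared = getattr(info, "decoded_type", None)
    # Subclass check on the declared type, done before any decoding work.
    if declared is not None and not issubclass(declared, str):
        raise LookupError("%r does not decode to str" % (encoding,))
    text, _consumed = info.decode(data)
    return text
```

For example, after annotating a codec by hand with `codecs.lookup("hex_codec").decoded_type = bytes`, a call such as `decode_to_text(b"616263", "hex_codec")` fails with LookupError before the codec runs, while `decode_to_text(b"abc", "utf-8")` still returns the decoded string.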

Cheers, Nick.
