[Python-ideas] Exploring the 'strview' concept further

Nick Coghlan ncoghlan at gmail.com
Wed Dec 7 15:53:44 CET 2011


With encouragement from me (and others), Armin Ronacher recently attempted to articulate his problems in dealing with the migration to Python 3 [1]. They're actually quite similar to the feelings I had during my early attempts at restoring the ability of the URL parsing APIs to deal directly with ASCII-encoded binary data, rather than requiring that the application developer explicitly decode it to text first [2].

Now, I clearly disagree with Armin on at least one point: there already is "one true way" to have unified text processing code in Python 3. That way is the way the Python 3.2 urllib.parse module handles it: as soon as it is handed something that isn't a string, it attempts to decode it using a default assumed encoding (specifically 'ascii', at least for now). It keeps track of whether or not the arguments were decoded from bytes and, if they were, encodes the return value on output [3]. If you're pipelining such interfaces, it's obviously more efficient to just decode once before invoking the pipeline and then (optionally) encode again at the end (just as is the case in Python 2), but you can still make your APIs largely polymorphic with respect to bytes and text without massive internal code duplication.
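To make that pattern concrete, here is a minimal sketch of the decode-on-input, coerce-on-output idea. It is only loosely modelled on what urllib.parse does; the names coerce_args and split_scheme are hypothetical and exist purely for illustration.

```python
def coerce_args(*args, encoding="ascii", errors="strict"):
    """Return the arguments as str, plus a function that coerces results back.

    Hypothetical helper: if the inputs were bytes, the returned function
    re-encodes the result; if they were already str, it is a no-op.
    """
    if all(isinstance(arg, str) for arg in args):
        return args + (lambda result: result,)
    if any(isinstance(arg, str) for arg in args):
        raise TypeError("cannot mix str and non-str arguments")
    decoded = tuple(arg.decode(encoding, errors) for arg in args)
    return decoded + (lambda result: result.encode(encoding, errors),)


def split_scheme(url):
    """Toy polymorphic API: split 'scheme:rest' for either str or bytes input."""
    url, coerce = coerce_args(url)
    # From here on, the implementation only ever deals with str.
    scheme, sep, rest = url.partition(":")
    if not sep:
        scheme, rest = "", url
    return coerce(scheme), coerce(rest)


print(split_scheme("http://example.com/a"))
print(split_scheme(b"http://example.com/a"))
```

The body of split_scheme is written once, against str, and the same code serves callers that pass bytes.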

So, that's always one of my first suggestions to people struggling with Python 3's unicode model: I ask if they have tried putting aside any concerns they may have about possible losses of efficiency, and just tried the approach of decoding on input (returning an output coercion function alongside the decoded values) and coercing on output. Python used to do this implicitly for you at every string operation (minus the 'coerce on output' part), but now it is asking that you do it manually, and decide for yourself on an appropriate encoding, instead of relying on the automatic assumption of ASCII text that is present in Python 2. (We'll leave aside the issue of platform-specific defaults in various contexts - that's a whole different question and one I'm not at all equipped to answer. I don't think I've ever even had to work on a system with any locale other than en_US or en_GB.)

Often this actually resolves their problem (since they're no longer fighting the new Unicode model, and instead embracing it), and this is why PEP 393 is going to be such a big deal when Python 3.3 is released next year. Protocol developers are right to be worried about a four-fold increase in memory usage (and the flow-on effects on CPU usage and cache misses) when going from bytes data to the UCS4 internal Unicode format used on most distro-provided Python builds for Linux. With PEP 393's flexible internal representations, the amount of memory used will be as little as possible while still allowing straightforward O(1) lookup of individual code points.
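On an interpreter that implements PEP 393 (CPython 3.3 or later is assumed here), the effect can be observed roughly as follows. The exact numbers are an implementation detail and will vary between versions, so none are quoted.

```python
import sys

# Three strings of equal length whose widest code point needs 1, 2 and 4 bytes.
ascii_text = "a" * 1000
bmp_text = "\u20ac" * 1000        # EURO SIGN, needs 2 bytes per code point
astral_text = "\U0001F600" * 1000  # outside the BMP, needs 4 bytes per code point

sizes = [sys.getsizeof(s) for s in (ascii_text, bmp_text, astral_text)]
print(sizes)  # each entry should be larger than the one before it
```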

However, that urllib.parse code also highlights another one of Armin's complaints: like much of the stdlib (and core interpreter!), it doesn't ducktype 'str'. Instead, it demands the real thing and accepts no substitutes (not even collections.UserString). This kind of behaviour is quite endemic - the coupling between the interpreter and the details of the string implementation is, in general, even tighter than that between the interpreter and the dict implementation used for namespaces.

With PEP 3118, we introduced the concept of 'memoryview' to make allowance for the fact that it is often useful to look at the same chunk of memory in multiple ways, without incurring the costs of making multiple copies. In a discussion back in June [4], I briefly mentioned the idea of a 'strview' type that would extend those concepts to providing a str-like view of a region of memory, without necessarily making a copy of the entire thing.
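As a reminder of what memoryview already offers for binary data, here is a small illustration (the request line is made up for the example). Slicing the view does not copy the underlying buffer, so a change to the buffer is visible through the slice.

```python
data = bytearray(b"GET /index.html HTTP/1.1")
view = memoryview(data)

# Slicing a memoryview yields another view onto the same memory, not a copy.
path = view[4:15]
print(bytes(path))

# Writing through the original view is visible through the slice.
view[5:10] = b"INDEX"
print(bytes(path))
```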

DISCLAIMERS:

  1. I don't know yet if this is a good idea. It may in fact be a terrible idea. I think it is, at least, an idea worth discussing further.
  2. Making this concept work may require actually classifying our codecs to some degree (for attributes like 'ASCII-compatible', 'stateless', 'fixed width', etc). That might be tedious, but doesn't seem completely infeasible.
  3. There are issues with memoryview itself that should be accounted for if pursuing this idea [5]
  4. There is an issue with CPython's operand coercion for sequence concatenation and repetition that may affect attempts to implement this idea, although you should be fine so long as you implement the number methods in addition to the sequence ones (which happens automatically for classes written in Python) [6]

So, how might a 'strview' object work?

  1. The basic construction would be "strview(object, encoding, errors)". For convenience, actual str objects would just be returned unmodified (alternatively: a factory function could be provided with that behaviour)
  2. A 'strview' wouldn't try to pass itself off as a real string for all purposes. Instead, it would support a new String ABC (more on that below).
  3. The encode() method would work like a string's normal encode() method, decoding the original object to a str, then encoding that to the desired encoding. If the encodings match, then an optimised fast path of simply calling bytes() on the underlying object would be used.
  4. If asked to index, slice or iterate over the underlying string, the strview would use the incremental decoder for the relevant codec to build an efficient mapping from code point indices to byte indices and then return real strings (various strategies for doing this have been posted to this list in the past). Alternatively, if codecs were classified to explicitly indicate when they implemented stateless fixed width encodings, then strview could simply be restricted to only working with that subset of possible encodings. The latter strategy might be needed to get around issues with stateful encodings like ShiftJIS and ITA2 - those are hard (impossible?) to index and interpret efficiently without fully decoding them and storing the result.
  5. The new type would implement the various binary operators supported by strings, promoting itself to a real string type whenever needed
  6. The new type would similarly support the full string API, returning actual string objects rather than any kind of view.
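To give the discussion something concrete to poke at, here is a rough pure-Python sketch of the points above. Everything in it is hypothetical: the StrView class, the strview factory and the table of fixed-width codecs are assumptions made for illustration, and it takes the simpler route from point 4 of only accepting stateless fixed-width encodings.

```python
import codecs
import operator


class StrView:
    """Hypothetical str-like view over a bytes-like object (sketch only)."""

    # Assumed codec classification: bytes per code point, keyed by the
    # normalised names that codecs.lookup() reports.
    _FIXED_WIDTH = {"ascii": 1, "iso8859-1": 1, "utf-32-le": 4, "utf-32-be": 4}

    def __init__(self, obj, encoding="ascii", errors="strict"):
        name = codecs.lookup(encoding).name
        if name not in self._FIXED_WIDTH:
            raise ValueError("not a stateless fixed-width encoding: %r" % encoding)
        self._view = memoryview(obj)  # assumed to be a flat buffer of bytes
        self._encoding = name
        self._errors = errors
        self._width = self._FIXED_WIDTH[name]

    def _decode(self, start, stop):
        # Decode only the requested code points and return a real str.
        w = self._width
        chunk = bytes(self._view[start * w:stop * w])
        return chunk.decode(self._encoding, self._errors)

    def __len__(self):
        return self._view.nbytes // self._width

    def __str__(self):
        return self._decode(0, len(self))

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(len(self))
            if step == 1:
                return self._decode(start, max(start, stop))
            return str(self)[index]  # extended slices: decode everything
        i = operator.index(index)
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError("strview index out of range")
        return self._decode(i, i + 1)

    def __iter__(self):
        return (self._decode(i, i + 1) for i in range(len(self)))

    def encode(self, encoding="utf-8", errors="strict"):
        if codecs.lookup(encoding).name == self._encoding:
            return bytes(self._view)  # fast path: skips decoding entirely
        return str(self).encode(encoding, errors)

    # Binary operators promote to a real str (point 5).
    def __add__(self, other):
        return str(self) + other

    def __radd__(self, other):
        return other + str(self)

    # Remaining str methods are delegated to a decoded copy (point 6).
    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(str(self), name)


def strview(obj, encoding="ascii", errors="strict"):
    """Hypothetical factory: real str objects are returned unmodified."""
    if isinstance(obj, str):
        return obj
    return StrView(obj, encoding, errors)


v = strview(b"hello world", "ascii")
print(v[0], v[6:], v.upper(), "> " + v)
```

Note that the fast path in encode() skips validation of the underlying data, which is one of the trade-offs a real design would have to settle.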

What might a String ABC provide?

For a very long time, slice indices had to be real integers - we didn't allow other "integer like" types. The reason was that floats implemented __int__, so ducktyping on that method would have allowed binary floating point numbers in functions where we didn't want to permit them. The answer, ultimately, was to introduce __index__ (and, eventually, numbers.Integral) to mark "true" integers, allowing things like NumPy scalars to be used directly as slice indices without inheriting from int.
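For anyone who hasn't run into the __index__ protocol, a small example of the precedent (the Offset class is made up for illustration):

```python
import numbers


class Offset:
    """Integer-like type that does not inherit from int."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value


letters = "abcdef"
print(letters[Offset(2)])            # accepted as an index
print(letters[Offset(1):Offset(4)])  # accepted in slices too

try:
    letters[2.0]  # floats define __int__ but not __index__
except TypeError as exc:
    print("rejected:", exc)

print(isinstance(3, numbers.Integral), isinstance(3.0, numbers.Integral))
```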

An explicit String ABC, even if not supported for performance critical core functionality like identifiers, would allow the implementation of code like that in urllib.parse to be updated to avoid keying behaviour on the concrete builtin str type - instead, it would check against the String ABC, allowing for all the usual explicit type registration goodies that ABCs support (and that make them much better for type checking than concrete types).

Just as much of the old UserDict functionality is now available on Mapping and MutableMapping, so much of the existing UserString functionality could be moved to the hypothetical String ABC.
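A bare-bones sketch of how such a check could look. The String class here is hypothetical (no such ABC exists in the stdlib), it has no abstract or mixin methods yet, and is_text is a made-up stand-in for the kind of type check a module like urllib.parse performs.

```python
import collections
from abc import ABC


class String(ABC):
    """Hypothetical ABC marking str-like types (sketch only)."""


# Explicit registration, as with the other ABCs.
String.register(str)
String.register(collections.UserString)


def is_text(value):
    """Check against the ABC rather than the concrete builtin str type."""
    return isinstance(value, String)


print(is_text("abc"), is_text(collections.UserString("abc")), is_text(b"abc"))
```

A strview type would simply be registered the same way.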

Hopefully-the-rambling-isn't-too-incoherent'ly-yours, Nick.

[1] http://lucumr.pocoo.org/2011/12/7/thoughts-on-python3/
[2] http://bugs.python.org/issue9873
[3] http://hg.python.org/cpython/file/default/Lib/urllib/parse.py#l74
[4] http://mail.python.org/pipermail/python-ideas/2011-June/010439.html
[5] http://bugs.python.org/issue10181
[6] http://bugs.python.org/issue11477

-- Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


