[Python-3000] Proposed changes to PEP3101 advanced string formatting -- please discuss and vote!
Patrick Maupin pmaupin at gmail.com
Tue Mar 13 03:47:21 CET 2007
Eric Smith and I have a reasonable first-cut of a C implementation for Talin's PEP3101 (it runs as an extension module and has been tested on Python 2.3, 2.4, and 3.0) along with some test cases. It's sort of experimental, in that it mostly implements the PEP, but also implements a few possible enhancements and changes, to give us (well, me, mostly) an idea of what works and what doesn't, and ideas about changes that might be useful.
This list of potential changes to the PEP is in order of (what I believe to be) most contentious first. (It will be interesting to contrast that with the actual votes :). I apologize for the long list, but it's quite a comprehensive PEP, and the implementation work Eric and I have done has probably encompassed the most critical examination of the PEP to date, so here goes:
Feature: Alternate syntaxes for escape to markup.
There is naturally a strong desire to resist having more than one way to do things. However, in my experience using Python to generate lots of text (which experience is why I volunteered to work on the PEP implementation in the first place!), I have found that there are different application domains which naturally lend themselves to different approaches for escape to markup. The issues are:
In some application domains, the number of {} characters in the actual character data is dizzying. This leads to two problems: first that there are an inordinate number of { and } characters to escape by doubling, and second (and even more troubling) is that it can be very difficult to locate the markup inside the text file, without an editor (and user!) that understands fancy regexps.
In other application domains, the sheer number of {} characters is not so bad (and markup can be readily noticed), but the use of {{ for { is very confusing, because { has a particular technical meaning (rather than just offsetting some natural language text), and it is hard for people to mentally parse the underlying document when the braces are doubled.
To deal with these issues, I propose having a total of 3 markup syntaxes, with one of them being the default, and with a readable, defined method for the string to declare it is using one of the other markup syntaxes. (If this is accepted, I feel that EIBTI says the string itself should declare if it is using other than the standard markup transition sequence. This also makes it easier for any automated tools to understand the insides of the strings.)
The default markup syntax is the one proposed in the initial PEP, where literal { and } characters are denoted by doubling them. It works well for many problem domains, and is the best for short strings (where it would be burdensome to have the string declare the markup syntax it is using).
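For illustration, the doubling rule can be tried with the str.format method that ships in current Python, which uses the same convention. This is only a small sketch; the template text and the names "name" and "field" are made up for the example.

```python
# Literal braces are written doubled; single braces escape to markup.
template = "struct {name} {{ int {field}; }};"

# "name" and "field" are illustrative placeholders, not part of the proposal.
result = template.format(name="point", field="x")
print(result)
```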
The second method is the well-understood ${} syntax. The $ is easy to find in a sea of { characters, and the only special handling required is that every $ must be doubled.
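As a rough stand-in for this second syntax, the standard library's string.Template already follows the same two rules (markup is introduced by $, and a literal $ is written $$). It is not the proposed formatter and does not support format specifiers; the template text below is invented for the example.

```python
from string import Template

# Braces in the C-like text need no escaping; only "$" is special.
template = Template("struct ${name} { int ${field}; };  /* cost: $$5 */")

# "name" and "field" are illustrative placeholders.
result = template.substitute(name="point", field="x")
print(result)
```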
The third method is something I came up with a couple of years ago when I was generating a lot of C code from templates. It would make any non-Python programmer blanch, because it relies on significant whitespace, but it made for very readable technical templates. With this method "{foo}" escapes to markup, but when there is whitespace after the leading "{", e.g. "{ foo}", the brace is not an escape to markup. If the whitespace is a space, it is removed from the output, but if it is '\r', '\n', or '\t', then it is left in the output. The result is that, for example, most braces in most C texts do not need to have spaces inserted after them, because they immediately precede a newline.
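The whitespace rule can be sketched in a few lines of pure Python. This is not the extension module's parser: the function name render_ws is hypothetical, fields are limited to plain names looked up in a dictionary, and treating a "}" outside a field as a literal character is an assumption, since the description above only covers the opening brace.

```python
def render_ws(template, namespace):
    """Expand {name} fields; "{" followed by whitespace is a literal brace."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch != "{":
            out.append(ch)  # ordinary text, including any stray "}"
            i += 1
            continue
        following = template[i + 1:i + 2]
        if following == " ":
            out.append("{")  # literal brace; the marker space is dropped
            i += 2
        elif following in ("\r", "\n", "\t"):
            out.append("{")  # literal brace; the whitespace stays in the output
            i += 1
        else:
            end = template.find("}", i + 1)
            if end == -1:
                raise ValueError("unterminated field at offset %d" % i)
            out.append(str(namespace[template[i + 1:end]]))
            i = end + 1
    return "".join(out)


c_template = "int f(void) {\n    return {value};\n}\n"
print(render_ws(c_template, {"value": 42}))
```

Note that the C braces in the example need no escaping at all, because each opening brace is followed directly by a newline.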
The syntaxes are similar enough that they can all be efficiently parsed by the same loop, so there are no real implementation issues. The currently contemplated method for declaring a markup syntax is by using decorator-style markup, e.g. {@syntax1} inside the string, although I am open to suggestions about better-looking ways to do this.
Feature: Automatic search of locals() and globals() for name lookups if no parameters are given.
This is contentious because it violates EIBTI. However, it is extremely convenient. To me, the reasons for allowing or disallowing this feature on 'somestring'.format() appear to be exactly the same as the reasons for allowing or disallowing this feature on eval('somestring'). Barring a distinction between these cases that I have not noticed, I think that if we don't want to allow this for 'somestring'.format(), then we should seriously consider removing the capability in Python 3000 for eval('somestring').
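To make the behaviour concrete, here is a minimal pure-Python approximation. The helper name auto_format is hypothetical, and it relies on sys._getframe, which is a CPython implementation detail; the C implementation may well find the caller's namespaces differently.

```python
import sys


def auto_format(template):
    """Format template using the caller's locals, falling back to its globals."""
    caller = sys._getframe(1)  # CPython-specific way to reach the calling frame
    namespace = dict(caller.f_globals)
    namespace.update(caller.f_locals)  # locals shadow globals, as with eval()
    return template.format_map(namespace)


def greet():
    name = "world"
    return auto_format("hello {name}")


print(greet())
```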
Feature: Ability to pass in a dictionary or tuple of dictionaries of namespaces to search.
This feature allows, in some cases, for much more dynamic code than **kwargs. (You could manually smush multiple dictionaries together to build kwargs, but that can be ugly, tedious, and slow.) Implementation-wise, this feature and locals() / globals() go hand in hand.
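In current Python a similar effect is available through str.format_map and collections.ChainMap, which search several dictionaries in order without copying them together. The sketch below uses those as a stand-in for the proposed interface; the dictionary names and contents are invented.

```python
from collections import ChainMap

# Illustrative namespaces; earlier entries in the tuple take precedence.
defaults = {"host": "localhost", "port": 8080}
overrides = {"port": 9090}
namespaces = (overrides, defaults)

# ChainMap looks each name up in the dictionaries from left to right.
result = "{host}:{port}".format_map(ChainMap(*namespaces))
print(result)
```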
Feature: Placement of a dummy record on the traceback stack for underlying errors.
There are two classes of exceptions which can occur during formatting: exceptions generated by the formatter code itself, and exceptions generated by user code (such as a field object's getattr function, or the field_hook function).
In general, exceptions generated by the formatter code itself are of the "ValueError" variety -- there is an error in the actual "value" of the format string. (This is not strictly true; for example, the string.format() function might be passed a non-string as its first parameter, which would result in a TypeError.)
The text associated with these internally generated ValueError exceptions will indicate the location of the exception inside the format string, as well as the nature of the exception.
For exceptions generated by user code, a trace record and dummy frame will be added to the traceback stack to help in determining the location in the string where the exception occurred. The inserted traceback will indicate that the error occurred at:
File "?format_string?", line X, in column_Y
where X and Y represent the line and character position of where the exception occurred inside the format string.
THIS IS A HACK!
There is currently no general mechanism for non-Python source code to be added to a traceback (which might be the subject of another PEP), but there is some precedent and some examples, for example, in PyExpat and in Pyrex, of adding non-Python code information to the traceback stack. Even though this is a hack, the necessity of debugging format strings makes this a very useful hack -- the hack means that the location information of the error within the format string is available and can be manipulated, just like the location of the error within any Python module for which source is not available.
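The location part of such a record is simple to compute. The sketch below only shows how a character offset in the format string could be turned into the X and Y of the message above; it does not build a traceback entry. The function names are hypothetical, and counting columns from zero is an assumption.

```python
def locate(format_string, offset):
    """Return (line, column) for a character offset in format_string."""
    line = format_string.count("\n", 0, offset) + 1   # lines count from 1
    last_newline = format_string.rfind("\n", 0, offset)
    column = offset - (last_newline + 1)               # columns count from 0 here
    return line, column


def describe(format_string, offset):
    """Build the pseudo-location text that the dummy record would carry."""
    line, column = locate(format_string, offset)
    return 'File "?format_string?", line %d, in column_%d' % (line, column)


sample = "a\nbc{x}"
print(describe(sample, sample.index("{")))
```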
Removed feature: Ability to dump error information into the output string.
The original PEP added this capability for the same reason I am proposing the traceback hack. OTOH, with the traceback hack in place, it would be extremely easy to write a pure-Python wrapper for format that would trap an exception, parse the traceback, and place the error message at the appropriate place in the string. (For a bit more effort, the exception wrapper could re-call the formatter with the remainder of the string if you really wanted to see ALL the errors, but this is actually much more functionality than you get with Python itself for debugging, so while I see the usefulness, I don't know if it justifies the effort.)
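A wrapper of roughly this kind can be sketched with the string.Formatter class from today's standard library. Note that it takes a different route from the one described above: instead of parsing a traceback, it formats one field at a time and catches the exception for each. The function name and the "<Error: message>" output style are invented, and the sketch handles only explicitly named or numbered fields without nested specifiers.

```python
import string


def format_with_errors(template, *args, **kwargs):
    """Format template, writing error text in place of fields that fail."""
    formatter = string.Formatter()
    out = []
    for literal, field, spec, conversion in formatter.parse(template):
        out.append(literal)
        if field is None:  # trailing literal text with no field after it
            continue
        try:
            value, _ = formatter.get_field(field, args, kwargs)
            value = formatter.convert_field(value, conversion)
            out.append(formatter.format_field(value, spec))
        except Exception as exc:
            # Report the failure where the field's output would have gone.
            out.append("<%s: %s>" % (type(exc).__name__, exc))
    return "".join(out)


print(format_with_errors("{a} and {b}", a=1))
```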
Feature: Addition of functions and "constants" to string module.
The PEP proposes doing everything as string methods, with a "cformat" method allowing some access to the underlying machinery. I propose only having a 'format' method of the string (unicode) type, and a corresponding 'format' and extended 'flag_format' function in the string module, along with definitions for the flags for access to non-default underlying format machinery.
Feature: Ability for "field hook" user code function to only be called on some fields.
The PEP takes an all-or-nothing approach to the field hook -- it is either called on every field or no fields. Furthermore, it is only available for calling if the extended function ('somestring'.cformat() in the spec, string.flag_format() in this proposal) is called. The proposed change keeps this functionality, but also adds a field type specifier 'h' which causes the field hook to be called as needed on a per-field basis. This latter method can even be used from the default 'somestring'.format() method.
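The per-field idea can be imitated with a string.Formatter subclass from the current standard library. Everything here is an assumption made for illustration: the class name, the hook's signature, and the choice to recognise 'h' as a trailing type character in the specifier.

```python
import string


class HookFormatter(string.Formatter):
    """Call a user hook only for fields whose specifier ends in 'h'."""

    def __init__(self, field_hook):
        self.field_hook = field_hook  # hypothetical signature: hook(value, spec)

    def format_field(self, value, format_spec):
        if format_spec.endswith("h"):
            # Hand the rest of the specifier to the hook.
            return self.field_hook(value, format_spec[:-1])
        return super().format_field(value, format_spec)


def shout(value, spec):
    return format(str(value).upper(), spec)


formatter = HookFormatter(shout)
print(formatter.format("{0:h} {1}", "a", "b"))
```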
Changed feature: By default, not using all arguments is not an exception
The original PEP says that, by default, the formatting operation should check that all arguments are used.
In the original % operator, the exception reporting that an extra argument is present is symmetrical to the exception reporting that not enough arguments are present. Both these errors are easy to commit, because it is hard to count the number of arguments and the number of % specifiers in your string and make sure they match. In theory, the new string formatting should make it easier to get the arguments right, because all arguments in the format string are numbered or even named, and with the new string formatting, the corresponding error is that a specific argument is missing, not just "you didn't supply enough arguments."
Also, it is arguably not Pythonic to require a check that all arguments to a function are actually used by the execution of the function (but see interfaces!), and format() is, after all, just another function. So it seems that the default should be to not check that all the arguments are used. In fact, there are similar reasons for not using all the arguments here as with any other function. For example, for customization, the format method of a string might be called with a superset of all the information which might be useful to view.
Because the ability to check that all arguments are used might be useful (and because I don't want to be accused of leaving this feature out because of laziness :), this feature remains available if string.flag_format() is called.
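The two behaviours can be compared using current Python, where str.format ignores extra arguments and string.Formatter exposes a check_unused_args method that subclasses may override. The StrictFormatter class below is a sketch of the opt-in check, not the string.flag_format() interface proposed here.

```python
import string


class StrictFormatter(string.Formatter):
    """Raise if any positional or keyword argument was never referenced."""

    def check_unused_args(self, used_args, args, kwargs):
        unused = [i for i in range(len(args)) if i not in used_args]
        unused += [k for k in kwargs if k not in used_args]
        if unused:
            raise ValueError("unused arguments: %r" % (unused,))


# Default behaviour: the extra argument is silently ignored.
lenient = "{0}".format("a", "b")

# Opt-in behaviour: the extra argument is reported.
try:
    StrictFormatter().format("{0}", "a", "b")
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

print(lenient, strict_error)
```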
Feature: Ability to insert non-printing comments in format strings
This feature is implemented in a very intuitive way, e.g. " text {# your comment here} more text" (example shown with the default transition to markup syntax). One of the nice benefits of this feature is the ability to break up long source lines (if you have lots of long variable names and attribute lookups).
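One way to experiment with the idea in current Python is to strip comment fields before handing the string to str.format. This is a simplification and not how the extension module does it: the helper name is hypothetical, comments may not contain braces, and a doubled brace directly before "#" would be misread.

```python
import re

# A comment field: "{#", then any text without braces, then "}".
_COMMENT = re.compile(r"\{#[^{}]*\}")


def format_with_comments(template, *args, **kwargs):
    """Remove {# ...} comment fields, then format the remainder normally."""
    return _COMMENT.sub("", template).format(*args, **kwargs)


# A comment that swallows a newline lets a long template span source lines.
template = "{first}{# the line break inside this comment vanishes\n}{second}"
print(format_with_comments(template, first="a", second="b"))
```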
Feature: Exception raised if attribute with leading underscore accessed.
The syntax supported by the PEP is deliberately limited in an attempt to increase security. This is an additional security measure, which is on by default, but can be optionally disabled if string.flag_format() is used instead of 'somestring'.format().
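A check of this kind can be sketched as a string.Formatter subclass in current Python. The class name is hypothetical, and the test on the field name is deliberately crude: it rejects any "._" sequence, so it would also refuse an index key that happens to contain those characters.

```python
import re
import string


class SafeFormatter(string.Formatter):
    """Refuse attribute lookups whose name starts with an underscore."""

    def get_field(self, field_name, args, kwargs):
        if re.search(r"\._", field_name):
            raise ValueError(
                "access to underscore attribute refused: %r" % field_name)
        return super().get_field(field_name, args, kwargs)


formatter = SafeFormatter()
print(formatter.format("{0.real}", 3))  # ordinary attribute: allowed

try:
    formatter.format("{0.__class__}", 3)
    refused = False
except ValueError:
    refused = True
```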
Feature: Support for "center" alignment.
The field specifier uses "<" and ">" for left and right alignment. This adds "^" for center alignment.
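The three alignment characters behave as follows in the str.format that ships in current Python, which accepts "^" for centering; the width of 9 and the "*" fill character are chosen only for the example.

```python
# Left, right, and center alignment in a field nine characters wide.
left = "{0:<9}".format("abc")
right = "{0:>9}".format("abc")
center = "{0:^9}".format("abc")

# A fill character may precede the alignment character.
starred = "{0:*^9}".format("abc")

print(repr(left), repr(right), repr(center), repr(starred))
```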
Feature: support of earlier versions of Python
(This was contemplated but not mandated in the earlier PEP, but implementation has proven to be quite easy.)
This should work with both string and unicode objects in 2.6, and should be available as a separate compilable extension module for versions back to 2.3 (functions only -- no string method support).
Feature: no global state
The original PEP specified a global error-handling mode, which controlled a few capabilities. The current implementation has no global state -- any non-default options are accessed by using the string.flag_format() function.
Thanks in advance for any feedback you can provide on these changes.
Regards, Pat