[Python-Dev] PEP 3101 implementation vs. documentation

Ben Wolfson wolfson at gmail.com
Fri Jun 10 23:15:54 CEST 2011

Hello,

I'm writing because discussion in a bug report I submitted (<http://bugs.python.org/issue12014>) has suggested that, insofar as at least part of the issue revolves around the interpretation of PEP 3101, that aspect belonged on python-dev. In particular, I was told that the PEP, not the documentation, is authoritative. Since I'm the one who thinks something is wrong, it seems appropriate for me to be the one to bring it up.

Basically, the issue is that the current behavior of str.format is at variance at least with the documentation <http://docs.python.org/library/string.html#formatstrings>, is almost certainly at variance with PEP3101 in one respect, and is in my opinion at variance with PEP3101 in another respect as well, regarding what characters can be present in what the grammar given in the documentation calls an element_index, that is, the bit between the square brackets in "{0.attr[idx]}".
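To make the terminology concrete, here is a minimal sketch that uses string.Formatter from the standard library to split a replacement field into its parts. The Holder class and its contents are hypothetical, invented only for this illustration.

```python
import string


class Holder:
    """Hypothetical object whose attribute supports indexing."""
    attr = {"idx": "value"}


template = "{0.attr[idx]!r:>10}"

# Formatter.parse yields (literal_text, field_name, format_spec, conversion)
# tuples; the element_index "idx" stays inside the field_name at this stage.
parsed = list(string.Formatter().parse(template))
print(parsed)

# In a plain lookup, "idx" is used as the string key "idx".
simple = "{0.attr[idx]}".format(Holder())
print(simple)
```

Here "0" is the arg_name, "attr" the attribute_name, and "idx" the element_index.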

Both discovering the current behavior and interpreting the documentation are pretty straightforward; interpreting what the PEP actually calls for is more vexed. I'll do the first two things first. TOC for the remainder:

What does the current implementation do?
What does the documentation say?
What does the PEP say? [this part is long, but the PEP is not clear, and I wanted to be thorough]
Who cares?

What does the current implementation do?

Suppose you have this dictionary:

d = {"@": 0, "!": 1, ":": 2, "^": 3, "}": 4, "{": {"}": 5}, }

Then the following expressions have the following results:

    (a) "{0[@]}".format(d)    --> '0'
    (b) "{0[!]}".format(d)    --> ValueError: Missing ']' in format string
    (c) "{0[:]}".format(d)    --> ValueError: Missing ']' in format string
    (d) "{0[^]}".format(d)    --> '3'
    (e) "{0[}]}".format(d)    --> ValueError: Missing ']' in format string
    (f) "{0[{]}".format(d)    --> ValueError: unmatched '{' in format
    (g) "{0[{][}]}".format(d) --> '5'

Given (e) and (f), I think (g) should be a little surprising, though you can probably guess what's going on and it's not hard to see why it happens if you look at the source: (e) and (f) fail because MarkupIterator_next (in Objects/stringlib/string_format.h) scans through the string looking for curly braces, because it treats them as semantically significant no matter what context they occur in. So, according to MarkupIterator_next, the first right curly brace in (e) indicates the end of the replacement field, giving "{0[}". In (f), the second left curly brace indicates (to MarkupIterator_next) the start of a new replacement field, and since there's only one right curly brace, it complains. In (g), MarkupIterator_next treats the second left curly brace as starting a new replacement field and the first right curly brace as closing it. However, actually, those braces don't define new replacement fields, as indicated by the fact that the whole expression treats the element_index fields as just plain old strings. (So the current implementation is somewhat schizophrenic, acting at one point as if the braces have to be balanced because '{[]}' is a replacement field and at another point treating the braces as insignificant.)

The explanation for (b) and (c) is that parse_field (same source file) treats ':' and '!' as indicating the end of the field_name section of the replacement field, regardless of whether those characters occur within square brackets or not.

So, that's what the current implementation does, in those cases.
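For anyone who wants to check these cases against their own interpreter, a small helper along the following lines should do. The name try_format is mine, and the outcomes for (b), (c), (e), (f) and (g) may well differ between Python versions, so the sketch reports whatever happens rather than assuming the results listed above.

```python
d = {"@": 0, "!": 1, ":": 2, "^": 3, "}": 4, "{": {"}": 5}}

cases = [
    ("a", "{0[@]}"),
    ("b", "{0[!]}"),
    ("c", "{0[:]}"),
    ("d", "{0[^]}"),
    ("e", "{0[}]}"),
    ("f", "{0[{]}"),
    ("g", "{0[{][}]}"),
]


def try_format(fmt, *args, **kwargs):
    """Return the repr of the formatted string, or a description of the error."""
    try:
        return repr(fmt.format(*args, **kwargs))
    except Exception as exc:  # diagnostic helper: report any failure
        return "%s: %s" % (type(exc).__name__, exc)


results = {label: try_format(fmt, d) for label, fmt in cases}
for label, fmt in cases:
    print("(%s) %-12s --> %s" % (label, fmt, results[label]))
```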

What does the documentation say?

The documentation gives a grammar for replacement fields:

"""
replacement_field ::=  "{" [field_name] ["!" conversion] [":" format_spec] "}"
field_name        ::=  arg_name ("." attribute_name | "[" element_index "]")*
arg_name          ::=  [identifier | integer]
attribute_name    ::=  identifier
element_index     ::=  integer | index_string
index_string      ::=  <any source character except "]"> +
conversion        ::=  "r" | "s"
format_spec       ::=  <described in the next section>
"""

Given this definition of index_string, all of (a)--(g) should be legal, and the results should be the strings '0', '1', '2', '3', '4', "{'}': 5}", and '5'. There is no need to exclude ':', '!', '}', or '{' from the index_string field; allowing them creates no ambiguity, because the field is delimited by square brackets.
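The results the grammar implies can be computed by hand. The following is only a sketch of that reading: the function name expected is hypothetical, it handles nothing but fields of the shape "{0[key][key]...}", and it is not how str.format is implemented.

```python
import re

d = {"@": 0, "!": 1, ":": 2, "^": 3, "}": 4, "{": {"}": 5}}


def expected(fmt, arg):
    """Resolve "{0[k1][k2]...}" taking index_string as anything except ']'."""
    match = re.fullmatch(r"\{0((?:\[[^\]]+\])+)\}", fmt)
    if match is None:
        raise ValueError("only fields of the form {0[key]...} are handled")
    value = arg
    for key in re.findall(r"\[([^\]]+)\]", match.group(1)):
        # A key made up entirely of digits is treated as an integer index.
        value = value[int(key) if key.isdigit() else key]
    return format(value, "")


formats = ["{0[@]}", "{0[!]}", "{0[:]}", "{0[^]}",
           "{0[}]}", "{0[{]}", "{0[{][}]}"]
outcomes = [expected(fmt, d) for fmt in formats]
print(outcomes)
```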

Tangent: the definition of attribute_name here also does not describe the current behavior ("{0. ;}".format(x) works fine and will call getattr(x, " ;")) and the PEP does not require the attribute_name to be an identifier; in fact it explicitly states that the attribute_name doesn't need to be a valid Python identifier. attribute_name should read (to reflect something like actual behavior, anyway) "<any source character except '[', '.', ':', '!', '{', or '}'> +". The same goes for arg_name (with the integer alternation). Observe:

    >>> x = lambda: None
    >>> setattr(x, ']]', 3)
    >>> "{].]]}".format(**{"]":x})   # (h)
    '3'

One can also presently do this (hence "something like actual behavior"):

    >>> setattr(x, 'f}', 4)
    >>> "{a{s.f}}".format(**{"a{s":x})
    '4'

But not this:

    >>> "{a{s.funcname}".format(**{"a{s":x})

as it raises a ValueError, for the same reason that explains (g) above.

What does the PEP say?

Well... It's actually hard to tell!

Summary: The PEP does not contain a grammar for replacement fields, and is surprisingly nonspecific about what can appear where, at least when talking about the part of the replacement field before the format specifier. The most reasonable interpretation of the parts of the PEP that seem to be relevant favors the documentation, rather than the implementation.

This can be separated into two sub-questions.

A. What does the PEP say about ':' and '!'?

It says two things that pertain to element_index fields.

The first is this:

"""
The rules for parsing an item key are very simple. If it starts with a digit, then it is treated as a number, otherwise it is used as a string.

Because keys are not quote-delimited, it is not possible to specify arbitrary dictionary keys (e.g., the strings "10" or ":-]") from within a format string.
"""

So it notes that some things can't be used:

1. Because anything composed entirely of digits is treated as a number, you can't get a string composed entirely of digits. Clear enough.
2. What's the explanation for the second example, though? Well, you can't have a right square bracket in the index_string, so that would already mean that you can't do this: "{0[:-]]}".format(...) regardless of whether colons are legal or not. So, although the PEP gives an example of a string that can't appear in the element_index part of a replacement field, and that string contains a colon, that string would have been disallowed for other reasons anyway.

The second is this:

""" The str.format() function will have a minimalist parser which only attempts to figure out when it is "done" with an identifier (by finding a '.' or a ']', or '}', etc.). """

This requires some interpretation. For one thing, the contents of an element_index aren't identifiers. For another, it's not true that you're done with an identifier (or whatever) whenever you see any of those; it depends on context. When parsing this: "{0[a.b]}" the parser should not stop at the '.'; it should keep going until it reaches the ']', and that will give the element_index. And when parsing this: "{0.f]oo[bar].baz}", it's done with the identifier "f]oo" when it reaches the '[', not when it reaches the second '.', and not when it reaches the first ']', either (recall (h)). The "minimalist parser" is, I take it, one that works like this: when parsing an arg_name you're done when you reach a '[', a ':', a '!', a '.', a '{', or a '}'; the same rules apply when parsing an attribute_name; when parsing an element_index you're done when you see a ']'.
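That reading of the "minimalist parser" can be sketched in a few lines. This is my own illustration of the rule just stated, not the code in string_format.h and not the attached patch; scan_field_name and NAME_STOPS are hypothetical names, and the sketch does not validate what follows a ']'.

```python
# Characters that end an arg_name or attribute_name under this reading.
NAME_STOPS = set("[:!.{}")


def scan_field_name(text):
    """Scan `text` (everything after the opening '{') for a field_name.

    Returns (arg_name, parts, pos): parts is a list of ("attr", name) or
    ("index", key) pairs, and pos is the offset of the character that
    ended the field_name (e.g. '}', ':' or '!').
    """
    def read_name(pos):
        start = pos
        while pos < len(text) and text[pos] not in NAME_STOPS:
            pos += 1
        return text[start:pos], pos

    arg_name, pos = read_name(0)
    parts = []
    while pos < len(text) and text[pos] in ".[":
        if text[pos] == ".":
            name, pos = read_name(pos + 1)
            parts.append(("attr", name))
        else:
            # Inside brackets only ']' is significant.
            end = text.find("]", pos + 1)
            if end == -1:
                raise ValueError("Missing ']' in format string")
            parts.append(("index", text[pos + 1:end]))
            pos = end + 1
    return arg_name, parts, pos


print(scan_field_name("0[a.b]}"))
print(scan_field_name("0.f]oo[bar].baz}"))
print(scan_field_name("0[!]:>4}"))
```

Under this scheme ':', '!', '{' and '}' inside square brackets are ordinary characters, which is all that (b), (c), (e) and (f) need.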

Now as regards the curly braces there are some other parts of the PEP that perhaps should be taken into account (see below), but I can find no justification for excluding ':' and '!' from the element_index field; the bit quoted above about having a minimalist parser isn't a good justification for that, and it's the only part of the entire PEP that comes close to addressing the question.

B. What does it say about '}' and '{'?

It still doesn't say much explicitly. There's the quotation I just gave, and then these passages:

""" Brace characters ('curly braces') are used to indicate a replacement field within the string:

[...]

Braces can be escaped by doubling:

"""

Taken by itself, this would suggest that wherever there's an unescaped brace, there's a replacement field. That would mean that the current implementation's behavior is correct in (e) and (f) but incorrect in (g). However, I think this is a bad interpretation; unescaped braces can indicate the presence of a replacement field without that meaning that within a replacement field braces have that meaning, no matter where within the replacement field they occur.

Later in the PEP, talking about this example:

    "{0:{1}}".format(a, b)

We have this:

""" These 'internal' replacement fields can only occur in the format specifier part of the replacement field. Internal replacement fields cannot themselves have format specifiers. This implies also that replacement fields cannot be nested to arbitrary levels.

Note that the doubled '}' at the end, which would normally be
escaped, is not escaped in this case.  The reason is because
the '{{' and '}}' syntax for escapes is only applied when used
*outside* of a format field.  Within a format field, the brace
characters always have their normal meaning.

"""

The claim "within a format field, the brace characters always have their normal meaning" might be taken to mean that within a replacement field, brace characters always indicate the start (or end) of a replacement field. But the PEP at this point is clearly talking about the formatting section of a replacement field---the part that follows the ':', if present. ("Format field" is nowhere defined in the PEP, but it seems reasonable to take it to mean "the format specifier of a replacement field".) However, it seems most reasonable to me to take "normal meaning" to mean "just a brace character".

Note that the present implementation only kinda sorta conforms to the PEP in this respect:

    >>> import datetime
    >>> format(datetime.datetime.now(), "{{%Y")
    '{{2011'
    >>> "{0:{{%{1}}".format(datetime.datetime.now(), 'Y')    # (i)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: unmatched '{' in format
    >>> "{0:{{%{1}}}}".format(datetime.datetime.now(), 'Y')  # (j)
    '{2011}'

Here the brace characters in (i) and (j) are treated, again in MarkupIterator_next, as indicating the start of a replacement field. In (i), this leads the function to throw an exception; since they're balanced in (j), processing proceeds further, and the doubled braces aren't treated as indicating the start or end of a replacement field---because they're escaped. Given that the format spec part of a replacement field can contain further replacement fields, this is, I think, correct behavior, but it's not easy to see how it comports with the PEP, whose language is not very exact.

The call to the built-in format() bypasses the mechanism that leads to these results.
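The difference between the two paths can be seen with a fixed date, so that the output is reproducible. This is just a sketch: the variable names are mine, and it deliberately leaves out (i) and (j), whose behavior is the point in dispute.

```python
import datetime

when = datetime.datetime(2011, 6, 10, 23, 15, 54)

# Built-in format(): the spec goes straight to datetime.__format__, so the
# braces are never scanned by str.format and pass through as literal text.
direct = format(when, "{{%Y")
print(direct)

# str.format(): a replacement field nested in the format spec is expanded
# first, so the spec actually handed to the object here is "%Y".
nested = "{0:{1}}".format(when, "%Y")
print(nested)

# The same mechanism, building the spec ">6" from two nested fields.
padded = "{0:{1}{2}}".format(42, ">", 6)
print(repr(padded))
```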

The PEP is very, very nonspecific about the parts of the replacement field that precede the format specifier. I don't know what kind of discussion surrounded the drawing up of the grammar that appears in the documentation, but I think that it, and not the implementation, should be followed.

The implementation only works the way it does because of parsing shortcuts: it raises ValueErrors for (b) and (c) because it generalizes something true of the attribute_name field (encountering a ':' or '!' means one has moved on to the format_spec or conversion part of the replacement field) to the element_index field. And it raises an error for (e) and (f), but lets (g) through, for the reasons already mentioned. It is, in that respect, inconsistent; it treats the curly brace as having one semantic significance at one point and an entirely different significance at another point, so that it does the right thing in the case of (g) entirely by accident. There is, I think, no way to construe the PEP so that it is reasonable to do what the present implementation does in all three cases (if "{" indicates the start of a replacement field in (f), it should do so in (g) as well); I think it's actually pretty difficult to construe the PEP in any way that makes what it does in the case of (e) and (f) correct.

Who cares?

Well, I do. (Obviously.) I even have a use case: I want to be able to put arbitrary (or as close to arbitrary as possible) strings in the element_index field, because I've got some objects that (should!) enable me to do this:

p.say("I'm warning you, {e.red.underline[don't touch that!]}")

and have this written ("e" for "effects") to p.out:

I'm warning you, \x1b[31m\x1b[4mdon't touch that!\x1b[0m

I have a way around the square bracket problem, but it would be quite burdensome to have to deal with all of !:{} as well; enough that I would fall back on something like this:

"I'm warning you, {0}".format(e.red.underline("don't touch that!"))

or some similar interpolation-based strategy, which I think would be a shame, because of the way it chops up the string.
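To give an idea of the kind of object I mean, here is a stripped-down sketch. The Effect class is hypothetical and much simpler than the real thing; the example text avoids '!', ':', '{' and '}' so that it does not depend on the behavior under discussion.

```python
class Effect:
    """Hypothetical 'effects' object: attributes add ANSI codes, indexing wraps text."""

    CODES = {"red": "\x1b[31m", "underline": "\x1b[4m"}
    RESET = "\x1b[0m"

    def __init__(self, prefix=""):
        self.prefix = prefix

    def __getattr__(self, name):
        # Only called for names not found by normal lookup, e.g. e.red.
        try:
            return Effect(self.prefix + self.CODES[name])
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, text):
        # e.red["text"] and, inside a format string, {e.red[text]}
        return self.prefix + str(text) + self.RESET

    # Allow the fallback spelling e.red.underline("text") as well.
    __call__ = __getitem__


e = Effect()
inline = "I'm warning you, {e.red.underline[careful]}".format(e=e)
fallback = "I'm warning you, {0}".format(e.red.underline("careful"))
print(repr(inline))
```

Both spellings give the same string; the first keeps the sentence in one piece, which is the whole attraction.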

But I also think the present behavior is extremely counterintuitive, unnecessarily complex, and illogical (even unpythonic!). Isn't it bizarre that (g) should work, given what (e) and (f) do? Isn't it strange that (b) and (c) should fail, given that there's no real reason for them to do so---no ambiguity that has to be avoided? And something's gotta give; the documentation and the implementation do not correspond.

Beyond the counterintuitiveness of the present implementation, it is also, I think, self-inconsistent. (e) and (f) fail because the interior brace is treated as starting or ending a replacement field, even though interior replacement fields aren't allowed in that part of a replacement field. (g) succeeds because the braces are balanced: they are treated at one point as if they were part of a replacement field, and at another (correctly) as if they are not. But this makes the failure of (e) and (f) unaccountable. It would not have been impossible for the PEP to allow interior replacement fields anywhere, and not just in the format spec, in which case we might have had this:

    (g')  "{0[{][}]}".format(range(10), **{'][':4}) --> '3'

or this:

    (g'') "{0[{][}]}".format({'4':3}, **{'][':4})   --> '3'

or something with that general flavor.

As far as I can tell, the only way to consistently maintain that (e) and (f) should fail requires that one take (g') or (g'') to be correct: either the interior braces signal replacement fields (hence must be balanced) or they don't (or they're escaped).

Regarding the documentation, it could of course be rendered correct by changing it, rather than the implementation. But wouldn't it be tremendously weird to have to explain that, in the part of the replacement field preceding the conversion, you can't employ any curly braces, unless they're balanced, and you can't employ ':' or '!' at all, even though they have no semantic significance? So these are out:

    {0[{].foo}
    {0[}{}]}
    {0[a:b]}

But these are in:

    {0[{}{}]}
    {0[{{].foo.}}}    (k)

((k) does work, if you give it an object with the right structure, though it probably should not.)

And, moreover, someone would then have to actually change the documentation, whereas there's a patch already, attached to the bug report linked way up at the top of this email, that makes (a)--(g) all work, leaves (i) and (j) as they are, and has the welcome side-effect of making (k) not work (if any code anywhere relies on (k) or things like it working, I will be very surprised: anyway the fact that (k) works is, technically, undocumented). It is also quite simple. It doesn't affect the built-in format(), because the built-in format() is concerned only with the format-specifier part of the replacement field and not all the stuff that comes before that, telling str.format what object to format.

Thanks for reading,

Ben Wolfson

"Human kind has used its intelligence to vary the flavour of drinks, which may be sweet, aromatic, fermented or spirit-based. ... Family and social life also offer numerous other occasions to consume drinks for pleasure." [Larousse, "Drink" entry]
