Design note: Unambiguous parsing (original) (raw)

Q: How does Cpp2 parse things that seem ambiguous? For example, template-argument has the two productions expression and id-expression, and wouldn't an identifier match both of those? Or what about `a<b,c>d`, doesn't that depend on whether `a` is a template or a comparable object?

A: First match wins, there are no comma-expressions, and a relational comparison in a template-argument must be parenthesized

Let's take these one at a time. First:

For example, template-argument has the two productions expression and id-expression, and wouldn't an identifier match both of those?

In Cpp2, my intent is that the productions are tried in order. By deterministically taking the first match, we can eliminate any ambiguity when input could match more than one production. In the template-argument example, this means that "if it can be an expression, it is." (Similarly, statement has several productions, and some program text could match more than one production; the parsing rule is "if it can be a some earlier production, it is.") (***)

Here's the parse.h code for template-argument:

if (auto e = expression(false)) {   // disallow unparenthesized relational comparisons in template args
    term.arg = std::move(e);
}
else if (auto i = id_expression()) {
    term.arg = std::move(i);
}

Note that this snippet's comment also answers the second example... next:

Or what about a<b,c>d, doesn't that depend on whether a is a template or a comparable object?

Ah, templates and commas and relational comparisons, oh my! In today's C++ these are a delightful (let's call it that) parse because of the comma operator (note this can be visually rather than actually ambiguous) and the overloading of the < and > tokens for template parameter/argument lists and relational comparisons.

In Cpp2 my intent is to make this a non-problem (famous last words!) for both actual and visual ambiguity:

Cpp2 has no comma operator, so that removes the possibility of b,c visually being a comma-expression.
Cpp2 requires a relational comparison expression in a template-argument-list to be parenthesized. (*)

The reason I deliberately disallow unparenthesized relational comparisons in template arguments is as a simple way to avoid the many subtle issues that have complicated today's C++ parsers in this area.(**) For example, a regularly cited one is this minor variation of the above example:

In Cpp2 that won't compile as is, because a<b> parses as a template-id, which can't be followed by an identifier like c. But it works fine with the required parentheses:

a < ( b > c ) , d >   // ok

which parses a<...> as a template-id with two arguments, b>c and d.

(*) Template `<` `>` syntax, other options considered

I also considered other options.

For a while I tried making template parameter/argument lists use [ and ] instead of < and >. This had some theoretical benefits:

It avoided the overloaded uses of < > for relational comparison.
[ ] are balanced in the language (outside comments and literals) whereas < > are not because of the tokens' additional use as relational operators, and using balanced tokens enables lazy parsing to skip sections more easily (as D does, and as cppfront itself does to rely on balanced { } to skip Cpp1 code sections without having to fully parse them).

However, eventually I decided < > was better for two reasons:

It makes Cpp2 much easier for cppfront and for programmers, by letting both easily distinguish the meaning of a<b> vs a[b] locally without having to go off and do name lookup (in the compiler or by manual scrolling around) to figure out whether a is the name of a template or of an object that overloaded operator[].
It reduced the delta with today's Cpp1 syntax, which I want to minimize unless there's a compelling reason to be different (and here there isn't).

Also, I know that some languages put a prefix on template/generic argument lists (e.g., !). By itself, that doesn't solve the problems here if we still use < and >, and (admittedly a matter of my personal taste) I considered it a bit ugly, so I didn't want to use it except as a last resort if nothing else worked. I think the current solution is better for Cpp2... other solutions may well be better for other languages, so this is not at all a diss of other languages' choices!

(**) Why `<` `>` `,` make parsing today's C++ hard

Even though in today's C++ grammar many of challenging cases aren't actually ambiguous, they still surely are:

visually ambiguous to programmers, so I want to improve them under the goal of improving "simplicity"; and
a known pain in the posterior to C++ parser writers, so I want to improve them under the goal of improving "toolability".

This comes up even if you're not writing a full C++ parser, just an approximate "tag" parser that tries to be much faster than an accurate C++ parser. In the past decade I've seen several efforts where people write new C++ parsers of both kinds, accurate and approximate, and each parser author I've talked to raised this class of example. The problem examples are just common enough in real code that even an approximate parser has to do something sensible with them.

Note these complications don't arise in C#, even though C# also uses < and > for generics. That's because C# doesn't have non-type template arguments (or the capability to do type computations that would involve relational comparison) and so a relational expression naturally cannot occur in a generic argument in C#. C# also doesn't have comma-expressions. So there seems to be some prior art experience that simply banning (unparenthesized) relational comparisons inside template argument lists, and for good (visual) measure banning comma-expressions everywhere, would similarly sidestep the issue in Cpp2.

(***) Parsing issues in today's C++

The "if it can be an expression, it is" disambiguation I describe above for Cpp2 (see where this note is referenced) is completely unlike today's vexing parse "if it can be a declaration, it is" disambiguation, which is truly awful because it requires directing the parse production selection based on whether a name is a type or not... which means today it requires name lookup (semantic analysis!) to decide the parse, which is totally backward from the proper "lex -> parse -> sema" order with all steps independent.

If you're not familiar with issues in today's C++, here are two examples.

Basic vexing parse example (today's C++, not Cpp2)

Here's my standard vexing parse example... see Godbolt link:

// Illustrate sema (name lookup) changing parsing in today's C++
// Note that #if 1 vs. #if 0 give totally different parse trees
// https://godbolt.org/z/eoo53hxje

struct a {};

#if 1
    a c;
#else
    struct c {};
#endif

int main() {
    a b(c);
    if constexpr( std::is_function_v<decltype(b)> ) {
        std::cout << "it's a function\n";
    }
    else {
        std::cout << "it's not a function\n";
    }
}

// #if 1, prints "it's not a function"

// #if 0, prints "it's a function"

Thomas Neumann's templates-and-commas parse change example (today's C++, not Cpp2)

Thomas Neumann gives the following similar example that illustrates fun with comma operators and templates... see Godbolt link:

// Illustrate sema (name lookup) changing parsing in today's C++
// Note that #if 1 vs. #if 0 give totally different parse trees
// https://godbolt.org/z/4Kzvz6xoP

#if 1
    template <int T, int T2>
    struct foo {};
#else
    constexpr int foo = 2;
    constexpr int a = 3;
#endif

int main() {
    foo<1,2> a;
}

// #if 1, gcc and Clang emit this warning:
//  unused variable 'a'

// #if 0, gcc and Clang emit this warning:
//   left operand of comma operator has no effect

Today, code like a b(c); and foo<1,2> a; is unambiguous, but very difficult to figure out (in roughly increasing order of importance):

for the compiler, which is required to be a Time Lord (or use a crystal ball) to jump into the should-be-future sema phase to figure out how to guide the previous parse phase and then jump back to the should-be-past unfinished parse phase;
for tools, such as a refactoring tool that wants to move main's line of code to somewhere else, but where the identical code can not only change meaning due to name lookup and the names in scope (which can happen sometimes in any language) but actually have a totally different language parse meaning (which goes far beyond the 'identical code could change meaning elsewhere' problem in most languages); and
(most importantly) for the human reading the code, who has to do the same non-local go-look-elsewhere-to-see-what-this-identifier-is before they can know what the code they just wrote even means.

Cpp2 strictly avoids this, and never requires sema to guide parsing. When I say "context-free parsing" this is the primary thing I have in mind... the compiler (and the human) can always parse the current piece of code without getting non-local information from elsewhere.

Aside: I've heard people complain about unbounded lookahead as another parsing issue in today's C++, but that just isn't a major problem in practice in my experience. Sure, it's always nice to reduce lookahead, but lookahead doesn't create major problems for compilers, tools, or humans to anywhere near the degree that (a) preprocessor macros and (b) parse-sema inversion/entanglement do. When I hear from people who are trying to write new parsers or new refactoring tools for C++, I regularly hear them bemoan macros and parse-sema inversion/entanglement, but I don't remember ever hearing them complain about lookahead. YMMV.