Design note: Unambiguous parsing (original) (raw)

Q: How does Cpp2 parse things that seem ambiguous? For example, template-argument has the two productions expression and id-expression, and wouldn't an identifier match both of those? Or what about a<b,c>d, doesn't that depend on whether a is a template or a comparable object?

A: First match wins, there are no comma-expressions, and a relational comparison in a template-argument must be parenthesized

Let's take these one at a time. First:

For example, template-argument has the two productions expression and id-expression, and wouldn't an identifier match both of those?

In Cpp2, my intent is that the productions are tried in order. By deterministically taking the first match, we can eliminate any ambiguity when input could match more than one production. In the template-argument example, this means that "if it can be an expression, it is." (Similarly, statement has several productions, and some program text could match more than one production; the parsing rule is "if it can be a some earlier production, it is.") (***)

Here's the parse.h code for template-argument:

if (auto e = expression(false)) {   // disallow unparenthesized relational comparisons in template args
    term.arg = std::move(e);
}
else if (auto i = id_expression()) {
    term.arg = std::move(i);
}

Note that this snippet's comment also answers the second example... next:

Or what about a<b,c>d, doesn't that depend on whether a is a template or a comparable object?

Ah, templates and commas and relational comparisons, oh my! In today's C++ these are a delightful (let's call it that) parse because of the comma operator (note this can be visually rather than actually ambiguous) and the overloading of the < and > tokens for template parameter/argument lists and relational comparisons.

In Cpp2 my intent is to make this a non-problem (famous last words!) for both actual and visual ambiguity:

The reason I deliberately disallow unparenthesized relational comparisons in template arguments is as a simple way to avoid the many subtle issues that have complicated today's C++ parsers in this area.(**) For example, a regularly cited one is this minor variation of the above example:

In Cpp2 that won't compile as is, because a<b> parses as a template-id, which can't be followed by an identifier like c. But it works fine with the required parentheses:

a < ( b > c ) , d >   // ok

which parses a<...> as a template-id with two arguments, b>c and d.


(*) Template < > syntax, other options considered

I also considered other options.

For a while I tried making template parameter/argument lists use [ and ] instead of < and >. This had some theoretical benefits:

However, eventually I decided < > was better for two reasons:

Also, I know that some languages put a prefix on template/generic argument lists (e.g., !). By itself, that doesn't solve the problems here if we still use < and >, and (admittedly a matter of my personal taste) I considered it a bit ugly, so I didn't want to use it except as a last resort if nothing else worked. I think the current solution is better for Cpp2... other solutions may well be better for other languages, so this is not at all a diss of other languages' choices!


(**) Why < > , make parsing today's C++ hard

Even though in today's C++ grammar many of challenging cases aren't actually ambiguous, they still surely are:

This comes up even if you're not writing a full C++ parser, just an approximate "tag" parser that tries to be much faster than an accurate C++ parser. In the past decade I've seen several efforts where people write new C++ parsers of both kinds, accurate and approximate, and each parser author I've talked to raised this class of example. The problem examples are just common enough in real code that even an approximate parser has to do something sensible with them.

Note these complications don't arise in C#, even though C# also uses < and > for generics. That's because C# doesn't have non-type template arguments (or the capability to do type computations that would involve relational comparison) and so a relational expression naturally cannot occur in a generic argument in C#. C# also doesn't have comma-expressions. So there seems to be some prior art experience that simply banning (unparenthesized) relational comparisons inside template argument lists, and for good (visual) measure banning comma-expressions everywhere, would similarly sidestep the issue in Cpp2.


(***) Parsing issues in today's C++

The "if it can be an expression, it is" disambiguation I describe above for Cpp2 (see where this note is referenced) is completely unlike today's vexing parse "if it can be a declaration, it is" disambiguation, which is truly awful because it requires directing the parse production selection based on whether a name is a type or not... which means today it requires name lookup (semantic analysis!) to decide the parse, which is totally backward from the proper "lex -> parse -> sema" order with all steps independent.

If you're not familiar with issues in today's C++, here are two examples.

Basic vexing parse example (today's C++, not Cpp2)

Here's my standard vexing parse example... see Godbolt link:

// Illustrate sema (name lookup) changing parsing in today's C++
// Note that #if 1 vs. #if 0 give totally different parse trees
// https://godbolt.org/z/eoo53hxje

struct a {};

#if 1
    a c;
#else
    struct c {};
#endif

int main() {
    a b(c);
    if constexpr( std::is_function_v<decltype(b)> ) {
        std::cout << "it's a function\n";
    }
    else {
        std::cout << "it's not a function\n";
    }
}

// #if 1, prints "it's not a function"

// #if 0, prints "it's a function"
Thomas Neumann's templates-and-commas parse change example (today's C++, not Cpp2)

Thomas Neumann gives the following similar example that illustrates fun with comma operators and templates... see Godbolt link:

// Illustrate sema (name lookup) changing parsing in today's C++
// Note that #if 1 vs. #if 0 give totally different parse trees
// https://godbolt.org/z/4Kzvz6xoP

#if 1
    template <int T, int T2>
    struct foo {};
#else
    constexpr int foo = 2;
    constexpr int a = 3;
#endif

int main() {
    foo<1,2> a;
}

// #if 1, gcc and Clang emit this warning:
//  unused variable 'a'

// #if 0, gcc and Clang emit this warning:
//   left operand of comma operator has no effect

Today, code like a b(c); and foo<1,2> a; is unambiguous, but very difficult to figure out (in roughly increasing order of importance):

Cpp2 strictly avoids this, and never requires sema to guide parsing. When I say "context-free parsing" this is the primary thing I have in mind... the compiler (and the human) can always parse the current piece of code without getting non-local information from elsewhere.

Aside: I've heard people complain about unbounded lookahead as another parsing issue in today's C++, but that just isn't a major problem in practice in my experience. Sure, it's always nice to reduce lookahead, but lookahead doesn't create major problems for compilers, tools, or humans to anywhere near the degree that (a) preprocessor macros and (b) parse-sema inversion/entanglement do. When I hear from people who are trying to write new parsers or new refactoring tools for C++, I regularly hear them bemoan macros and parse-sema inversion/entanglement, but I don't remember ever hearing them complain about lookahead. YMMV.