[css-syntax] Review requested of new Parsing text · Issue #8834 · w3c/csswg-drafts

Since we resolved on the new Nesting behavior (try to parse as a declaration, then parse as a rule if that's invalid), I had to do some decent rewriting of Syntax's algorithms, and went ahead and dove into a larger rewrite to clean it up in general. I've implemented the new text in my CSS parser library and the (fairly limited, admittedly) tests I've run look good, but I'd appreciate a larger review.

Significant changes from the previous version:

- Algorithm structure generally changed; rather than consuming a token and often reconsuming for another algorithm to deal with, it just always relies on lookahead and doesn't consume tokens until they're actually going to be used for certain. This should better resemble how an actual parser works. (I haven't changed the tokenizer to this structure, but doing so is probably a good idea at some point.)
- Previously, I had "consume a list of rules" for rule+at-rules and "consume a list of declarations" for declarations+at-rules. Stylesheets and things like @media used "list of rules"; style rules and things like @font-face used "list of declarations". I've shifted all blocks to just use the new "consume a block's contents", and since stylesheets are now the only user of "list of rules", renamed it to "consume a stylesheet's contents" and specialized it to always ignore the CDO/CDC tokens.
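
To make the lookahead-based structure concrete, here is a minimal sketch of a token stream that only peeks until a token is definitely wanted, and that can mark a position and rewind to it when a speculative parse fails. The class and method names are hypothetical and chosen for illustration; they are not taken from the spec text or from any parser library, and tokens are plain strings for simplicity.

```python
EOF = object()  # sentinel returned when the stream is exhausted (illustrative)


class TokenStream:
    """Toy token stream built around lookahead instead of reconsuming."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._index = 0
        self._marks = []  # stack of saved positions

    def next_token(self):
        """Peek at the next token without consuming it."""
        if self._index < len(self._tokens):
            return self._tokens[self._index]
        return EOF

    def consume(self):
        """Return the next token and advance past it."""
        token = self.next_token()
        if token is not EOF:
            self._index += 1
        return token

    def mark(self):
        """Remember the current position before a speculative parse."""
        self._marks.append(self._index)

    def restore_mark(self):
        """Rewind to the most recent mark (the speculative parse failed)."""
        self._index = self._marks.pop()

    def discard_mark(self):
        """Drop the most recent mark (the speculative parse succeeded)."""
        self._marks.pop()


stream = TokenStream(["bar", "{", "}"])
stream.mark()
stream.consume()       # speculatively read "bar"
stream.consume()       # ... and "{"
stream.restore_mark()  # give up and rewind
first_after_rewind = stream.next_token()
```

With this shape, "try to parse as a declaration, then parse as a rule" becomes: mark, attempt the declaration, and either discard the mark on success or restore it and parse a rule from the same position.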

Aside from allowing the new nesting behavior, all of these changes should be purely editorial, with one exception: blocks that previously contained only rules (@media, @keyframes, etc.) used "consume a list of rules", but now use the unified "consume a block's contents", which changes their error recovery in the face of semicolons.

For example, @media { garbage; bar {...} } would previously contain a style rule whose selector is "garbage; bar". (This is still what happens at the top level of a stylesheet.) Now the rule's selector will be just "bar", since the "garbage;" part gets dropped as an invalid declaration. This means that rules which were accidentally invalid and dropped due to garbage might now be valid, if a semicolon separates them from the preceding garbage.
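
The difference in recovery can be illustrated with a deliberately tiny sketch. It is not the spec algorithm: it works on raw characters rather than tokens, ignores strings, comments, and at-rules, and only reports the preludes of the qualified rules it finds. The function name and the `semicolon_ends_item` flag are hypothetical, introduced only to switch between the two behaviors.

```python
def rule_preludes(contents: str, semicolon_ends_item: bool) -> list[str]:
    """Return the prelude of every qualified rule found in `contents`.

    semicolon_ends_item=False approximates the old "list of rules" recovery,
    where a prelude runs up to the next "{" no matter what it contains.
    semicolon_ends_item=True approximates the new "block's contents"
    recovery, where a ";" seen before any "{" ends the item as a
    declaration (valid or not), which this toy simply does not report.
    """
    preludes = []
    i, start, n = 0, 0, len(contents)
    while i < n:
        ch = contents[i]
        if ch == ";" and semicolon_ends_item:
            start = i + 1  # drop everything since the last item boundary
            i += 1
        elif ch == "{":
            preludes.append(contents[start:i].strip())
            depth = 1  # skip the rule's block, tracking nested braces
            i += 1
            while i < n and depth:
                if contents[i] == "{":
                    depth += 1
                elif contents[i] == "}":
                    depth -= 1
                i += 1
            start = i
        else:
            i += 1
    return preludes


block = "garbage; bar { color: red; }"
old_result = rule_preludes(block, semicolon_ends_item=False)
new_result = rule_preludes(block, semicolon_ends_item=True)
```

Under these assumptions, `old_result` is `["garbage; bar"]` and `new_result` is `["bar"]`, matching the before and after selectors described above.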

I suspect this is fine, and I'd really like it to be, because it means the overall parsing behavior doesn't need to branch on grammar knowledge (and thus, whether a rule is known or unknown won't change its generic parsing). It used to be the case that parsing depended on this kind of knowledge, and it was super awkward to use in tooling. Also, it means that parsing doesn't change between a top-level @media and a nested one, except for declarations becoming valid instead of invalid; all the rules inside of the @media remain precisely the same.

But if necessary, we can hardcode some at-rules to trigger a different parsing behavior that preserves backwards compatibility more completely.

(Technically parsing in general depends on grammar knowledge anyway, since you need to know whether a declaration is valid in a given context to tell if you should try and redo parsing as a rule. But it turns out there's a simple and reliable rule you can use generically to get approximately the right behavior without having to know anything about grammars.)