lexer: Treat more floats with empty exponent as valid tokens by richard-uk1 · Pull Request #131656 · rust-lang/rust

Summary

This PR changes the lexer to allow more tokens containing a number followed by 'e' with no exponent value (see #131656 (comment)). I'll use the RFC template but keep it short.

This PR is a continuation of #79912 and solves the same problem as #111628. It also adds tests for various edge cases and improves diagnostics marginally/subjectively.

Motivation

When Rust parses a number, it is allowed to have an arbitrary suffix. If this suffix is a number literal suffix (u8, i16, f32 etc), the compiler will use the corresponding type when parsing the number. If the suffix is not a recognized literal suffix, then the compiler rejects the number. This rejection happens after any proc macros have been run on the source code, enabling proc macros to interpret number suffixes as they wish. This means, for example, it is possible to parse 1px into some structure like { number: f64, unit: &str }, for example { number: 1.0, unit: Pixels }. However, this is not possible for suffixes that begin with e or E, because in this case the lexer attempts to parse an exponent before proc macros are run, meaning the code is rejected before the proc macro authors have a chance to interpret it.
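As a rough illustration of the kind of suffix handling a proc macro might do (this is not code from the PR), the sketch below splits the text of a literal into a number and a unit. `Dimension` and `parse_dimension` are hypothetical names, and the function works on plain text such as the output of `Literal::to_string()`; it only understands simple decimal numbers.

```rust
/// A number with a unit, as a hypothetical proc macro might represent it.
#[derive(Debug, PartialEq)]
pub struct Dimension<'a> {
    pub number: f64,
    pub unit: &'a str,
}

/// Splits literal text such as "1px" into its numeric part and its suffix.
/// Returns `None` if the text does not start with a parseable number.
pub fn parse_dimension(text: &str) -> Option<Dimension<'_>> {
    // The numeric part is the longest prefix made of digits, '.' and '_'.
    let split = text
        .find(|c: char| !(c.is_ascii_digit() || c == '.' || c == '_'))
        .unwrap_or(text.len());
    let (digits, unit) = text.split_at(split);
    // Underscores are digit separators and carry no value.
    let number: f64 = digits.replace('_', "").parse().ok()?;
    Some(Dimension { number, unit })
}
```

With the old lexer, a macro could receive `1px` and split it this way, but `1em` never reached the macro because the lexer rejected it first.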

This PR removes the special case for e/E, so numeric suffixes like em can be used in proc macros. An application example (given in one of the above issues) is CSS colors. Simplifying somewhat, a color looks like #xxxxxx where x is a hexadecimal character [0-9a-fA-F]. Almost all colors are already valid Rust tokens: for example, #123abc is interpreted as # followed by the number 123 with suffix abc, while #abc123 is # followed by the identifier abc123. There is one exception to this: cases with one or more digits followed by 'e' (for example #1eabcd), where the lexer expects an exponent. With this PR, all colors are valid tokens, and so HTML colors can be parsed in proc macros without using strings or other syntactic clutter.

Guide-level explanation

Before this PR, it was not possible to have numbers with suffixes starting with the 'e'/'E' character. This is because when the lexer saw the 'e' character, it tried to parse an exponent, and in a case such as "1em" it failed, resulting in a compiler error before a proc macro received the input tokens. After this PR, the same check is carried out, but after proc macros have been run on the input.

Reference-level explanation

This PR ensures that invalid exponents are passed to the parser as suffixes for numbers rather than being reported as an error by the lexer. The parser then checks whether the suffix begins with a valid exponent, and if so parses the float. Otherwise, if the suffix is invalid (not u8, i16 etc.) the compiler will reject the number with an 'invalid suffix' error.
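The check described above could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code in the PR: `Suffix`, `classify_suffix`, `exponent_len` and `NUMERIC_TYPES` are hypothetical names, and only unsigned exponents are handled because a suffix has the form of an identifier.

```rust
/// What a number's suffix turns out to be once it is inspected.
#[derive(Debug, PartialEq)]
pub enum Suffix<'a> {
    /// The suffix begins with a well-formed exponent, e.g. "e10" or "e_5f32".
    Exponent { exponent: &'a str, rest: &'a str },
    /// A built-in numeric type such as "u8" or "f64".
    NumericType(&'a str),
    /// Anything else; this is where an 'invalid suffix' error would be raised.
    Invalid(&'a str),
}

const NUMERIC_TYPES: &[&str] = &[
    "u8", "u16", "u32", "u64", "u128", "usize",
    "i8", "i16", "i32", "i64", "i128", "isize",
    "f32", "f64",
];

pub fn classify_suffix(suffix: &str) -> Suffix<'_> {
    if let Some(len) = exponent_len(suffix) {
        let (exponent, rest) = suffix.split_at(len);
        return Suffix::Exponent { exponent, rest };
    }
    if NUMERIC_TYPES.iter().any(|ty| *ty == suffix) {
        Suffix::NumericType(suffix)
    } else {
        Suffix::Invalid(suffix)
    }
}

/// Byte length of a leading exponent: 'e' or 'E', then digits and
/// underscores containing at least one digit. `None` if there is none.
fn exponent_len(s: &str) -> Option<usize> {
    let bytes = s.as_bytes();
    if !matches!(bytes.first().copied(), Some(b'e' | b'E')) {
        return None;
    }
    let mut i = 1;
    let mut seen_digit = false;
    while let Some(&b) = bytes.get(i) {
        match b {
            b'0'..=b'9' => seen_digit = true,
            b'_' => {}
            _ => break,
        }
        i += 1;
    }
    seen_digit.then_some(i)
}
```

In this sketch `em` falls through to `Invalid`, which corresponds to the 'invalid suffix' diagnostic, while a proc macro that consumed the token earlier would never reach this check.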

Exponents that contain arbitrarily long runs of underscores are handled without read-ahead: the lexer tracks where the exponent started, so that if the exponent turns out to be invalid, the suffix start is still correct. This is very much an edge case (the user would have to write something like 1e_______________23) but nevertheless it is handled correctly.
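A minimal sketch of the position-tracking idea, assuming a simplified literal made of an integer part and an optional unsigned exponent (`suffix_start` is a hypothetical helper, not the lexer's actual function):

```rust
/// Returns the byte offset at which the suffix of a decimal literal starts.
pub fn suffix_start(text: &str) -> usize {
    let bytes = text.as_bytes();
    let mut pos = 0;
    // Integer part: digits and underscores.
    while pos < bytes.len() && (bytes[pos].is_ascii_digit() || bytes[pos] == b'_') {
        pos += 1;
    }
    if pos < bytes.len() && (bytes[pos] == b'e' || bytes[pos] == b'E') {
        // Remember where the would-be exponent begins before consuming it.
        let exponent_start = pos;
        pos += 1;
        let mut seen_digit = false;
        // Consume underscores and digits in one forward pass.
        while pos < bytes.len() && (bytes[pos].is_ascii_digit() || bytes[pos] == b'_') {
            seen_digit |= bytes[pos].is_ascii_digit();
            pos += 1;
        }
        if !seen_digit {
            // Empty exponent: everything from 'e' onwards is the suffix.
            return exponent_start;
        }
    }
    pos
}
```

Because the start of the exponent is saved up front, the scan never needs to look ahead past the underscores to decide where the suffix begins; it simply falls back to the saved offset if no digit was found.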

Drawbacks

I don't think there are any obvious drawbacks to doing this. Existing diagnostics are maintained, or in one case subjectively slightly improved (1em would give "invalid suffix" rather than "empty exponent"). The lexer still has bounded read-ahead (the new patch requires 2-character lookahead). The new code is perhaps slightly more complex, but I have endeavoured to document it clearly. Perf testing showed no regression.

Rationale and alternatives

The alternatives are to do nothing or to implement #111628 (comment). Implementing that proposal is a long-term goal, but it will take more work, and someone motivated to do that work. This PR fixes the specific issue with exponents without waiting for that work to be completed.

Prior art

There are a number of previous attempts to implement this functionality: #111628, #79912.

Unresolved questions

None

Future possibilities

A more wholesale refactor of the lexer is proposed here: #111628 (comment). This PR does not block any future work towards that goal.

r: @petrochenkov, since they reviewed #79912.