feat: add full Unicode support for Javascript identifiers by ctjlewis · Pull Request #3647 · google/closure-compiler
Further resolves #3639. Aims to prevent 99.999% of future issues that might occur from using non-English characters in an identifier (other than the few Greek and Latin characters that are manually included). In the current build, the following errors occur (including a known issue with U+FF3F):
ERROR - [JSC_PARSE_ERROR] Parse error. Character '〩' (U+3029) is not a valid identifier start char
1| var a〩 = 10;
^
ERROR - [JSC_PARSE_ERROR] Parse error. Character '〺' (U+303A) is not a valid identifier start char
1| var 〺b = 10;
^
The caret is marked in the wrong spot for this last example, but the message makes it clear that U+FF3F is causing the problem:
ERROR - [JSC_PARSE_ERROR] Parse error. Character '＿' (U+FF3F) is not a valid identifier start char
1| var ａｅｓｔｈｅｔｉｃ＿;
^
After:
var a〩 = 10;
var a\u3029=10;
var 〺b = 10;
var \u303ab=10;
var ａｅｓｔｈｅｔｉｃ＿;
var \uff41\uff45\uff53\uff54\uff48\uff45\uff54\uff49\uff43\uff3f;
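For reference, the escaped output lines above can be reproduced with a small helper. This is only a sketch: `IdentifierEscapeSketch` and `escapeNonAscii` are hypothetical names and not the compiler's actual code printer, and the sketch does not strip whitespace the way the compiler does, so it is fed already-compact input.

```java
class IdentifierEscapeSketch {
  /** Replaces every UTF-16 code unit above 0x7F with a lowercase four-digit hex escape. */
  static String escapeNonAscii(String source) {
    StringBuilder out = new StringBuilder(source.length());
    for (int i = 0; i < source.length(); i++) {
      char unit = source.charAt(i);
      if (unit < 0x80) {
        out.append(unit); // ASCII passes through unchanged
      } else {
        // Emit a backslash, the letter u, then four lowercase hex digits.
        out.append(String.format("\\u%04x", (int) unit));
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // The non-ASCII input character is written as a Java escape to keep this file ASCII-only.
    System.out.println(escapeNonAscii("var a\u3029=10;"));
  }
}
```

Working on UTF-16 code units rather than code points means a character outside the Basic Multilingual Plane would come out as a surrogate pair of two escapes, which is also valid in JavaScript source.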
Existing logic is retained for isIdentifierStart and isIdentifierPart, but if the character exists outside of the hardcoded ranges it is checked against regexpu-compiled patterns provided by TC39.
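A minimal sketch of that "fast path first, regex fallback second" shape is below. It is not the code in this PR: the class name is hypothetical, the fast path covers only ASCII rather than the existing hardcoded ranges, and the patterns approximate ID_Start and ID_Continue with Unicode general categories instead of using the regexpu-compiled patterns.

```java
import java.util.regex.Pattern;

class IdentifierFallbackSketch {
  // Assumption: general categories L and Nl stand in for ID_Start.
  private static final Pattern ID_START = Pattern.compile("[\\p{L}\\p{Nl}]");
  // Assumption: adding Mn, Mc, Nd and Pc stands in for ID_Continue.
  private static final Pattern ID_CONTINUE =
      Pattern.compile("[\\p{L}\\p{Nl}\\p{Mn}\\p{Mc}\\p{Nd}\\p{Pc}]");

  static boolean isIdentifierStart(int ch) {
    if (ch < 0x80) {
      // Fast path: cheap comparisons, no regex involved.
      return ('a' <= ch && ch <= 'z') || ('A' <= ch && ch <= 'Z') || ch == '$' || ch == '_';
    }
    // Slow path: only reached for characters outside the fast-path range.
    return ID_START.matcher(new String(Character.toChars(ch))).matches();
  }

  static boolean isIdentifierPart(int ch) {
    if (ch < 0x80) {
      return isIdentifierStart(ch) || ('0' <= ch && ch <= '9');
    }
    // ECMAScript also allows ZWNJ (U+200C) and ZWJ (U+200D) inside identifiers.
    return ch == 0x200C || ch == 0x200D
        || ID_CONTINUE.matcher(new String(Character.toChars(ch))).matches();
  }

  public static void main(String[] args) {
    System.out.println(isIdentifierStart(0x3029)); // Hangzhou numeral nine, category Nl
    System.out.println(isIdentifierStart(0xFF3F)); // fullwidth low line, category Pc
    System.out.println(isIdentifierPart(0xFF3F));
  }
}
```

Under this approximation U+3029 and U+303A are accepted as start characters, while U+FF3F is accepted only as a part character, which is all the fullwidth example needs since the underscore sits at the end of the identifier.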
I would not expect the regexpu approach to be very performant, but all previous optimizations were migrated into the UnicodeMatch class, so the only impact is when a previously unsupported character is used (which would have thrown an error and exited altogether prior to this patch). I will likely transform the matches into large (a <= ch & ch <= b) blocks to further optimize, but I wanted to put this up for review and feedback (especially regarding short-circuiting and branching conditions, which I am not super familiar with).
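To make the short-circuiting question concrete, here is a sketch of the two ways such a range block could be written in Java. The class name and the four ranges (a handful of letter-number ranges) are illustrative assumptions, not the generated chain. On boolean operands, Java's `&` and `|` always evaluate both sides, whereas `&&` and `||` stop as soon as the result is known.

```java
class RangeChainSketch {
  // Eager form: every comparison in the chain runs, whatever ch is.
  static boolean matchesEager(int ch) {
    return (0x16EE <= ch & ch <= 0x16F0)
        | (0x2160 <= ch & ch <= 0x2182)
        | (0x3021 <= ch & ch <= 0x3029)
        | (0x3038 <= ch & ch <= 0x303A);
  }

  // Short-circuit form: evaluation stops at the first range that contains ch,
  // and each range stops early when ch is below its lower bound.
  static boolean matchesShortCircuit(int ch) {
    return (0x16EE <= ch && ch <= 0x16F0)
        || (0x2160 <= ch && ch <= 0x2182)
        || (0x3021 <= ch && ch <= 0x3029)
        || (0x3038 <= ch && ch <= 0x303A);
  }

  public static void main(String[] args) {
    // Both forms must agree on every code point; only the amount of work differs.
    boolean agree = true;
    for (int ch = 0; ch <= Character.MAX_CODE_POINT; ch++) {
      agree &= matchesEager(ch) == matchesShortCircuit(ch);
    }
    System.out.println(agree);
  }
}
```

The two methods return the same result for every input, so the choice is purely about cost: with hundreds of ranges the eager form pays for every comparison on every call, while the short-circuit form trades that for more branches.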
Apologies if this is an unwanted PR, I just figured I'd take a crack at it.
Update
I figured manually dumping a chain of bitwise operators would be faster than regexpu, but the compiled regex is consistently faster than the raw bitwise chain unless short-circuiting is used: https://repl.it/@christiantjl/UnicodeMatch
Jupyter notebook for generating the bitwise chain at: https://github.com/christiantjl/unicode-category-ranges/blob/master/Unicode.ipynb