[css-syntax] Urange and its problems · Issue #3588 · w3c/csswg-drafts

(migrated from the mailing list, for easier tracking here)

Tab Atkins said:

History: CSS2.1 defined a special grammar token just for unicode
ranges, which was used in exactly one place: the 'unicode-range'
descriptor of @font-face. This special production caused bugs in
pages, where selectors like u+a { ... } were parsed as a
UNICODE-RANGE token, rather than the expected "IDENT(u) DELIM(+)
IDENT(a)", like every other selector of that form was parsed. (This
isn't theoretical - Moz had a bug reported against it for this.)
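
A minimal sketch of the clash described above, assuming the UNICODE-RANGE pattern from the CSS 2.1 token table (`u\+[0-9a-f?]{1,6}(-[0-9a-f]{1,6})?`, matched case-insensitively). The helper name `is_unicode_range_token` is made up for illustration, and this is not a full tokenizer; it only checks whether a piece of text would be swallowed whole as one UNICODE-RANGE token.

```python
import re

# UNICODE-RANGE pattern as listed in the CSS 2.1 token table (assumption:
# reproduced from memory of that table; matched case-insensitively).
UNICODE_RANGE = re.compile(r"u\+[0-9a-f?]{1,6}(-[0-9a-f]{1,6})?", re.IGNORECASE)


def is_unicode_range_token(text: str) -> bool:
    """Return True if all of `text` would be consumed as one UNICODE-RANGE token."""
    return UNICODE_RANGE.fullmatch(text) is not None


if __name__ == "__main__":
    # "u+a" and "u+b" look like sibling selectors but match the token pattern;
    # "p+a" (wrong first letter) and "u+g" (g is not a hex digit) do not.
    for selector in ["u+a", "u+b", "p+a", "u+g"]:
        print(selector, is_unicode_range_token(selector))
```

Only selectors whose left side is a `u` element and whose right side starts with hex digits are affected, which is why the problem shows up so rarely and is so surprising when it does.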

When writing the Syntax spec, I tried to fix this by dropping the
unicode-range concept from the tokenizer, and instead handling it as a
complex construct of the existing tokens, like I did with <an+b>.
This kinda worked initially, but was really nasty. Since then, we
added scinot (scientific notation) to numbers (like 1e3 for 1000), and
this completely destroyed my ability to define it cleanly - I can no
longer use the value of numeric tokens, and instead have to rely on the
"representation", which no browser stores or wants to store.

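To illustrate why the numeric value alone is not enough, here is a small sketch. It assumes that `U+1e3` is tokenized as IDENT(U) followed by a NUMBER token for `+1e3`, and it uses Python's `float()` as a stand-in for the tokenizer's numeric conversion. The function name `value_of_number_token` is hypothetical.

```python
def value_of_number_token(text: str) -> float:
    # Stand-in for a tokenizer's string-to-number step (assumption: float()
    # accepts the same sign/digits/exponent shapes used in this example).
    return float(text)


source = "U+1e3"
number_text = source[1:]  # "+1e3", the part that becomes a NUMBER token

value = value_of_number_token(number_text)  # numeric value the token carries
intended = int("1e3", 16)                   # the code point the author meant

# "+1000" yields the same numeric value, so the value cannot tell the
# parser whether the source said U+1e3 or U+1000.
same_value = value_of_number_token("+1000") == value

if __name__ == "__main__":
    print(value, hex(intended), same_value)
```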
I want to go ahead and resolve this. I can see three options:

1. Keep what I'm currently doing. This requires browsers to hold onto
   the string representation of numeric tokens (numbers and dimensions)
   at least through initial parsing (longer if they're used in a custom
   property).

2. Abandon this effort, go back to having a special unicode-range
   token. Accept that this is weird and there are stupid side-effects,
   like some selectors not working.

3. Define a new syntax that's actually simple to obtain from
   the existing tokens¹. Deprecate the old syntax; require UAs to accept
   the old syntax in the 'unicode-range' descriptor, but don't define how
   they should do so. (Current UAs use context-sensitive retokenizing, I
   think - once they realize they're in a unicode-range descriptor,
   they'll retokenize the original text according to a special set of
   rules.)

Thoughts?

¹ Simplest change is just to replace the + with a -, so you write
U-2016 for ‖. This makes unicode ranges always a single IDENT token,
plus possibly some trailing '?' DELIM tokens. You then have to parse
the token's value to make sure it's a valid range, but that's way, way
easier than the garbage fire I have to deal with from today's syntax.
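
A rough sketch of what parsing the proposed `U-` form could look like, given an IDENT token's value plus a count of trailing `?` DELIM tokens. Everything here is an assumption layered on the footnote: the function name `parse_dash_urange` is hypothetical, the six-digit limit and the 0x10FFFF ceiling are borrowed from the existing unicode-range rules, and the all-wildcard form (no hex digits before the question marks) is left out.

```python
import re
from typing import Optional, Tuple

MAX_CODE_POINT = 0x10FFFF
HEX = r"[0-9a-fA-F]"

# "U-2016" or the digit prefix of a wildcard such as "U-4" (followed by "??").
_SINGLE = re.compile(rf"[uU]-({HEX}{{1,6}})")
# "U-0025-00FF": still one IDENT, since "-" and digits are identifier characters.
_RANGE = re.compile(rf"[uU]-({HEX}{{1,6}})-({HEX}{{1,6}})")


def parse_dash_urange(ident: str, question_marks: int = 0) -> Optional[Tuple[int, int]]:
    """Return (start, end) code points, or None if the input is not a valid range."""
    m = _RANGE.fullmatch(ident)
    if m:
        if question_marks:
            return None  # wildcards cannot follow an explicit range
        start, end = int(m[1], 16), int(m[2], 16)
    else:
        m = _SINGLE.fullmatch(ident)
        if not m or len(m[1]) + question_marks > 6:
            return None
        # Each "?" stands for any hex digit: lowest is 0, highest is F.
        start = int(m[1] + "0" * question_marks, 16)
        end = int(m[1] + "F" * question_marks, 16)
    if start > end or end > MAX_CODE_POINT:
        return None
    return start, end


if __name__ == "__main__":
    print(parse_dash_urange("U-2016"))
    print(parse_dash_urange("U-0025-00FF"))
    print(parse_dash_urange("U-4", question_marks=2))
```

The point of the sketch is that all validation happens on a string the tokenizer already keeps (the identifier's value), so nothing depends on the representation of numeric tokens.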

fantasai said:

Given unicode-range is already shipping
http://caniuse.com/#feat=font-unicode-range
I think #3 is a non-starter.

I would imagine that reparsing unicode-range tokens in order to make
the selectors work would be easier than doing #1, no? Hanging onto
unicode-range tokens would be a lot less memory than hanging onto
numbers and dimensions, given they're used so rarely.

Tab Atkins said:

On Tue, Apr 12, 2016 at 2:27 PM, fantasai <fantasai.lists@inkedblade.net> wrote:

> Given unicode-range is already shipping
> http://caniuse.com/#feat=font-unicode-range
> I think #3 is a non-starter.

You might have misread - #3 is explicitly backwards-compatible. It
requires UAs to support the old syntax, it just doesn't describe how
they would do so.

> I would imagine that reparsing unicode-range tokens in order to make
> the selectors work would be easier than doing #1, no? Hanging onto
> unicode-range tokens would be a lot less memory than hanging onto
> numbers and dimensions, given they're used so rarely.

Yeah, it just means we have to reparse them everywhere except unicode-range.

Florian Rivoal said:

On Apr 13, 2016, at 07:09, Tab Atkins Jr. <jackalmage@gmail.com> wrote:

> On Tue, Apr 12, 2016 at 2:27 PM, fantasai <fantasai.lists@inkedblade.net> wrote:
>
>> Given unicode-range is already shipping
>> http://caniuse.com/#feat=font-unicode-range
>> I think #3 is a non-starter.
>
> You might have misread - #3 is explicitly backwards-compatible. It
> requires UAs to support the old syntax, it just doesn't describe how
> they would do so.

As a UA implementor who has this on the roadmap, I don't like having a spec telling us to do something, without telling us how. All UAs would probably do fine at supporting the old syntax when it is correctly used, but I am much less confident that we'd all pick the same logic for error handling, and it is important that we all react the same way in the face of unknown/incorrect syntax.

>> I would imagine that reparsing unicode-range tokens in order to make
>> the selectors work would be easier than doing #1, no? Hanging onto
>> unicode-range tokens would be a lot less memory than hanging onto
>> numbers and dimensions, given they're used so rarely.
>
> Yeah, it just means we have to reparse them everywhere except unicode-range.

Right, this feels ugly and error prone.

Florian Rivoal said:

On Apr 13, 2016, at 05:37, Tab Atkins Jr. <jackalmage@gmail.com> wrote:

> Keep what I'm currently doing. This requires browsers to hold onto
> the string representation of numeric tokens (numbers and dimensions)
> at least through initial parsing (longer if they're used in a custom
> property).

Does it really require that? Wouldn't it be good enough to hold onto the string representation of numeric tokens only when scinot is used? Given that scinot is pretty rare (and will stay that way), the memory requirement should be lower than storing the string representation of all numeric tokens.

Simon Sapin said:

How about this?

Same as 2, but tweak the Selector grammar to interpret unicode-range
tokens that don’t have question marks as: a type selector "u", followed
by a next-sibling combinator, followed by another type selector.

It’s weird, but it seems less messy to me than the alternatives.

Tab Atkins said:

Yeah. It really fucks up the grammar something fierce, so I think
I'd have to do it as a preprocessing step before matching the actual
Selectors grammar. And anything else that ever wants to use a + is
similarly affected; we seem to have settled on requiring spaces around
math + and I don't expect us to use + for anything else, but custom
properties would be stuck with this gotcha. :/