JavaScript, Regex, and Unicode

Not all shorthand character classes and other pieces of JavaScript regex syntax are Unicode-aware. In some cases it can be important to know exactly what certain tokens match, and that's what this post will explore.

According to ECMA-262 3rd Edition, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary (e.g. /a\b/.test("naïve") returns true, because ï does not count as a word character). Actual browser implementations often differ on these points. For example, Firefox 2 treats \d and \D as Unicode-aware, while Firefox 3 fixes this bug, making \d equivalent to [0-9] as in most other browsers.
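As a quick sanity check of the ASCII-only behavior described above, here is a minimal sketch you can paste into a console. The variable names are my own, and the ï is written as the escape \u00ef so the result does not depend on how the file is encoded or normalized.

```javascript
// \b and \w use the ASCII-only definition of "word character",
// so the non-ASCII letter ï (U+00EF) is treated as a non-word character.
const boundaryAfterA = /a\b/.test("na\u00efve");   // is there a boundary between "a" and "ï"?
const accentIsWordChar = /^\w$/.test("\u00ef");    // is ï in [A-Za-z0-9_]?

// \d is specified as [0-9], so a non-ASCII digit such as
// ARABIC-INDIC DIGIT ONE (U+0661) should not match in a conforming engine.
const arabicIndicIsDigit = /^\d$/.test("\u0661");

console.log({ boundaryAfterA, accentIsWordChar, arabicIndicIsDigit });
```

In an engine that follows the spec on these points, the first value should be true and the other two false. A browser with the Firefox 2 bug would presumably report true for the last one.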

Here again are the affected tokens, along with their definitions:

\d — Digits.
\s — Whitespace.
\w — Word characters.
\D — All except digits.
\S — All except whitespace.
\W — All except word characters.
. — All except newlines.
^ (with /m) — The positions at the beginning of the string and just after newlines.
$ (with /m) — The positions at the end of the string and just before newlines.
\b — Word boundary positions.
\B — Not word boundary positions.

All of the above are standard in Perl-derivative regex flavors. However, the meaning of the terms digit, whitespace, word character, word boundary, and newline depends on the regex flavor, character set, and platform you're using, so here are the official JavaScript meanings as they apply to regexes:

Digit — The characters 0-9 only.
Whitespace — Tab, line feed, vertical tab, form feed, carriage return, space, no-break space, line separator, paragraph separator, and "any other Unicode 'space separator'".
Word character — The characters A-Z, a-z, 0-9, and _ only.
Word boundary — The position between a word character and a non-word character, where the start and end of the string count as non-word characters.
Newline — The line feed, carriage return, line separator, and paragraph separator characters.
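One way to read the digit and word character definitions above is that \d and \w are just shorthand for explicit ASCII character classes. The sketch below compares each shorthand against its spelled-out class over a handful of sample characters. The sample list and names are my own and are only illustrative, not an exhaustive test.

```javascript
// Sample characters: ASCII digit, ASCII letter, underscore, accented letter,
// Arabic-Indic digit, space, and no-break space.
const samples = ["7", "x", "_", "\u00e9", "\u0663", " ", "\u00a0"];

// Pair each shorthand with the explicit class it should be equivalent to.
const pairs = [
  { name: "\\d", shorthand: /^\d$/, explicit: /^[0-9]$/ },
  { name: "\\w", shorthand: /^\w$/, explicit: /^[A-Za-z0-9_]$/ },
  { name: "\\D", shorthand: /^\D$/, explicit: /^[^0-9]$/ },
  { name: "\\W", shorthand: /^\W$/, explicit: /^[^A-Za-z0-9_]$/ },
];

// Collect every sample where the shorthand and the explicit class disagree.
const disagreements = [];
for (const { name, shorthand, explicit } of pairs) {
  for (const ch of samples) {
    if (shorthand.test(ch) !== explicit.test(ch)) {
      disagreements.push(name + " differs on U+" + ch.charCodeAt(0).toString(16));
    }
  }
}

console.log(disagreements.length === 0 ? "All samples agree" : disagreements);
```

Any entry in disagreements would point to an engine that departs from the ASCII-only definitions.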

Here again are the newline characters, with their character codes:

\u000a — Line feed — \n
\u000d — Carriage return — \r
\u2028 — Line separator
\u2029 — Paragraph separator

Note that ECMAScript 4 proposals indicate that the C1/Unicode NEL "next line" control character (\u0085) will be recognized as an additional newline character in that standard. Also note that although CRLF (a carriage return followed by a line feed) is treated as a single newline sequence in most contexts, /\r^$\n/m.test("\r\n") returns true.
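To see how a given engine treats the four newline characters, a small probe like the following can help. It is only a sketch with names of my own choosing: for each character it checks whether the dot refuses to match it and whether ^ and $ (with /m) treat it as a line break. It also repeats the CRLF test from the paragraph above and reports on NEL without assuming an answer.

```javascript
const newlines = ["\n", "\r", "\u2028", "\u2029"];

const newlineReport = newlines.map((nl) => ({
  // Format the code unit as \uXXXX for display.
  code: "\\u" + nl.charCodeAt(0).toString(16).padStart(4, "0"),
  // The dot should NOT match a newline character.
  dotMatches: /./.test(nl),
  // With /m, "b" should sit on its own line when preceded by a newline.
  splitsLines: /^b$/m.test("a" + nl + "b"),
}));

// CR and LF are each newlines on their own, so ^ and $ can both match
// at the position between them.
const crlfHasInnerLine = /\r^$\n/m.test("\r\n");

// NEL (U+0085): if the dot matches it, the engine does not treat it as a newline.
const dotMatchesNel = /./.test("\u0085");

console.log(newlineReport, { crlfHasInnerLine, dotMatchesNel });
```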

As for whitespace, ECMA-262 3rd Edition uses an interpretation based on Unicode's Basic Multilingual Plane, from version 2.1 or later of the Unicode standard. Following are the characters which should be matched by \s according to ECMA-262 3rd Edition and Unicode 5.1:

\u0009 — Tab — \t
\u000a — Line feed — \n — (newline character)
\u000b — Vertical tab — \v
\u000c — Form feed — \f
\u000d — Carriage return — \r — (newline character)
\u0020 — Space
\u00a0 — No-break space
\u1680 — Ogham space mark
\u180e — Mongolian vowel separator
\u2000 — En quad
\u2001 — Em quad
\u2002 — En space
\u2003 — Em space
\u2004 — Three-per-em space
\u2005 — Four-per-em space
\u2006 — Six-per-em space
\u2007 — Figure space
\u2008 — Punctuation space
\u2009 — Thin space
\u200a — Hair space
\u2028 — Line separator — (newline character)
\u2029 — Paragraph separator — (newline character)
\u202f — Narrow no-break space
\u205f — Medium mathematical space
\u3000 — Ideographic space
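If you want to check the list above against your own engine, something along these lines should do. This is a rough sketch rather than a conformance test, and the names are hypothetical. Keep in mind that the list reflects Unicode 5.1, so an engine built against a different Unicode version may legitimately disagree on individual characters.

```javascript
// Code points from the list above (ECMA-262 3rd Edition plus Unicode 5.1).
const listedWhitespace = [
  0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, 0x00a0, 0x1680, 0x180e,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
  0x2009, 0x200a, 0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
];

// Record, for each listed code point, whether \s matches it in this engine.
const whitespaceReport = listedWhitespace.map((cp) => ({
  code: "\\u" + cp.toString(16).padStart(4, "0"),
  matched: /^\s$/.test(String.fromCharCode(cp)),
}));

// The listed characters that this engine does not treat as whitespace.
const unmatched = whitespaceReport.filter((r) => !r.matched).map((r) => r.code);

console.log("Listed characters not matched by \\s:", unmatched);
```

An empty array means the engine agrees with the whole list. Anything else tells you which characters to handle explicitly, for example with a custom character class.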

To test which characters or positions are matched by all of the tokens mentioned here in your browser, see JavaScript Regex and Unicode Tests. Note that Firefox 2.0.0.11, IE 7, and Safari 3.0.3 beta all get some of the tests wrong.

Update: My new Unicode plugin for XRegExp allows you to easily match Unicode categories, scripts, and blocks in JavaScript regular expressions.
