Codereview request for 7014640: To add a metachar \R for line ending and character classes for vertical/horizontal ws \v \V \h \H

Xueming Shen xueming.shen at oracle.com
Sat Apr 21 07:56:13 UTC 2012


Hi

Here are the webrev and blenderrev for the proposed change to add 5 new regex constructs: \R, \v, \V, \h and \H.

\R, recommended by Unicode Regex TR#18 for matching all line-ending characters and sequences, is equivalent to ( \u000D\u000A | [\u000A\u000B\u000C\u000D\u0085\u2028\u2029] )
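As a rough illustration of that equivalence (not part of the webrev), the sketch below compares \R against the expansion written out by hand. It assumes a JDK in which \R is available; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

public class LineBreakEquivalence {
    // The expansion quoted in the message, written out as an ordinary regex.
    static final Pattern EXPANDED = Pattern.compile(
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]");

    // The proposed shorthand.
    static final Pattern LINEBREAK = Pattern.compile("\\R");

    // True when both patterns agree on whether s is exactly one line ending.
    static boolean agree(String s) {
        return EXPANDED.matcher(s).matches() == LINEBREAK.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "\r\n", "\n", "\u000B", "\f", "\r", "\u0085", "\u2028", "\u2029", "a", "\n\r"
        };
        for (String s : samples) {
            System.out.println(s.length() + " char(s), patterns agree: " + agree(s));
        }
    }
}
```

Note that "\n\r" is in the sample list on purpose: it is two separate line endings, so neither pattern should match it as a whole.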

\h and \v match any character considered to be horizontal or vertical whitespace, respectively; \H and \V match any character that is not.
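A small sketch of how these four constructs behave on single characters, assuming a JDK that supports them. WhitespaceKind and classify are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

public class WhitespaceKind {
    static final Pattern HORIZONTAL = Pattern.compile("\\h");
    static final Pattern VERTICAL = Pattern.compile("\\v");

    // Classifies one code point as horizontal whitespace, vertical whitespace, or neither.
    static String classify(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        if (HORIZONTAL.matcher(s).matches()) return "horizontal";
        if (VERTICAL.matcher(s).matches()) return "vertical";
        return "other";
    }

    public static void main(String[] args) {
        int[] samples = {'\t', ' ', '\n', 0x0B, 'a'};
        for (int cp : samples) {
            System.out.println(String.format("U+%04X", cp) + " -> " + classify(cp));
        }
        // The upper-case forms are the complements of the lower-case ones.
        System.out.println("\n".matches("\\H")); // a line feed is not horizontal whitespace
        System.out.println("\t".matches("\\V")); // a tab is not vertical whitespace
    }
}
```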

Webrev: http://cr.openjdk.java.net/~sherman/7014640/blenderrev.html

Blenderrev: http://cr.openjdk.java.net/~sherman/7014640/blenderrev.html

New Pattern API: http://cr.openjdk.java.net/~sherman/7014640/Pattern.html

Here are a couple of notes regarding the spec/implementation.

(1) \v was implemented as \u000B ('\013') but never documented (it did not appear in our API doc as a supported construct alongside \t, \r, \n...). Redefining \v as a "general" construct for all vertical whitespace characters might raise some compatibility concerns, since \v now matches more characters. But given that this was a never-documented implementation detail and \u000B is still matched by \v, I would consider this an acceptable behavior change.

(2) A predefined character class can appear inside another character class; for example, you can have [...\v...]. However, since it represents a "class" of characters, it can't be the start or end code point of a range inside a class: you can have [a-b], but you can't have [\h-...] or [...-\h] (an exception will be thrown). \v is a special case: since it was implemented as \u000B (VT), you were able to use it as the start or end value of a range. I feel it's better to keep that working the way it did before, so [\v-\v] works and matches VT in this implementation.
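To make the \v special case concrete, here is a minimal sketch, assuming a JDK that carries this change. RangeInClass and compiles are hypothetical names. The last line only reports what the running JDK does with \h as a range end point rather than asserting a result.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RangeInClass {
    // Returns true if the regex compiles, false if the JDK rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // \v as an ordinary member of a class: any vertical whitespace matches.
        System.out.println("\n".matches("[x\\vy]"));

        // \v as both ends of a range: interpreted as the single character VT.
        System.out.println("\u000B".matches("[\\v-\\v]"));
        System.out.println("\n".matches("[\\v-\\v]"));

        // \h as a range end point: the note above says this is rejected.
        System.out.println("[a-\\h] compiles: " + compiles("[a-\\h]"));
    }
}
```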

(3) The newly added \h, \v, \H and \V constructs are all "Unicode versions" of character classes, while the rest of the "predefined character class" family (\d \D \s \S \w \W) is ASCII-only; you have to turn on the UNICODE_CHARACTER_CLASS flag to get the Unicode version of those constructs. This is kind of inconsistent. Perl's corresponding constructs work in a similar way: all of Perl's \d \D \s \S \w \W \v \V \h \H are Unicode versions, and the /a modifier turns \d \D \s \S \w \W back to ASCII mode, but not \h \v \H \V. We had the discussion back in JDK7 regarding ASCII vs. Unicode for these constructs; the decision then was to keep the predefined character classes (and POSIX character classes) ASCII by default, and to add the UNICODE_CHARACTER_CLASS flag to turn them into their Unicode versions. Given that there is NO ASCII version in Perl, and we never had an ASCII version in Java regex that could trigger compatibility concerns, I feel it might be better to just have a simple Unicode version of \h \v \H \V.
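The inconsistency can be seen with a single non-ASCII space. The sketch below uses U+2003 (EM SPACE) as the sample character and assumes a JDK with both UNICODE_CHARACTER_CLASS and \h; the class and helper names are invented for illustration.

```java
import java.util.regex.Pattern;

public class AsciiVsUnicode {
    // EM SPACE: whitespace in Unicode, but outside the ASCII range.
    static final String EM_SPACE = "\u2003";

    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // \s is ASCII-only unless the flag is set.
        System.out.println(matches("\\s", 0, EM_SPACE));
        System.out.println(matches("\\s", Pattern.UNICODE_CHARACTER_CLASS, EM_SPACE));

        // \h is Unicode-aware without any flag.
        System.out.println(matches("\\h", 0, EM_SPACE));
    }
}
```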

(4) \R is not a character class, since it can match the two-character sequence \r\n.
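A quick way to see the difference from \v, sketched under the assumption that both constructs are available in the running JDK (the class and method names are hypothetical): \R consumes \r\n as one line ending, while \v only ever consumes one character at a time.

```java
public class LineBreakIsSequence {
    // Number of pieces produced when input is split on the given regex.
    static int pieces(String input, String regex) {
        return input.split(regex).length;
    }

    public static void main(String[] args) {
        String text = "a\r\nb";

        // \R treats CR LF as a single separator.
        System.out.println(pieces(text, "\\R"));

        // \v sees two separators, leaving an empty string between them.
        System.out.println(pieces(text, "\\v"));

        // A single \v cannot match the two-character sequence; \R can.
        System.out.println("\r\n".matches("\\v"));
        System.out.println("\r\n".matches("\\R"));
    }
}
```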

This one will need to go through the CCC process.

Thanks, -Sherman


