Codereview request for 7189363: Regex Pattern compilation buggy for special sequences (original) (raw)
Xueming Shen xueming.shen at oracle.com
Wed Aug 8 21:58:37 UTC 2012
- Previous message: [8] Review request for JEP 127: Improve Locale Data Packaging and Adopt Unicode CLDR Data
- Next message: Codereview request for 7189363: Regex Pattern compilation buggy for special sequences
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi
It appears the optimization (flatten a binary tree structure into a branch/ switch) we put into JDK6 for alternation operation [1] has problem if the first construct is a group "(...)" followed by a greedy/reluctant "once or not at all" quantifier. For example regex "(a)?bc|d" or "(a)??bc|d" can't be found for input "d", while "a?bc|d" or "a??bc|d" just works fine.
The root cause is that expr() mistakenly to use the Branch node from the sub-expression (in above case, the branch node for "(a)?" instead of using its own Branch node (because the incorrect use of "prev instance of Branch", see the diff), which basically messes up the internal regex node tree [2]
The webrev is at http://cr.openjdk.java.net/~sherman/7189363/webrev
Thanks, -Sherman
PS: I'm surprised this bug has not been noticed until now. One of the reasons is that the "messed up" regex node tree actually works for most of the cases for this particular regex, for example
Pattern.compile("(a)?bc|d").matcher("d").matches() Pattern.compile("(a)?bc|d").matcher("bc").matches() Pattern.compile("(a)?bc|d").matcher("abc").matches() and even Pattern.compile("(a)?bc|de").matcher("de").find()
all return true. The "find()" for single "d" fails only because the "minimum length" check (see [2] for details, if interested)
[1] http://cr.openjdk.java.net/~sherman/5013651_6271399_6342544/ [2] http://cr.openjdk.java.net/~sherman/7189363/nodes
- Previous message: [8] Review request for JEP 127: Improve Locale Data Packaging and Adopt Unicode CLDR Data
- Next message: Codereview request for 7189363: Regex Pattern compilation buggy for special sequences
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]