6990617: Regular expression doesn't match if unicode character next to a digit.
Stephen Flores stephen.flores at oracle.com
Sat Dec 10 06:05:33 UTC 2011
- Previous message: array and diamond
- Next message: 6990617: Regular expression doesn't match if unicode character next to a digit.
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Please review the following webrev (includes new test to demonstrate the bug):
http://cr.openjdk.java.net/~sflores/6990617/
for bug: 6990617 Regular expression doesn't match if unicode character next to a digit.
A DESCRIPTION OF THE PROBLEM :
Unicode characters can be written as a backslash followed by an octal number. For instance, one could write:

    Pattern p = Pattern.compile("\\011some text\\012");
    Matcher m = p.matcher("\tsome text\n");
    System.out.println(m.find()); // yields "true"

However, if the string to match has a digit right after such a character, the pattern does not match, whether or not we "quote" that part of the regular expression. Note the "1" next to the tab character (octal 011):

    Pattern p = Pattern.compile("\\011\\Q1some text\\E\\012");
    Matcher m = p.matcher("\t1some text\n");
    System.out.println(m.find()); // yields "false"
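This happens because Pattern accepts either \0011 or \011 for the same character, so once the quoting is removed the "1" that follows \011 is read as part of the octal escape (\0111). From the javadoc: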
\0nn The character with octal value 0nn (0 <= n <= 7)
\0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
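To make the ambiguity concrete, here is a small sketch using only java.util.regex.Pattern. The class name OctalEscapeDemo and the helper matches() are made up for illustration. It shows that \011 and \0011 both denote tab, while a "1" placed directly after \011 is absorbed into the escape and yields octal 111, the letter 'I'.

```java
import java.util.regex.Pattern;

// Illustration only: class and helper names are not part of the JDK or the webrev.
class OctalEscapeDemo {
    // Returns true if the regex matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // \011 (form \0nn) and \0011 (form \0mnn) are both octal 11, the tab character.
        System.out.println(matches("\\011", "\t"));    // true
        System.out.println(matches("\\0011", "\t"));   // true

        // A digit directly after \011 extends the escape: \0111 is octal 111, i.e. 'I'.
        System.out.println(matches("\\0111", "I"));    // true
        System.out.println(matches("\\0111", "\t1"));  // false
    }
}
```

This is the same text that the quoted pattern turns into if the \Q and \E markers are dropped without protecting the leading digit.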
Evaluation:
Pattern.RemoveQEQuoting() does NOT handle this situation correctly. The existing implementation simply copies all ASCII.isAlnum() characters when handling a quote.
Description of fix:
In the method Pattern.RemoveQEQuoting, any ASCII digit at the
beginning of a quote will now be prefixed by "\x3" (not just
because this would be a backref), so that the digit is written
as a hex escape: "1" becomes "\x31". 0x30 is the ASCII code
for '0'.
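The idea of the fix can be sketched roughly as follows. This is not the code in the webrev: QuoteRemovalSketch and removeQuoting() are hypothetical names, and the logic is a simplified stand-in that only assumes the behaviour described above (letters and digits are copied, other quoted characters get a backslash, and a digit that opens a quote gets the "\x3" prefix).

```java
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for Pattern.RemoveQEQuoting; not the JDK code.
class QuoteRemovalSketch {
    static String removeQuoting(String regex) {
        StringBuilder out = new StringBuilder();
        boolean inQuote = false;
        boolean beginQuote = false; // true until the first quoted character is emitted
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            boolean hasNext = i + 1 < regex.length();
            if (!inQuote) {
                if (c == '\\' && hasNext) {
                    char next = regex.charAt(i + 1);
                    if (next == 'Q') {           // start of a quote
                        inQuote = true;
                        beginQuote = true;
                    } else {                     // ordinary escape, copy unchanged
                        out.append(c).append(next);
                    }
                    i += 2;
                } else {
                    out.append(c);
                    i++;
                }
            } else if (c == '\\' && hasNext && regex.charAt(i + 1) == 'E') {
                inQuote = false;                 // end of the quote
                i += 2;
            } else {
                if (c >= '0' && c <= '9') {
                    // A digit that opens the quote could be absorbed by a
                    // preceding escape such as \011, so emit it as \x3N.
                    if (beginQuote) out.append("\\x3");
                    out.append(c);
                } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                    out.append(c);               // letters are copied as-is
                } else {
                    out.append('\\').append(c);  // everything else is escaped
                }
                beginQuote = false;
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String rewritten = removeQuoting("\\011\\Q1some text\\E\\012");
        System.out.println(rewritten);
        System.out.println(Pattern.compile(rewritten).matcher("\t1some text\n").find());
    }
}
```

With the prefix in place, the rewritten pattern keeps \011 and the quoted "1" apart, so the example from the bug report matches.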
Thanks,
Steve.