6990617: Regular expression doesn't match if unicode character next to a digit.
Stephen Flores stephen.flores at oracle.com
Fri Dec 16 06:41:10 UTC 2011
- Previous message: 6990617: Regular expression doesn't match if unicode character next to a digit.
- Next message: (corba) code review for 7046238: new InitialContext(); hangs
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I removed test 1 since test 2 tests the same state.
I updated the webrev.
Steve.
On 12/15/2011 01:44 PM, Xueming Shen wrote:
I would suggest combining removeQEQuotingTest1 and removeQEQuotingTest2 into one method; they look redundant.
Otherwise the change looks fine to me.
-Sherman
On 12/12/2011 08:16 PM, Stephen Flores wrote:
Thanks Sherman,
I have added the regression test for the case below and added a "continue" statement after line 1622 to get the case to pass. I have updated the webrev.
Steve.
On 12/12/2011 02:22 PM, Xueming Shen wrote:
Hi Steve,
The \x3[0-9] approach is interesting; it appears to solve the problem without the higher cost I originally expected (looking back, for example).
The logic of initializing/setting/unsetting "beginQuote" to true/false appears to be incorrect when there are multiple \Qn...\E in one pattern. The setting at Ln#1622 will always be followed by Ln#1630, if I read the code correctly. For example

    Pattern pattern = Pattern.compile("\\011\\Q1sometext\\E\\011\\Q2sometext\\E");
    Matcher matcher = pattern.matcher("\t1sometext\t2sometext");
    System.out.printf("find=%b%n", matcher.find());

will still return false?
-Sherman
On 12/09/2011 10:05 PM, Stephen Flores wrote:
Please review the following webrev (includes new test to demonstrate the bug):
http://cr.openjdk.java.net/~sflores/6990617/
for bug: 6990617 Regular expression doesn't match if unicode character next to a digit.

A DESCRIPTION OF THE PROBLEM:
Unicode characters are represented as \+number. For instance, one could write:

    Pattern p = Pattern.compile("\\011some text\\012");
    Matcher m = p.matcher("\tsome text\n");
    System.out.println(m.find()); // yields "true"

However, if we want to match a string with a digit next to the unicode character, it doesn't match (whether we "quote" the regular expression or not). Note the "1" next to the tab character (unicode 011).

    Pattern p = Pattern.compile("\\011\\Q1some text\\E\\012");
    Matcher m = p.matcher("\t1some text\n");
    System.out.println(m.find()); // yields "false"

This happens because Pattern accepts either \0011 or \011 for the same character. From the javadoc:

    \0nn   The character with octal value 0nn (0 <= n <= 7)
    \0mnn  The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)

Evaluation:
Pattern.RemoveQEQuoting() does NOT handle this situation correctly. The existing implementation simply copies all ASCII.isAlnum() characters when handling a quote.

Description of fix:
In the method Pattern.RemoveQEQuoting, any ASCII digit at the beginning of a quote will now be prefixed by "\x3" (not just "\", because this would be a backref). 0x30 is the ASCII code for '0'.

Thanks,
Steve.
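The ambiguity described above can be reproduced without \Q...\E at all: once the quote markers are stripped and the quoted "1" is copied verbatim, the pattern text becomes \0111, which the octal rule \0mnn reads as a single character (octal 111, the letter 'I') rather than a tab followed by '1'. The following is a minimal sketch of that effect and of the "\x3" + digit idea. The class OctalQuoteSketch and the helpers finds and escapeLeadingDigit are hypothetical names made up for illustration; this is a caller-side approximation, not the code in the webrev or in java.util.regex.Pattern.

```java
import java.util.regex.Pattern;

class OctalQuoteSketch {

    /** Returns true if the regex finds a match anywhere in the input. */
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    /**
     * Hypothetical helper (an assumption, not JDK code): quotes a literal,
     * but emits a leading ASCII digit as the hex escape \x3N so that it
     * cannot be absorbed by a preceding octal escape such as \011.
     */
    static String escapeLeadingDigit(String literal) {
        if (!literal.isEmpty()) {
            char c = literal.charAt(0);
            if (c >= '0' && c <= '9') {
                // '0'..'9' are 0x30..0x39, so "\x3" + digit denotes that digit.
                return "\\x3" + c + Pattern.quote(literal.substring(1));
            }
        }
        return Pattern.quote(literal);
    }

    public static void main(String[] args) {
        // \0111 is one octal escape (0111 = 'I'), not tab + '1'.
        System.out.println(finds("\\0111some text", "Isome text"));

        // Writing the digit as \x31 keeps \011 (tab) and '1' separate.
        System.out.println(finds("\\011\\x31some text\\012", "\t1some text\n"));

        // The pattern from the bug report; the result depends on whether
        // the running JDK contains the fix, so no output is claimed here.
        System.out.println(finds("\\011\\Q1some text\\E\\012", "\t1some text\n"));

        // Building the pattern with the hypothetical helper instead.
        String regex = "\\011" + escapeLeadingDigit("1some text") + "\\012";
        System.out.println(regex + " -> " + finds(regex, "\t1some text\n"));
    }
}
```

The hex form is used because \x always consumes exactly two hex digits, so a digit written as \x3N cannot extend a preceding octal escape, and it is not read as a backreference the way a bare \N would be.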