Empty regexp replaceall and surrogate pairs results in corrupted utf16.
Dawid Weiss dawid.weiss at gmail.com
Thu Jun 7 20:07:07 UTC 2012
Hi, I'm a committer to the Apache Lucene project. We have randomized tests and one seed hit the following (simplified) scenario:
String s1 = "AB\uD840\uDC00C";
String s2 = s1.replaceAll("", "X");
The input contains a supplementary Unicode character (any surrogate pair will do). The pattern is an empty string (it was actually randomized as "]|", but the problem is the same, so I omit the details). After applying this pattern, replaceAll inserts X between the two halves of the surrogate pair, which results in invalid UTF-16:
AB𠀀C
XAXBX?X?XCX
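To confirm the corruption programmatically rather than by eye, a check along these lines could be used. It is only a sketch: the names `SurrogateCheck` and `hasUnpairedSurrogate` are made up for illustration, and whether replaceAll actually splits the pair may depend on the JDK version in use.

```java
final class SurrogateCheck {
    /** Returns true if s contains a surrogate code unit that is not part of a valid pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    return true; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no high surrogate before it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        String s2 = s1.replaceAll("", "X");
        System.out.println(hasUnpairedSurrogate(s1)); // the input is well-formed
        System.out.println(hasUnpairedSurrogate(s2)); // depends on the regex implementation
    }
}
```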
I believe this is a bug in the regexp implementation (sorry, I don't have a patch for it), but I'd like to confirm that it's not a known issue. Pointers appreciated.
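For comparison, the result one would expect (X placed only at code-point boundaries) can be produced without the regex engine. The following is an illustration of the intended output under that assumption, not a patch for the regexp implementation; `CodePointInsert` and `insertBetweenCodePoints` are hypothetical names.

```java
final class CodePointInsert {
    /** Inserts sep before the first code point, between code points, and after the last one. */
    static String insertBetweenCodePoints(String s, String sep) {
        StringBuilder sb = new StringBuilder(sep);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // reads a full surrogate pair when present
            sb.appendCodePoint(cp).append(sep);
            i += Character.charCount(cp);       // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        // The surrogate pair stays together in the result.
        System.out.println(insertBetweenCodePoints(s1, "X"));
    }
}
```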
Dawid