Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Xueming Shen xueming.shen at oracle.com
Thu Jun 7 22:46:30 UTC 2012


Personally I don't think it is a bug. A j.l.String represents a sequence of UTF-16 chars. While a pair of surrogates represents a supplementary character, a single surrogate is still a "legal" independent entity inside a String object: the length of a String is still defined as the total number of char units, and an index value between a high surrogate and a low surrogate is still a legal index value that can be used to access the char at that particular position. Using an empty String "" as the regex for replaceAll() takes advantage of the special meaning of "", which is interpreted as matching any possible zero-width position of the target String. It does not imply anything regarding the "character" or "characters" around it, so I would not interpret it as a zero-width character boundary. Therefore a "position" in between a pair of surrogates is still a good "found" for replacing.
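A minimal Java sketch of the points above, for illustration only. The class name SurrogateIndexDemo and the helper replaceEmpty are made up for this example and are not part of any JDK API; the string is the one from the original report.

```java
final class SurrogateIndexDemo {
    // U+20000 written as the surrogate pair D840 DC00, as in the original report.
    static final String S1 = "AB\uD840\uDC00C";

    // Hypothetical helper: the call under discussion, with an empty regex.
    static String replaceEmpty(String s) {
        return s.replaceAll("", "X");
    }

    public static void main(String[] args) {
        // length() counts char units, not characters.
        System.out.println(S1.length());                        // 5
        System.out.println(S1.codePointCount(0, S1.length()));  // 4

        // Index 3 lies between the high and the low surrogate and is still
        // a legal index; charAt(3) returns the low surrogate on its own.
        System.out.println(Character.isLowSurrogate(S1.charAt(3)));  // true

        // The empty regex matches at every char position, so the report's
        // result has an X between the two surrogates as well.
        String s2 = replaceEmpty(S1);
        System.out.println(s2.equals("XAXBX\uD840X\uDC00XCX"));
    }
}
```

The sketch only uses String.length(), String.codePointCount(), String.charAt() and Character.isLowSurrogate(), all of which work on char units or make the char/code point distinction explicit.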

-Sherman

On 6/7/2012 1:07 PM, Dawid Weiss wrote:

Hi, I'm a committer to the Apache Lucene project. We have randomized tests and one seed hit the following (simplified) scenario:

    String s1 = "AB\uD840\uDC00C";
    String s2 = s1.replaceAll("", "X");

The input contains an extended Unicode character (any surrogate pair will do). The pattern is an empty string (in fact, it was randomized as "]|" but it's the same problem so I omit the details). The problem is that after applying this pattern, replaceAll inserts X in between the surrogate pair characters and this results in invalid UTF-16:

    AB𠀀C
    XAXBX?X?XCX

I believe this is a bug in the regexp implementation (sorry, don't have a patch for it) but I'd like to confirm it's not something known. Pointers appreciated.

Dawid
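For comparison, a rough sketch of how one might check for the unpaired surrogates described above, and how the same insertion could be done per code point instead of per char. This is only an illustration under stated assumptions: the class CodePointInsert and both method names are hypothetical, it is not what replaceAll does, and it needs Java 8 or later for String.codePoints().

```java
final class CodePointInsert {
    // Hypothetical helper: true if every surrogate in s is part of a
    // high/low pair, i.e. the char sequence is well-formed UTF-16.
    static boolean isWellFormedUtf16(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed directly by a low one.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high one
            }
        }
        return true;
    }

    // Hypothetical helper: puts sep before, between and after the code
    // points of s, so a surrogate pair is never split.
    static String insertBetweenCodePoints(String s, String sep) {
        StringBuilder sb = new StringBuilder(sep);
        s.codePoints().forEach(cp -> sb.appendCodePoint(cp).append(sep));
        return sb.toString();
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        System.out.println(isWellFormedUtf16(s1));                      // true
        System.out.println(isWellFormedUtf16(s1.replaceAll("", "X")));
        System.out.println(isWellFormedUtf16(insertBetweenCodePoints(s1, "X")));
    }
}
```

With the input from the report, insertBetweenCodePoints(s1, "X") keeps D840 and DC00 adjacent, whereas the replaceAll result reported above has an X between them.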


