Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Ulf Zibis Ulf.Zibis at gmx.de
Fri Jun 8 12:24:51 UTC 2012


Oops, correction:

    StringBuilder sb = new StringBuilder(s1.length() * 2 + 1);
    for (char c : s1.toCharArray())
        sb.append('X').append(c);
    String s2 = sb.append('X').toString();
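For comparison, the char-based loop above can be rewritten to step by code point instead of by char, so that a surrogate pair is never split. This is only a sketch: the class name `CodePointInterleave` and the method `interleave` are made up for illustration and appear nowhere in the thread.

```java
final class CodePointInterleave {
    /** Puts sep before, between, and after every code point of s. */
    static String interleave(String s, char sep) {
        StringBuilder sb = new StringBuilder(s.length() * 2 + 1).append(sep);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // reads a whole surrogate pair if present
            sb.appendCodePoint(cp).append(sep);
            i += Character.charCount(cp);       // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s2 = interleave("AB\uD840\uDC00C", 'X');
        // Print code points in hex so the result is readable on any console.
        s2.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

With this variant the pair D840 DC00 stays adjacent in the result, at the cost of no longer visiting every char index.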

On 08.06.2012 14:16, Ulf Zibis wrote:

I tend to agree, Dawid. The comparison with Python's behaviour is especially telling.

Is there any spec on whether the Java Regex API has a general contract with 16-bit chars or with Unicode code points? Consider the search pattern "[AB\uD840\uDC00C]": what does it actually search for, the isolated occurrence of each char, or the occurrence of the code point "\uD840\uDC00"?

Last but not least, we should think about which would be the common use case, and which would be easier to work around. (I think the current view on isolated chars is easier to work around:

    StringBuilder sb = new StringBuilder(s1.length() + 1).append('X');
    for (char c : s1.toCharArray())
        sb.append(c).append('X');
    String s2 = sb.toString();
)

Additionally, I'd like to discuss "any possible zero-width position of the target String". If the String length is l, it is arguable that position l is not a valid position in the String. From the use case point of view, I think "P e t e r" as the result of "Peter".replaceAll("", " ") is the most useful.

-Ulf
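The question about the character class can be checked empirically. The sketch below assumes that java.util.regex.Pattern reads the pattern string by code point, so that the class contains A, B, U+20000 and C. The class name `SurrogateClassCheck` and the method `matchesWhole` are illustrative only.

```java
import java.util.regex.Pattern;

final class SurrogateClassCheck {
    // The literal holds a surrogate pair (D840 DC00), i.e. code point U+20000.
    static final Pattern CLASS = Pattern.compile("[AB\uD840\uDC00C]");

    /** True if the entire input is matched by the character class. */
    static boolean matchesWhole(String input) {
        return CLASS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("whole pair:     " + matchesWhole("\uD840\uDC00"));
        System.out.println("lone high half: " + matchesWhole("\uD840"));
        System.out.println("lone low half:  " + matchesWhole("\uDC00"));
        System.out.println("plain 'A':      " + matchesWhole("A"));
    }
}
```

If the assumption holds, the pair matches as one unit and the isolated halves do not, which would mean character classes already operate at the code point level even though the empty pattern does not.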

On 08.06.2012 13:14, Dawid Weiss wrote:

I guess a lot depends on the point of view. From a historical point of view (where a char[] and a String are basically unsigned values) that pattern should simply process every value (index) and work like you say. But from a practical point of view I think it is a bug -- it corrupts the string, transforming legal Unicode into invalid values.

I checked with Python (3) and the behavior there is the expected one (it works at the Unicode code point level rather than the surrogate level). Where is the behavior of "" that you mention defined? I admit I couldn't find any reference to this in the documentation:

Using an empty String "" as a regex for the replaceAll() takes advantage of the special meaning of "", in which it is interpreted as it can match any possible zero-width position of the target String

I'm not saying you're wrong (and that pattern is definitely not common so it's probably an academic discussion) but I'd like some concrete reference as to how an empty pattern should behave. To me, consistency with the rest of the Pattern specification would mean that it operates at a "zero-width position between Unicode characters", not between any char[] values, even an incorrect one or one in the middle of a surrogate pair.

Dawid

On Fri, Jun 8, 2012 at 12:46 AM, Xueming Shen <xueming.shen at oracle.com> wrote:

Personally I don't think it is a bug. A j.l.String represents a sequence of UTF-16 chars. While a pair of surrogates represents a supplementary character, a single surrogate itself is still a "legal" independent entity inside a String object, the length of a String is still defined as the total number of char units, and an index value between a high surrogate and a low surrogate is still a legal index value that can be used to access the char at that particular position.

Using an empty String "" as a regex for the replaceAll() takes advantage of the special meaning of "", in which it is interpreted as it can match any possible zero-width position of the target String. It does not imply anything regarding the "character" or "characters" around it, so I would not interpret it as a zero-width character boundary. Therefore a "position" in between a pair of surrogates is still a good "found" for replacing.

-Sherman

On 6/7/2012 1:07 PM, Dawid Weiss wrote:

Hi,

I'm a committer to the Apache Lucene project. We have randomized tests and one seed hit the following (simplified) scenario:

    String s1 = "AB\uD840\uDC00C";
    String s2 = s1.replaceAll("", "X");

The input contains an extended Unicode character (any surrogate pair will do). The pattern is an empty string (in fact, it was randomized as "]|" but it's the same problem so I omit the details). The problem is that after applying this pattern, replaceAll inserts X in between the surrogate pair characters and this results in invalid UTF-16:

    AB𠀀C
    XAXBX?X?XCX

I believe this is a bug in the regexp implementation (sorry, don't have a patch for it) but I'd like to confirm it's not something known. Pointers appreciated.

Dawid
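The scenario in this report can be turned into a small self-checking program. The sketch below is not from the thread: `EmptyPatternRepro` and `hasUnpairedSurrogate` are hypothetical names, and the helper only tests whether a String is well-formed UTF-16. What `main` prints for the replaceAll result may depend on the JDK in use, so no output is claimed here.

```java
final class EmptyPatternRepro {
    /** Returns true if s contains a surrogate that is not part of a high+low pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate not preceded by a high surrogate
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        String s2 = s1.replaceAll("", "X");
        System.out.println("input well-formed:  " + !hasUnpairedSurrogate(s1));
        System.out.println("output well-formed: " + !hasUnpairedSurrogate(s2));
    }
}
```

A check like this is cheap enough to use as an assertion in randomized tests wherever a String passes through a transformation that works on char indices.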



More information about the core-libs-dev mailing list