Empty regexp replaceAll and surrogate pairs results in corrupted UTF-16.

Ulf Zibis Ulf.Zibis at gmx.de
Fri Jun 8 19:07:09 UTC 2012


Thanks Sherman!

On 08.06.2012 20:36, Xueming Shen wrote:

On 06/08/2012 05:16 AM, Ulf Zibis wrote:

Is there any spec on whether the Java Regex API has a general contract with 16-bit chars or with Unicode code points?

The regex spec says Pattern and Matcher work on a character sequence, with reference to the CharSequence interface, but the pattern itself does support Unicode characters via various regex constructs and flags.

In other words, if there is a surrogate pair in the pattern, the CharSequence is seen as a sequence of Unicode code points, right?

    "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?")
        ==> "\uD840\uDC00?\uD840\uDC02"    // only 1 replacement for \uD840\uDC01

    "12\uD840\uDC02".replaceAll("[^0-9]", "?")
        ==> "12??"    // 2 replacements for \uD840\uDC02

    "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")
        ==> "\uD840\uDC00\uD840\uDC01?"    // only 1 replacement for \uD840\uDC02
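To check these cases on a given JDK, something like the following could be used. It is only a sketch: the class name SurrogateRegexDemo and the helper escape are made up for illustration, and the results are printed rather than assumed, since they may differ between JDK versions.

```java
final class SurrogateRegexDemo {

    // Hypothetical helper: renders every UTF-16 code unit outside printable
    // ASCII as a backslash-u escape, so lone or paired surrogates stay visible.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String three = "\uD840\uDC00\uD840\uDC01\uD840\uDC02";

        // The same string has two different lengths, depending on the unit counted.
        System.out.println("chars:       " + three.length());
        System.out.println("code points: " + three.codePointCount(0, three.length()));

        // The three replaceAll calls quoted above; output depends on the JDK in use.
        System.out.println(escape(three.replaceAll("[AB\uD840\uDC01C]", "?")));
        System.out.println(escape("12\uD840\uDC02".replaceAll("[^0-9]", "?")));
        System.out.println(escape(three.replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")));
    }
}
```

Printing the escaped form makes it easy to see whether a replacement consumed a whole surrogate pair or only one half of it.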

An empty String pattern is really a corner case here, as it does not say anything about "characters". So the behaviour should be specified in the javadoc, and I'm with Dawid on implementing it as in Python.
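As a rough illustration of that proposal, assuming "as in Python" means that the empty pattern matches only at code point boundaries and never between the two halves of a surrogate pair, the intended result could be emulated without the regex engine at all. The class CodePointBoundaries and the method insertAtCodePointBoundaries are hypothetical names, not part of any JDK API.

```java
final class CodePointBoundaries {

    // Hypothetical helper: inserts the replacement before every code point and
    // once more at the end, so a surrogate pair is never split.
    static String insertAtCodePointBoundaries(String input, String replacement) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);      // reads a whole pair if one starts at i
            sb.append(replacement).appendCodePoint(cp);
            i += Character.charCount(cp);       // advance by 1 or 2 chars
        }
        return sb.append(replacement).toString();
    }

    public static void main(String[] args) {
        String result = insertAtCodePointBoundaries("a\uD840\uDC00b", "-");
        // The pair stays intact: the result has 4 separators and 3 code points.
        System.out.println(result.codePointCount(0, result.length()));
    }
}
```

An unpaired surrogate is treated as a code point of its own here, which is how String.codePointAt reports it.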

-Ulf



More information about the core-libs-dev mailing list