Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Xueming Shen xueming.shen at oracle.com
Fri Jun 8 22:35:27 UTC 2012


On 06/08/2012 12:07 PM, Ulf Zibis wrote:

Thanks Sherman!

On 08.06.2012 20:36, Xueming Shen wrote:

On 06/08/2012 05:16 AM, Ulf Zibis wrote:

Is there any spec on whether the Java Regex API has a general contract with 16-bit chars or Unicode code points?

The regex spec says Pattern and Matcher work on a character sequence, with reference to the CharSequence interface, but the pattern itself does support Unicode characters via various regex constructs and flags.

In other words, if there is a surrogate pair in the pattern, the CharSequence is seen as a sequence of Unicode code points, right?

Not exactly what I meant. The engine currently works as follows.

If the pattern is to match a "character", or a "slice of characters" that has a supplementary character embedded, the engine will try to interpret the target char sequence as a sequence of Unicode code points.

If the pattern is not to match a "character", or is to match a slice of characters that does not have a supplementary character embedded, the engine will try to interpret the char sequence as a sequence of char units.

For example

    Matcher m = Pattern.compile("[^a]").matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02");
    while (m.find()) {
        System.out.printf("<%d, %d>%n", m.start(), m.end());
    }

The output is

    <0, 2>
    <2, 4>
    <4, 6>

The target string is iterated code point by code point, but

    Matcher m = Pattern.compile("(?=[^a])").matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02");
    while (m.find()) {
        System.out.printf("<%d, %d>%n", m.start(), m.end());
    }

The output is

    <0, 0>
    <1, 1>
    <2, 2>
    <3, 3>
    <4, 4>
    <5, 5>

Here the target string is iterated char unit by char unit, and the empty string pattern belongs to this latter case.
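For anyone who wants to reproduce the comparison, here is a minimal self-contained sketch. The class name SpanDemo and the spans helper are made up for illustration and are not part of any JDK API. It runs the two patterns above plus the empty pattern over the same target. The spans it prints may depend on the JDK version, so none are hard-coded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
final class SpanDemo {

    // Collects every match of regex in input as a "<start, end>" string,
    // using the same find() loop as the examples in this message.
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(String.format("<%d, %d>", m.start(), m.end()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Three supplementary characters, i.e. six char units.
        String target = "\uD840\uDC00\uD840\uDC01\uD840\uDC02";
        System.out.println("[^a]     : " + spans("[^a]", target));
        System.out.println("(?=[^a]) : " + spans("(?=[^a])", target));
        System.out.println("(empty)  : " + spans("", target));
    }
}
```

Any span that starts or ends at an odd index in this target sits in the middle of a surrogate pair.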

No, I'm not saying that because the implementation works this way, this is therefore not a bug :-) Actually, I'm starting to agree that we might not want to stop in the middle of a surrogate pair, even in the non-character case. But it might have some performance impact somewhere (if you iterate the CharSequence by code point).
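One way to picture "not stopping in the middle of a pair of surrogates" is a filter applied outside the engine. This is only a sketch under that assumption: BoundaryFilter and its methods are hypothetical names, and nothing here reflects how Pattern or Matcher are actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
final class BoundaryFilter {

    // True if index lies between a high surrogate and the low surrogate
    // that immediately follows it.
    static boolean splitsSurrogatePair(CharSequence s, int index) {
        return index > 0 && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index));
    }

    // Start offsets of all matches, dropping empty matches that would
    // land in the middle of a surrogate pair.
    static List<Integer> matchStartsOnBoundaries(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            boolean empty = m.start() == m.end();
            if (empty && splitsSurrogatePair(input, m.start())) {
                continue; // skip the position inside a pair
            }
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String target = "\uD840\uDC00\uD840\uDC01\uD840\uDC02";
        // Only offsets on code point boundaries survive the filter.
        System.out.println(matchStartsOnBoundaries("", target));
    }
}
```

The check reads up to two extra chars per empty match, which gives a rough feel for where a performance cost could come from if something similar were done inside the engine.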

-Sherman

    "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?")
        ==> "\uD840\uDC00?\uD840\uDC02"    // only 1 replacement for \uD840\uDC01

    "12\uD840\uDC02".replaceAll("[^0-9]", "?")
        ==> "12??"    // 2 replacements for \uD840\uDC02

    "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")
        ==> "\uD840\uDC00\uD840\uDC01?"    // only 1 replacement for \uD840\uDC02

An empty String pattern is really a corner case here; it does not say anything about "character". So it should be specified in the javadoc, and I agree with Dawid that it should be implemented as in Python.

-Ulf
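To make the proposal concrete, here is a rough sketch of what replacing the empty pattern could produce if empty matches were only allowed on code point boundaries. That reading of "as in Python" is an assumption on my part, not something spelled out in the thread. CodePointInsert and both method names are hypothetical, and the replacement is treated literally, unlike replaceAll, which interprets `$` and `\`.

```java
// Hypothetical helper class, named only for this illustration.
final class CodePointInsert {

    // Puts replacement before the first code point, between code points,
    // and after the last one, never inside a surrogate pair.
    static String insertAtCodePointBoundaries(String input, String replacement) {
        StringBuilder sb = new StringBuilder();
        sb.append(replacement);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            sb.appendCodePoint(cp).append(replacement);
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return sb.toString();
    }

    // True if s contains a surrogate that is not part of a well-formed
    // high/low pair, i.e. the "corrupted UTF-16" from the subject line.
    static boolean hasUnpairedSurrogate(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // well-formed pair, skip the low half
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String target = "\uD840\uDC00\uD840\uDC01";
        String viaHelper = insertAtCodePointBoundaries(target, "-");
        String viaRegex = target.replaceAll("", "-");
        System.out.println("helper corrupts: " + hasUnpairedSurrogate(viaHelper));
        System.out.println("regex corrupts:  " + hasUnpairedSurrogate(viaRegex));
    }
}
```

The second line printed by main depends on how the JDK in use treats the empty pattern, so it is a quick way to check a given runtime.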


More information about the core-libs-dev mailing list