Empty regexp replaceall and surrogate pairs results in corrupted utf16.
Ulf Zibis Ulf.Zibis at gmx.de
Fri Jun 8 19:07:09 UTC 2012
Thanks Sherman!
On 08.06.2012 20:36, Xueming Shen wrote:
> On 06/08/2012 05:16 AM, Ulf Zibis wrote:
>> Is there any spec on whether the Java Regex API has a general contract with 16-bit chars or Unicode code points?
>
> The regex spec says Pattern and Matcher work on a character sequence, with reference to the CharSequence interface, but the pattern itself does support Unicode characters via various regex constructs and flags.

In other words, if there is a surrogate pair in the pattern, the CharSequence is seen as a sequence of Unicode code points, right?

    "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?")
        ==> "\uD840\uDC00?\uD840\uDC02"    // only 1 replacement for \uD840\uDC01

    "12\uD840\uDC02".replaceAll("[^0-9]", "?")
        ==> "12??"    // 2 replacements for \uD840\uDC02

    "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")
        ==> "\uD840\uDC00\uD840\uDC01?"    // only 1 replacement for \uD840\uDC02
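To make the three replaceAll examples above easy to try, here is a minimal sketch that runs them and prints each result as escaped UTF-16 code units. The class name SurrogatePatternDemo and its helper methods are made up for illustration, and the third pattern is written with a backslash before uD840, which is assumed to be the intent. The output for the "[^0-9]" case is deliberately not stated here, since it may depend on the JDK in use; run it to see what your runtime does.

```java
public class SurrogatePatternDemo {
    // U+20000, U+20001 and U+20002, each written as a UTF-16 surrogate pair.
    static final String INPUT = "\uD840\uDC00\uD840\uDC01\uD840\uDC02";

    // Renders every non-ASCII UTF-16 code unit as a four-digit hex escape,
    // so lone surrogates stay visible on any console.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    // Character class that contains a supplementary character (U+20001).
    static String classWithPair() {
        return INPUT.replaceAll("[AB\uD840\uDC01C]", "?");
    }

    // Negated class with no supplementary character in the pattern.
    static String negatedDigits() {
        return "12\uD840\uDC02".replaceAll("[^0-9]", "?");
    }

    // Negated range whose endpoints are supplementary characters.
    static String negatedRange() {
        return INPUT.replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?");
    }

    public static void main(String[] args) {
        System.out.println("class with pair: " + escape(classWithPair()));
        System.out.println("negated digits:  " + escape(negatedDigits()));
        System.out.println("negated range:   " + escape(negatedRange()));
    }
}
```

Counting the "?" characters in each printed line shows whether a surrogate pair was treated as one code point or as two 16-bit chars.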
An empty String pattern is really a corner case here: it says nothing about "characters". So the behavior should be specified in the Javadoc, and I agree with Dawid that it should be implemented as in Python.
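As a rough illustration of that proposal, the sketch below inserts a replacement only at code point boundaries, assuming that "as in Python" means an empty pattern should never match between the two halves of a surrogate pair. The class CodePointBoundaryInsert and both of its methods are hypothetical helpers, not part of java.util.regex, and this is not how Matcher is implemented.

```java
public class CodePointBoundaryInsert {
    // Hypothetical helper: emulates replaceAll("", replacement), but only
    // places the replacement before, between and after whole code points.
    static String insertAtCodePointBoundaries(String input, String replacement) {
        StringBuilder sb = new StringBuilder(replacement);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            sb.appendCodePoint(cp).append(replacement);
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return sb.toString();
    }

    // Returns false if the string contains a surrogate that is not part of
    // a high/low pair, i.e. the UTF-16 is corrupted.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String input = "a\uD840\uDC00b";
        String builtIn = input.replaceAll("", "-");
        String sketch = insertAtCodePointBoundaries(input, "-");
        System.out.println("built-in well-formed: " + isWellFormedUtf16(builtIn));
        System.out.println("sketch well-formed:   " + isWellFormedUtf16(sketch));
    }
}
```

Comparing the two printed lines on a given JDK shows whether the built-in empty-pattern replacement splits the surrogate pair there.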
-Ulf