JNI UTF-8 encoding bug with some characters
Ariel Weisberg ariel at weisberg.ws
Tue Jun 5 18:12:29 UTC 2012
Hi,
Thanks, I will do the conversion in Java then.
Ariel
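A minimal sketch of what "doing the conversion in Java" could look like: encode the string with the standard UTF-8 charset and hand the resulting byte[] to native code, instead of calling GetStringUTFChars on the jstring. The class and method names here (Utf8ForNative, toStandardUtf8, hex) are hypothetical, and StandardCharsets assumes JDK 7 or later; on JDK 6, getBytes("UTF-8") would be the equivalent call.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ForNative {
    // Encode with the standard UTF-8 charset, so a supplementary
    // character becomes a single 4-byte sequence.
    static byte[] toStandardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Lowercase hex dump, in the same style as the values quoted in this thread.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The test string from the report; U+1F032 is written as a surrogate pair.
        String s = "\u00e2\uD83C\uDC32x\u4e00xx\u00e9yy\u0531";
        System.out.println(hex(toStandardUtf8(s)));
        // prints c3a2f09f80b278e4b8807878c3a97979d4b1
    }
}
```

On the native side, the bytes can then be read with GetByteArrayRegion or GetByteArrayElements on the jbyteArray argument.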
On Tue, Jun 5, 2012, at 10:49 AM, Xueming Shen wrote:
Hi Ariel,
The Java UTF-8 charset (sun.nio.cs.UTF8) was updated back in JDK 7 to follow Unicode Corrigendum [1] (CR#4486841) and was further updated in JDK 8 (#7096080) to fully conform to the Standard. As a result, the Java UTF-8 charset now encodes and decodes a supplementary character only as a 4-byte UTF-8 sequence.

However, we did not do the same for the VM's jni-utf-8 implementation, which still encodes/decodes a supplementary character as 6 bytes (a pair of surrogates, 3 bytes each). This was the decision we made back then, on the assumption that jni-utf-8 is mainly for "internal" information exchange (you are not supposed to use the result to exchange information with an "external" system); as long as it provides a round-trip conversion, it should not be an issue.

The character you are using here is a supplementary character, which is why you are seeing the difference.

-Sherman

[1] http://www.unicode.org/versions/corrigendum1.html

On 06/05/2012 09:06 AM, Ariel Weisberg wrote:
> Hi,
>
> Here is a link to an updated test case that simplifies the string being
> tested to just the problem character, and fixes a bug in determining the
> length of the array returned by GetStringUTFChars.
>
> https://s3.amazonaws.com/com.voltdb.aweisberg/utf8encodingbug2.tgz
>
> Thanks,
> Ariel
>
> On Tue, Jun 5, 2012, at 11:38 AM, Ariel Weisberg wrote:
>> Hi all,
>>
>> Not sure what list this should go to.
>>
>> I found an issue with JNI's GetStringUTFChars which is supposed to
>> return a Java string in UTF-8 encoding. There is an attached test case.
>> I tested on Ubuntu 12.04 (Linux aweisberg-desktop 2.6.32-41-generic
>> #89-Ubuntu SMP Fri Apr 27 22:18:56 UTC 2012 x86_64 GNU/Linux) and CentOS
>> 5 (Linux volt3b 2.6.18-308.4.1.el5 #1 SMP Tue Apr 17 17:08:00 EDT 2012
>> x86_64 x86_64 x86_64 GNU/Linux) with JDK 6 update 32 and JDK 7 update 4.
>>
>> For the following string "â🀲x一xxéyyԱ" I find that the first character is
>> encoded correctly, but the second character
>> (http://www.fileformat.info/info/unicode/char/1f032/index.htm) comes out
>> with an invalid code point.
>>
>> The result of String.getBytes("UTF-8") is
>> c3a2f09f80b278e4b8807878c3a97979d4b1 and this matches the output I get
>> from defining the string as a constant in C++.
>>
>> The result of GetStringUTFChars is c3a2eda0bcedb0b278e4b8.
>>
>> See this test case
>> (https://s3.amazonaws.com/com.voltdb.aweisberg/utf8encodingbug.tgz)
>> for a reproducer and how I displayed the values.
>>
>> Thanks,
>> Ariel
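The 4-byte versus 6-byte difference described in this thread can be reproduced without any native code. The sketch below is an illustration only: it uses DataOutputStream.writeUTF, which is documented to write modified UTF-8, as a stand-in for GetStringUTFChars, on the assumption that both produce the same bytes for this input. The class and method names (ModifiedUtf8Demo, modifiedUtf8, hex) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Demo {
    // Encode with modified UTF-8: each UTF-16 surrogate is written
    // separately as 3 bytes, so a supplementary character takes 6 bytes.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            // writeUTF prepends a 2-byte length; drop it to keep only the payload.
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lowercase hex dump, in the same style as the values quoted in this thread.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String domino = "\uD83C\uDC32"; // U+1F032 as a surrogate pair
        System.out.println(hex(domino.getBytes(StandardCharsets.UTF_8))); // prints f09f80b2
        System.out.println(hex(modifiedUtf8(domino)));                    // prints eda0bcedb0b2
    }
}
```

The second line of output is the eda0bc edb0b2 pair visible in the reported GetStringUTFChars result: the surrogates U+D83C and U+DC32, each encoded as its own 3-byte sequence.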