Serialize UTF-8 string with Unicode escapes by pkierski · Pull Request #687 · open-source-parsers/jsoncpp
Conversation
Handling Unicode characters, assuming strings are UTF-8
Thanks. Looks efficient. I hope you don't mind that I squashed.
I guess we should have tests for Unicode, but this is an excellent start.
Hi, I'm not happy with the fact that you assume that internal strings are UTF-8. We use JsonCpp in an old Windows application that uses MBCS strings (specifically, encoded with ISO-8859-1). This works well with JsonCpp versions up to 1.8.3, but this change breaks our JSON output (yielding gibberish where there were non-ASCII characters).
We do get UTF-8 output with the released version 1.8.3 even though we use ISO-8859-1 internally. Why does the API now have to require UTF-8 strings internally?
Hmmm. That's interesting.
I don't know how to avoid assuming a string encoding. I guess we could keep a vector of Unicode characters instead of strings, but you'd still have to decode your MBCS strings before handing them to jsoncpp. I guess your data worked because we failed to translate those strings into Unicode JSON escapes, yes?
What do you propose?
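To make the "decode your MBCS strings before handing them to jsoncpp" suggestion concrete, here is a minimal sketch of an ISO-8859-1 (Latin-1) to UTF-8 conversion. The function name is hypothetical, not part of jsoncpp; it works because ISO-8859-1 code points map one-to-one onto Unicode U+0000..U+00FF.

```cpp
#include <string>

// Convert an ISO-8859-1 (Latin-1) encoded string to UTF-8.
// Each input byte becomes one UTF-8 byte (< 0x80) or two (>= 0x80).
std::string latin1ToUtf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size());
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            utf8 += static_cast<char>(c);
        } else {
            // Two-byte UTF-8 sequence: 110xxxxx 10xxxxxx
            utf8 += static_cast<char>(0xC0 | (c >> 6));
            utf8 += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return utf8;
}
```

With a helper like this, a legacy application could keep its internal encoding and convert at the jsoncpp boundary, e.g. `root["key"] = latin1ToUtf8(mbcsString);`.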
I think UTF-8 is the best choice for any C/C++ JSON library (see "String and Character Issues" in RFC 7159). Since std::string carries no information about its encoding, we can assume UTF-8 for best interoperability.
Alternatively, it's possible to add a kind of switch that serializes and parses strings as "byte arrays", without any conversion, just for compatibility with code that makes such an assumption.
I agree with @pkierski. @Arminius, would that work for you? We're suggesting a setting which would turn off Unicode escapes completely. Your strings would simply end up in the JSON file. Is that ok?
As far as I understand it now, it's basically a miracle that I ever got correct UTF-8 output when using MBCS strings, and even then it only worked because I never used strings containing characters that need to be escaped. I don't think it's a good idea to add such a switch - it would essentially revert to incorrect behaviour.
As for the legacy application I mentioned, I suppose we'll continue using JsonCpp 1.8.3 until we can sort out the necessary conversions ourselves.
However, what you really should do if you're going to assume UTF-8 strings is document that fact. It should be clear to anyone using JsonCpp that strings that are assigned to a Json::Value must be encoded as UTF-8.
Is this necessary? Shouldn't the serialized result be UTF-8? For example:
#include <iostream>
#include <string>
#include <json/json.h>

Json::Value root;
root["name"] = "你的名字"; // Chinese text, UTF-8 encoded
Json::FastWriter fwriter;
std::string retStr = fwriter.write(root);
std::cout << retStr;
Before this commit, the result was a UTF-8 string like this:
{"name":"你的名字"}
but now it turns out to be an escaped string like this:
{"name":"\u4f60\u7684\u540d\u5b57"}
I really want the former serialized result. @cdunn2001 @pkierski
A JSON document is a valid UTF-8 document without explicit escaping, as long as you pass valid UTF-8 encoded strings into the appropriate Set() method. That's not true if someone uses e.g. a one-byte, non-UTF-8 encoding like CP-1250. I know that in your case escaping is a waste of space (6 or 12 bytes instead of 3 or 4). So my proposal is to add a kind of flag: "check UTF-8 and use escape sequences for strings" vs. "pass string contents as is".
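The size arithmetic above can be illustrated with two concrete characters (the byte strings below are the raw UTF-8 encodings, written out as hex escapes):

```cpp
#include <string>

// U+4F60 ("你", inside the Basic Multilingual Plane):
// raw UTF-8 takes 3 bytes, the JSON escape "\u4f60" takes 6.
const std::string rawBmp("\xE4\xBD\xA0");
const std::string escBmp("\\u4f60");

// U+1F600 (an emoji, outside the BMP): raw UTF-8 takes 4 bytes, and the
// escaped form needs a UTF-16 surrogate pair, "\ud83d\ude00", i.e. 12 bytes.
const std::string rawSupp("\xF0\x9F\x98\x80");
const std::string escSupp("\\ud83d\\ude00");
```

So escaping roughly doubles or triples the storage for non-ASCII text, which is exactly the 6-or-12 versus 3-or-4 comparison made above.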
I disagree here, JSON is supposed to be a human-readable format. Escaping characters that don't need to be escaped is erroneous behaviour in that case. You shouldn't have to pass an extra flag to make sure your JSON output doesn't become an unreadable mess of escape sequences when you use non-Latin characters.
I agree with @Arminius .
I would stay with JsonCpp 1.8.3.
@Arminius, I agree. JSON does not require Unicode to be escaped. However, the binary representation of an unescaped Unicode document requires an encoding. This is not a question of the internal representation, but of the external.
Anyway, it seems that we've agreed on a solution, right? The behavior of escaping non-ASCII characters should be configurable. Any volunteers?
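A minimal sketch of what such a configurable escaper could look like, independent of the jsoncpp codebase (all names here are hypothetical, and this sketch deliberately ignores the mandatory JSON escapes for quotes, backslashes, and control characters; for what it's worth, later jsoncpp releases added a StreamWriterBuilder setting along these lines named "emitUTF8"):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Decode one UTF-8 sequence starting at s[i]; advances i past it and
// returns the code point. Assumes valid UTF-8 input (no error handling).
static std::uint32_t decodeUtf8(const std::string& s, std::size_t& i) {
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    std::uint32_t cp = c & (0x3F >> extra);  // bits from the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Hypothetical toggle: emitUTF8 == true passes multi-byte sequences
// through unchanged; false replaces them with \uXXXX escapes
// (using a UTF-16 surrogate pair for code points above U+FFFF).
std::string serializeString(const std::string& s, bool emitUTF8) {
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t start = i;
        std::uint32_t cp = decodeUtf8(s, i);
        if (cp < 0x80 || emitUTF8) {
            out.append(s, start, i - start);  // copy the original bytes
        } else {
            char buf[16];
            if (cp <= 0xFFFF) {
                std::snprintf(buf, sizeof buf, "\\u%04x",
                              static_cast<unsigned>(cp));
            } else {
                cp -= 0x10000;
                std::snprintf(buf, sizeof buf, "\\u%04x\\u%04x",
                              static_cast<unsigned>(0xD800 + (cp >> 10)),
                              static_cast<unsigned>(0xDC00 + (cp & 0x3FF)));
            }
            out += buf;
        }
    }
    return out;
}
```

With a flag like this, both camps are served: `serializeString(s, true)` keeps the compact, human-readable UTF-8 that @lazywhite wants, while `serializeString(s, false)` produces the 7-bit-clean escaped form this pull request introduced.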
nyh mentioned this pull request