msg118376 - (view) |
Author: Peter Gyorko (gyorkop) |
Date: 2010-10-11 16:16 |
If I add a string containing non-printable characters to the response, the output cannot be parsed by most XML parsers (I tried with XML-RPC for PHP). Here is my quick and dirty fix:

--- a/Lib/xmlrpclib.py
+++ b/Lib/xmlrpclib.py
@@ -165,9 +165,18 @@ def _decode(data, encoding, is8bit=re.compile("[\x80-\xff]").search):
     return data
 
 def escape(s, replace=string.replace):
-    s = replace(s, "&", "&amp;")
-    s = replace(s, "<", "&lt;")
-    return replace(s, ">", "&gt;",)
+    res = ''
+    for char in s:
+        char_code = ord(char)
+        if (char_code < 32 and char_code not in (9, 10, 13)) or char_code > 126:
+            res += '\\x%02x' % ord(char)
+        else:
+            res += char
+
+    res = replace(res, "&", "&amp;")
+    res = replace(res, "<", "&lt;")
+    res = replace(res, ">", "&gt;")
+    return res
 
 if unicode:
     def _stringify(string): |
|
|
msg118434 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-10-12 16:52 |
Thanks for the report. Unfortunately, 2.6 only gets security fixes, not general bug fixes; can you tell if this applies to 2.7, 3.1 and 3.2? If you have a small script that displays the problem, please attach it. |
|
|
msg118439 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2010-10-12 17:17 |
The patch is incorrect. Even though this may let these characters get "through", the other end will have no clue that \x is meant as an escape. Please face the ugly truth: XML (and hence XML-RPC) just does not support these characters (see http://www.w3.org/TR/REC-xml/#NT-Char). This is fixed slightly in XML 1.1 (which allows referring to these characters by character reference); however, XML 1.1 is not used for XML-RPC, so this is not an option. I'm closing this as "won't fix". If there is interest in doing something about it, I could accept a patch that rejects non-Char characters with an exception, instead of sending ill-formed XML. I'd recommend using a regular expression for that. |
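A minimal sketch of the "reject with an exception" idea, written for Python 3 rather than the Python 2 `xmlrpclib` discussed here. The pattern is transcribed from the Char production in the XML 1.0 specification; the names `_NON_XML_CHAR` and `check_xml_chars` are hypothetical and not part of any patch in this thread.

```python
import re

# Anything NOT matched by the XML 1.0 Char production:
#   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# (hypothetical name, not taken from xmlrpclib)
_NON_XML_CHAR = re.compile(
    r"[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def check_xml_chars(text):
    """Return text unchanged, or raise ValueError at the first non-Char character."""
    match = _NON_XML_CHAR.search(text)
    if match is not None:
        raise ValueError(
            "character U+%04X at index %d is not allowed in XML 1.0"
            % (ord(match.group()), match.start())
        )
    return text
```

The check runs on text (Unicode code points), not on encoded bytes, so it would have to be applied before any encoding step. `ValueError` is only a placeholder; which exception type the real module should raise is left open here.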
|
|
msg118522 - (view) |
Author: Peter Gyorko (gyorkop) |
Date: 2010-10-13 14:16 |
The shortest code which can trigger this error is the following:

>>> import xmlrpclib
>>> print xmlrpclib.dumps(('\x01',))

As you can see, the escape method does not care about non-printable characters, which can cause a parsing error on the other side. My previous patch used \x to tell the other side that the value contains some binary garbage. If you want to reject these binary bytes (which was not acceptable in my case), use this patch:

--- a/xmlrpclib.py 2010-10-13 14:45:02.000000000 +0200
+++ b/xmlrpclib.py 2010-10-13 16:03:14.000000000 +0200
@@ -165,6 +165,9 @@
     return data
 
 def escape(s, replace=string.replace):
+    if (None != re.search('[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\xff]', s)):
+        raise Fault(INVALID_ENCODING_CHAR, 'Non-printable character in string')
+
     s = replace(s, "&", "&amp;")
     s = replace(s, "<", "&lt;")
     return replace(s, ">", "&gt;",)

Another idea: we may use CDATA (http://www.w3schools.com/xml/xml_cdata.asp) to transfer binary values... |
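To see the receiving side of the problem, the following Python 3 sketch feeds a hand-written fragment to the standard library's expat-based parser. The fragment only imitates the `<string>` value from the report; it is not produced by `dumps`, whose output may differ between versions. The helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the expat-based parser accepts xml_text (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Same shape as an XML-RPC string value, once with ordinary text and
# once with a raw U+0001 control character inside it.
good = "<value><string>ok</string></value>"
bad = "<value><string>\x01</string></value>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second fragment is rejected because U+0001 is outside the Char production, which is the same kind of failure the PHP client reported.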
|
|
msg118558 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2010-10-13 18:32 |
No, CDATA is not an appropriate mechanism to encapsulate bytes in XML. The data in the CDATA section must still match the Char production, and it must still follow the encoding. |
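A short Python 3 sketch of this point: a control character inside a CDATA section is rejected just like one in ordinary character data. For comparison, it also shows the base64 value type that XML-RPC defines for binary data, exposed in the standard library as `xmlrpc.client.Binary`. Nobody in this thread proposed that route, so treat it as an aside; the helper name `parses` is hypothetical.

```python
import xml.etree.ElementTree as ET
import xmlrpc.client


def parses(xml_text):
    """Return True if the expat-based parser accepts xml_text (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# CDATA only changes how markup characters are treated; the content
# must still consist of valid XML characters.
cdata_plain = parses("<string><![CDATA[a < b]]></string>")
cdata_ctrl = parses("<string><![CDATA[\x01]]></string>")

# Arbitrary bytes wrapped in Binary are sent as a <base64> value,
# so the payload itself contains only ordinary text.
raw = b"\x00\x01\xff"
payload = xmlrpc.client.dumps((xmlrpc.client.Binary(raw),))
params, method_name = xmlrpc.client.loads(payload)
round_tripped = params[0].data
```

Here `cdata_plain` is true and `cdata_ctrl` is false, while `round_tripped` equals the original bytes. The trade-off is that both ends have to agree that the value is binary rather than a string.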
|
|
msg118561 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-10-13 18:41 |
Thanks for the new patch. I suggest one style change before committing:

    if re.search(, s) is not None:

Agree that CDATA is not appropriate. Martin: I’m reopening the bug in reaction to your message “I could accept a patch that rejects non-Char characters with an exception”, hope it’s fine. |
|
|
msg118564 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2010-10-13 19:06 |
Éric, I think the patch needs some rework. First, it is incorrect/incomplete: please see the Char definition for a complete list of characters that must be excluded. This then raises a Unicode vs. bytes issue: invalid Unicode characters must be rejected before the string is actually encoded (since applying the regex to the encoded string is not practical). The other side of the bytes vs. string issue is that the bytes really ought to be in self.encoding, which doesn't get checked, either. And then, it lacks tests. |
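The review points can be illustrated with a Python 3 sketch, under the assumption that the payload encoding is UTF-8. `BYTE_PATTERN` copies the byte-level pattern from the patch under review; `is_xml_char` and `to_checked_text` are hypothetical names. The idea is to decode bytes in the declared encoding first, then check code points, instead of matching on encoded bytes.

```python
import re

# Byte-level pattern copied from the patch under review.
BYTE_PATTERN = re.compile(rb"[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\xff]")

# U+00E9 is a valid XML character and U+FFFE is not, yet both encode to
# bytes >= 0x80 in UTF-8, so the byte-level pattern flags both alike.
flags_valid = BYTE_PATTERN.search("\u00e9".encode("utf-8")) is not None
flags_invalid = BYTE_PATTERN.search("\ufffe".encode("utf-8")) is not None


def is_xml_char(ch):
    """True if ch is in the XML 1.0 Char production (hypothetical helper)."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def to_checked_text(value, encoding="utf-8"):
    """Decode bytes in the declared encoding, then reject non-Char code points."""
    if isinstance(value, bytes):
        # Raises UnicodeDecodeError if the bytes are not valid in `encoding`.
        value = value.decode(encoding)
    for index, ch in enumerate(value):
        if not is_xml_char(ch):
            raise ValueError(
                "character U+%04X at index %d is not allowed in XML 1.0"
                % (ord(ch), index)
            )
    return value
```

With this ordering, `to_checked_text(b"\xc3\xa9")` returns the decoded text, `to_checked_text(b"\xff")` fails at the decoding step, and `to_checked_text("\ufffe")` fails at the character check. A real patch would also need the tests the review asks for.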
|
|
msg118716 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-10-14 21:03 |
My bad, I read too fast. Peter: Can you rework your patch to address Martin’s review? |
|
|