Issue 11804: expat parser not xml 1.1 (breaks xmlrpclib)

Created on 2011-04-08 09:33 by xrg, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
expat-test.py	xrg,2011-04-08 09:33	Test of expat compliance to xml 1.1

Messages (12)
msg133301 - (view)	Author: Panos Christeas (xrg)	Date: 2011-04-08 09:33
The expat library (at the C level) is not XML 1.1 compliant, meaning that it won't accept the characters \x01-\x08, \x0b, \x0c and \x0e-\x1f. At the same time, ElementTree (or custom XML creation, such as in xmlrpclib.py:694) allows these characters to pass through. They will get blocked on the receiving side. Since 2.7, the expat library is the default parser for xml-rpc, so this is a regression, IMHO. According to the network principle, we should accept these characters gracefully. The attached test script demonstrates that we're not XML 1.1 compliant (but instead enforce the stricter 1.0 rule). References: http://bugs.python.org/issue5166 http://en.wikipedia.org/wiki/Valid_characters_in_XML
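The mismatch described above can be reproduced without xmlrpclib. The following is only a minimal sketch, not the attached expat-test.py: it assumes Python 3 and uses ElementTree for both directions, and the helper names `serialize` and `roundtrips` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def serialize(text):
    """Build a one-element document; ElementTree does not check the
    text for characters that XML 1.0 forbids."""
    elem = ET.Element("value")
    elem.text = text
    return ET.tostring(elem)


def roundtrips(text):
    """Return True if the serialized text can be parsed back unchanged
    by the expat-backed ElementTree parser."""
    try:
        return ET.fromstring(serialize(text)).text == text
    except ET.ParseError:
        # expat refuses the document ("not well-formed (invalid token)")
        return False


if __name__ == "__main__":
    for sample in ("plain text", "tab\tinside", "control\x01char"):
        print(repr(sample), roundtrips(sample))
```

The writer side emits the \x01 byte as-is, and the reader side rejects it, which is the asymmetry this report is about.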
msg161341 - (view)	Author: Phil Daintree (Phil.Daintree)	Date: 2012-05-22 09:03
Another example: the following xml, returned and displayed from verbose mode:
0001 001 002 100 121213 123456 291 321654 580 ABS ACTIVE AIRCON ALIEJA AMP ASSETS BAKE BRACE BYC CARRO CARTON CO COMPS CULOIL DECOR DVD E FOOD HDD INF LAB LINER LL MCNBI MEDS MODEL1 NEM PEÃ\x87AS PENS PHONE PLANT PRJCTR PROD SERV SOCKS SS SW TACON TEST12 VEGTAB ZFR
will not parse, failing with the error:
  File "/usr/lib/python2.7/xmlrpclib.py", line 557, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 43, column 23
The following unicode characters on that line are the trouble: PEÃ\x87AS
msg161342 - (view)	Author: Phil Daintree (Phil.Daintree)	Date: 2012-05-22 09:05
The xml parses happily at http://www.w3schools.com/xml/xml_validator.asp
msg161346 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2012-05-22 12:09
In the sample above, is "\x87" one character, or 4 ASCII characters?
msg161491 - (view)	Author: Phil Daintree (Phil.Daintree)	Date: 2012-05-24 09:03
The field in question contains the utf-8 text: PEÇAS
msg161503 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2012-05-24 12:50
Yes, but where does this data come from? How did you feed it to the parser? And this does not relate to XML 1.1. BTW, I found this page about XML 1.1: http://www.cafeconleche.org/books/effectivexml/chapters/03.html
"""
Everything you need to know about XML 1.1 can be summed up in two rules:
- Don't use it.
- (For experts only) If you speak Mongolian, Yi, Cambodian, Amharic, Dhivehi, Burmese or a very few other languages and you want to write your markup (not your text but your markup) in these languages, then you can set the version attribute of the XML declaration to 1.1. Otherwise, refer to rule 1.
"""
msg161520 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2012-05-24 15:58
This has nothing to do with XML 1.1 (so closing this report as "won't fix"). The UTF-8 text that you present works very well:
>>> import xml.parsers.expat
>>> p=xml.parsers.expat.ParserCreate(encoding="utf-8")
>>> p.Parse("<x>\xc3\x87</x>", 1)
1
The character LATIN CAPITAL LETTER C WITH CEDILLA is definitely supported in XML 1.0, so there is no need for XML 1.1 here. If this still fails to parse for you, it may be because the input is actually different, e.g.
>>> p=xml.parsers.expat.ParserCreate(encoding="utf-8")
>>> p.Parse("<x>&#195;\x87</x>", 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
I.e. the input might contain the characters &, #, 1, 9, 5, ;, and \x87. That is ill-formed UTF-8, and the parser is right to choke on it. Even if it were declared as XML 1.1, it would still be ill-formed, because it would still be invalid UTF-8.
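For readers on Python 3, where the parser is fed bytes, the two cases above can be compared side by side. This is a sketch under that assumption; `parse_error` is a hypothetical helper name, and the element name `x` is taken from the session above.

```python
import xml.parsers.expat


def parse_error(data):
    """Parse a complete document given as bytes.

    Returns None on success, or a (line, column) tuple taken from the
    ExpatError when the input is rejected.
    """
    parser = xml.parsers.expat.ParserCreate(encoding="utf-8")
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as exc:
        return (exc.lineno, exc.offset)
    return None


if __name__ == "__main__":
    # Well-formed UTF-8 for U+00C7 (two bytes: C3 87).
    print(parse_error(b"<x>\xc3\x87</x>"))
    # A character reference followed by a stray continuation byte.
    print(parse_error(b"<x>&#195;\x87</x>"))
```

The first call should report no error. The second should report a position, which the session above gives as line 1, column 9, that is, the lone \x87 byte.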
msg161521 - (view)	Author: Panos Christeas (xrg)	Date: 2012-05-24 16:07
I'm reopening the bug, as your last comment does not cover the initial report. We are not talking about invalid UTF8 here, but legal low-ASCII values.
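To check whether a payload contains the low-ASCII values in question before it is sent, something like the following could be used. It is only an illustrative sketch: the character ranges are the ones listed in the original report, and `find_control_chars` is a hypothetical helper, not part of xmlrpclib or expat.

```python
import re

# Ranges from the original report: \x01-\x08, \x0b, \x0c and \x0e-\x1f.
# Tab (\x09), line feed (\x0a) and carriage return (\x0d) are allowed
# in XML 1.0 and are therefore not matched.
_LOW_CONTROL = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f]")


def find_control_chars(text):
    """Return (index, hex code) pairs for every character in the ranges above."""
    return [(m.start(), hex(ord(m.group()))) for m in _LOW_CONTROL.finditer(text)]


if __name__ == "__main__":
    print(find_control_chars("a\x01b\tc\x0bd"))
```

An empty result means the string contains none of the characters that the expat parser rejects under the XML 1.0 rule.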
msg161697 - (view)	Author: Phil Daintree (Phil.Daintree)	Date: 2012-05-27 07:09
Well, maybe this should be a different bug, as it is clearly not xml 1.1 related, as the line in the xml gives away :-) To repeat the bug, using the webERP demo data:
#!/usr/bin/env python
import xmlrpclib
x_server = xmlrpclib.Server('http://www.weberp.org/weberp/api/api_xml-rpc.php', verbose=True)
# Get the stock items defined in the demo webERP installation
StockList = x_server.weberp.xmlrpc_SearchStockItems('discontinued', '0', 'admin', 'weberp')
if StockList[0] == 0:
    for StockID in StockList[1]:
        print str(StockID)
The webERP xml-rpc server uses XMLRPC for PHP http://phpxmlrpc.sourceforge.net/
msg161699 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2012-05-27 07:38
Phil: it seems you have hijacked the bug report. Don't do that. If you want to report a bug, please create a new bug report. Structure it as follows: 1. this is what I did 2. this is what happened 3. this is what should have happened instead.
msg161700 - (view)	Author: Phil Daintree (Phil.Daintree)	Date: 2012-05-27 07:52
or for less data...
#!/usr/bin/env python
import xmlrpclib
x_server = xmlrpclib.Server('http://www.weberp.org/weberp/api/api_xml-rpc.php', verbose=True)
# Get the stock items defined in the webERP installation
StockList = x_server.weberp.xmlrpc_SearchStockItems('units', 'cm', 'admin', 'weberp')
if StockList[0] == 0:
    for StockID in StockList[1]:
        print str(StockID)
msg161701 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2012-05-27 08:00
Panos: you are right. The original issue still exists. However, it is not a bug in Python, but one in the expat library. So I am now closing this report as out-of-scope for Python. There is a bug report open on expat requesting support for XML 1.1, see http://sourceforge.net/tracker/?func=detail&atid=110127&aid=891265&group_id=10127 This bug report has been open since 2004. I see little hope that expat will support XML 1.1 within the next five years. I also fail to see the regression: expat has never supported XML 1.1. xmlrpclib always used expat, at least since Python 2.0. In any case, this report is about expat, not xmlrpclib, so any possible regression in xmlrpclib should be reported separately.

History
Date	User	Action	Args
2022-04-11 14:57:15	admin	set	github: 56013
2012-05-27 08:00:59	loewis	set	status: open -> closed; resolution: wont fix
2012-05-27 08:00:51	loewis	set	messages: +
2012-05-27 07:52:45	Phil.Daintree	set	messages: +
2012-05-27 07:38:33	loewis	set	messages: +
2012-05-27 07:09:43	Phil.Daintree	set	messages: +
2012-05-24 16:07:54	xrg	set	status: closed -> open; resolution: wont fix -> (no value); messages: +
2012-05-24 15:58:06	loewis	set	status: open -> closed; resolution: wont fix; messages: +
2012-05-24 12:50:31	amaury.forgeotdarc	set	messages: +
2012-05-24 09:03:17	Phil.Daintree	set	messages: +
2012-05-22 12:09:20	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc; messages: +
2012-05-22 09:05:05	Phil.Daintree	set	messages: +
2012-05-22 09:03:37	Phil.Daintree	set	nosy: + Phil.Daintree; messages: +
2011-04-08 22:46:21	ezio.melotti	set	nosy: + ezio.melotti
2011-04-08 18:10:27	santoso.wijaya	set	nosy: + santoso.wijaya
2011-04-08 12:53:20	pitrou	set	nosy: + loewis
2011-04-08 09:33:02	xrg	create