Issue 2174: xml.sax.xmlreader does not support the InputSource protocol

Created on 2008-02-24 13:52 by ygale, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (9)
msg62900 - Author: Yitz Gale (ygale) Date: 2008-02-24 13:52
In the documentation for xml.sax.xmlreader.InputSource objects (section 8.12.4 of the Library Reference) we find that users of InputSource objects should use the following sequence to get their input data:

1. If the InputSource has a character stream, use that.
2. Otherwise, if the InputSource has a byte stream, use that.
3. Otherwise, open a URI connection to the system ID.

The parse() method of IncrementalParser skips step 1.

In addition, we need to add a method getSourceEncoding() to the XMLReader interface; if non-null, it will indicate to the parser that the input is a byte stream in the given encoding. The documentation should indicate what the parser should do if the XML itself announces that its encoding is something else. I propose that the parser should be required to raise an error in that case.

See also #1483.
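The documented lookup order can be sketched as follows. This is only an illustration and not the code in IncrementalParser: the helper name open_input() is hypothetical, and step 3 assumes the system ID is a URL that urllib can open.

```python
import io
import urllib.request
from xml.sax.xmlreader import InputSource


def open_input(source):
    """Return (stream, is_text) using the documented lookup order.

    open_input() is a hypothetical helper, not part of xml.sax.
    """
    # 1. Prefer a character stream (already-decoded text).
    char_stream = source.getCharacterStream()
    if char_stream is not None:
        return char_stream, True
    # 2. Otherwise fall back to a byte stream.
    byte_stream = source.getByteStream()
    if byte_stream is not None:
        return byte_stream, False
    # 3. Otherwise open the system ID (assumed here to be a URL).
    return urllib.request.urlopen(source.getSystemId()), False


# An InputSource that carries both kinds of stream: step 1 should win.
source = InputSource()
source.setByteStream(io.BytesIO(b"<root/>"))
source.setCharacterStream(io.StringIO("<root/>"))
stream, is_text = open_input(source)
```

With both streams set, the character stream is the one returned; skipping step 1 would hand the parser the byte stream instead.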
msg62904 - Author: Yitz Gale (ygale) Date: 2008-02-24 14:09
See also: #1483 and #2175.
msg62907 - Author: Yitz Gale (ygale) Date: 2008-02-24 14:18
Hmm. When getSourceEncoding() returns None, there needs to be some way for the parser to distinguish between two cases: either it is getting pre-decoded Unicode through a character stream, or it is getting a byte stream with an unspecified encoding. In the latter case, it will have to look in the XML for an encoding declaration (or use UTF-8 by default). Note that expat can handle only the latter case.
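As a rough sketch of the byte-stream case with an unspecified encoding: look for an encoding declaration at the start of the data, and fall back to UTF-8 if there is none. The function name sniff_encoding() is hypothetical, and the sketch is deliberately simplified: it assumes an ASCII-compatible encoding and ignores byte order marks and UTF-16 input, which a real parser has to handle.

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding pseudo-attribute, if present.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(head, default="utf-8"):
    """Return the encoding declared in the first bytes of a document.

    sniff_encoding() is a hypothetical helper; it returns `default`
    when no encoding declaration is found.
    """
    match = _DECL.match(head)
    if match is None:
        return default
    return match.group(1).decode("ascii")


declared = sniff_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?><a/>')
fallback = sniff_encoding(b"<a/>")
```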
msg62909 - Author: Yitz Gale (ygale) Date: 2008-02-24 14:53
So I think there are two possibilities:

1. Use a special value for getSourceEncoding(), like "unicode", to indicate that this is a unicode character stream and not a byte stream.
2. Provide yet another method in the XMLReader interface: sourceIsCharacterStream(), returning a bool.

There is a more drastic option:

3. Since expat doesn't support this stuff anyway, and perhaps not too many people have written parsers that do support it, dumb down the InputSource interface. Specifically, deprecate setCharacterStream(), getCharacterStream(), setEncoding() and getEncoding(), none of which are used by expat. Parsers should read the XML from the byte stream and use that to determine the encoding.

That may upset some implementors of XML libraries though. They would each have to go to some trouble to provide their own proprietary and possibly incompatible mechanisms for this, if they need it.

Perhaps a compromise fourth path would be to have subclasses of InputSource for the two cases of character stream and byte stream.
msg62940 - Author: Yitz Gale (ygale) Date: 2008-02-24 21:16
A subclass of XMLReader would be needed, not of InputSource.
msg64644 - Author: Fred Drake (fdrake) (Python committer) Date: 2008-03-28 18:42
It's certainly arguable that the current behavior is a bug, though I suspect it shouldn't be considered major since I've not seen any prior complaints about this. It should be easy to fix the bug you describe by taking the character stream and encoding it before feeding it to the XML parser; Expat can certainly be forced to take a known encoding, ignoring what's in the XML declaration. On the other hand, it's not at all clear that changing this is worthwhile. This API borrows quite literally from the Java SAX APIs; perhaps this separation of the character stream from the byte stream makes sense for some of the Java XML parsers, but I don't know that there are any Python parsers that benefit from that separation.
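A minimal sketch of the workaround suggested here, assuming the character stream is re-encoded as UTF-8 before it reaches Expat. The helper name parse_character_stream() is hypothetical. It relies on the documented behaviour of xml.parsers.expat.ParserCreate(), where an explicit encoding argument overrides the encoding named in the XML declaration.

```python
import io
import xml.parsers.expat


def parse_character_stream(char_stream):
    """Parse decoded text with Expat by re-encoding it as UTF-8.

    parse_character_stream() is a hypothetical helper; it returns the
    element names seen and the concatenated character data.
    """
    names, chunks = [], []
    # An explicit encoding overrides whatever the XML declaration says.
    parser = xml.parsers.expat.ParserCreate(encoding="utf-8")
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    # Character data may arrive in several pieces, so collect and join.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(char_stream.read().encode("utf-8"), True)
    return names, "".join(chunks)


# The declaration claims ISO-8859-1, but the bytes fed to Expat are UTF-8.
text = '<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\u00e9</doc>'
names, data = parse_character_stream(io.StringIO(text))
```

Because the declared encoding is ignored, the non-ASCII character survives the round trip even though the declaration no longer matches the bytes.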
msg239312 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-26 07:29
Issue2175 has a patch that covers all three issues. I am not sure which parts of the patch are worth applying to maintained releases.
msg239939 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-04-02 18:12
Fixed (in 3.5 only).
msg240171 - Author: Fred Drake (fdrake) (Python committer) Date: 2015-04-06 19:18
Given that this has languished this long, patching historical releases seems pointless.
History
Date User Action Args
2022-04-11 14:56:31 admin set github: 46427
2015-04-06 19:27:13 Arfrever set components: + XML
2015-04-06 19:26:18 Arfrever set stage: resolved; resolution: fixed; components: + Library (Lib), - Documentation, XML; versions: + Python 3.5, - Python 3.1, Python 2.7, Python 3.2
2015-04-06 19:18:26 fdrake set status: open -> closed; messages: +
2015-04-02 18:12:48 serhiy.storchaka set messages: +
2015-03-26 07:29:04 serhiy.storchaka set nosy: + serhiy.storchaka; messages: +
2013-01-31 10:02:57 serhiy.storchaka set dependencies: + Expat parser parses strings only when XML encoding is UTF-8
2010-06-09 21:59:34 terry.reedy set versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6, Python 2.5, Python 3.0
2008-03-28 18:42:40 fdrake set priority: normal -> low; messages: +; components: - Library (Lib), Unicode
2008-03-20 02:52:31 jafo set priority: normal; assignee: fdrake; nosy: + fdrake
2008-02-24 21:16:40 ygale set messages: +
2008-02-24 14:53:29 ygale set messages: +
2008-02-24 14:18:28 ygale set messages: +
2008-02-24 14:09:57 ygale set messages: +; components: + Unicode
2008-02-24 13:52:31 ygale create