[Python-Dev] XML codec? (original) (raw)
Adam Olsen rhamph at gmail.com
Thu Nov 8 23:45:51 CET 2007
- Previous message: [Python-Dev] XML codec?
- Next message: [Python-Dev] XML codec?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 11/8/07, Walter Dörwald <walter at livinglogic.de> wrote:
Martin v. Löwis wrote:
>> Then how about the suggested "xml-auto-detect"? > > That is better. OK. >>> Then, I'd claim that the problem that the codec solves doesn't really >>> exist. IOW, most XML parsers implement the auto-detection of encodings, >>> anyway, and this is where architecturally this functionality belongs. >> But not all XML parsers support all encodings. The XML codec makes it >> trivial to add this support to an existing parser. > > I would like to question this claim. Can you give an example of a parser > that doesn't support a specific encoding It seems that e.g. expat doesn't support UTF-32: from xml.parsers import expat p = expat.ParserCreate() e = "utf-32" s = (u"" % e).encode(e) p.Parse(s, True) This fails with: Traceback (most recent call last): File "gurk.py", line 6, in p.Parse(s, True) xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1 Replace "utf-32" with "utf-16" and the problem goes away. > and where adding such a codec > solves that problem? > > In particular, why would that parser know how to process Python Unicode > strings? It doesn't have to. You can use an XML encoder to reencode the unicode string into bytes (forcing an encoding that the parser knows): import codecs from xml.parsers import expat ci = codecs.lookup("xml-auto-detect") p = expat.ParserCreate() e = "utf-32" s = (u"" % e).encode(e) s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0] p.Parse(s, True) >> Furthermore encoding-detection might be part of the responsibility of >> the XML parser, but this decoding phase is totally distinct from the >> parsing phase, so why not put the decoding into a common library? > > I would not object to that - just to expose it as a codec. Adding it > to the XML library is fine, IMO. But it does make sense as a codec. The decoding phase of an XML parser has to turn a byte stream into a unicode stream. That's the job of a codec.
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc codecs to do the encoding. There's no need to create a magical mystery codec to pick out which though. It's not even sufficient for XML:
- round-tripping a file should be done in the original encoding. Containing the auto-detected encoding within a codec doesn't let you see what it picked.
- the encoding may be specified externally from the file/stream[1]. The xml parser needs to handle these out-of-band encodings anyway.
[2] http://mail.python.org/pipermail/xml-sig/2004-October/010649.html
-- Adam Olsen, aka Rhamphoryncus
- Previous message: [Python-Dev] XML codec?
- Next message: [Python-Dev] XML codec?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]