[Python-Dev] XML codec?

Adam Olsen rhamph at gmail.com
Fri Nov 9 21:35:11 CET 2007


On Nov 9, 2007 6:10 AM, Walter Dörwald <walter at livinglogic.de> wrote:

Martin v. Löwis wrote:
>>> Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
>>> codecs to do the encoding. There's no need to create a magical
>>> mystery codec to pick out which though.
>>
>> So the code is good, if it is inside an XML parser, and it's bad if it
>> is inside a codec?
>
> Exactly so. This functionality just isn't a codec - there is no
> encoding. Instead, it is an algorithm for detecting an encoding.

And what do you do once you've detected the encoding? You decode the
input, so why not combine both into an XML decoder?

It seems to me that parsing XML requires 3 steps:

  1. determine encoding
  2. decode byte stream
  3. parse XML (including handling of character references)

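Step 1 is described in Appendix F of the XML 1.0 recommendation: sniff a possible BOM, otherwise look for a recognizable "<?xml" pattern and read the encoding pseudo-attribute. A minimal sketch of that step (the name detect_encoding and the fallback behaviour are my own, not a stdlib API):

```python
import re

def detect_encoding(raw_data):
    """Guess the encoding of an XML byte stream (XML 1.0, Appendix F).

    Checks for a BOM first, then for the '<?xml' pattern and its
    encoding= pseudo-attribute. Falls back to UTF-8, the XML default.
    """
    boms = [
        (b'\x00\x00\xfe\xff', 'utf-32-be'),
        (b'\xff\xfe\x00\x00', 'utf-32-le'),
        (b'\xfe\xff', 'utf-16-be'),
        (b'\xff\xfe', 'utf-16-le'),
        (b'\xef\xbb\xbf', 'utf-8'),
    ]
    for bom, encoding in boms:
        if raw_data.startswith(bom):
            return encoding
    # No BOM: an ASCII-compatible '<?xml' declaration names the
    # encoding; NUL patterns reveal BOM-less UTF-16 streams.
    if raw_data.startswith(b'<?xml'):
        match = re.match(
            br'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']',
            raw_data)
        if match:
            return match.group(1).decode('ascii')
    elif raw_data.startswith(b'\x00<\x00?'):
        return 'utf-16-be'
    elif raw_data.startswith(b'<\x00?\x00'):
        return 'utf-16-le'
    return 'utf-8'  # the XML default
```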
All an XML codec does is make the first step a side-effect of the second. Rather than this:

    encoding = detect_encoding(raw_data)
    decoded_data = raw_data.decode(encoding)
    tree = parse_xml(decoded_data, encoding)  # Verifies encoding

You'd have this:

    e = codecs.getincrementaldecoder("xml-auto-detect")()
    decoded_data = e.decode(raw_data, True)
    tree = parse_xml(decoded_data, e.encoding)  # Verifies encoding
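For illustration, such a decoder could be sketched as a wrapper that sniffs the first bytes and then delegates to the ordinary incremental decoder for the detected encoding. The class name and the BOM-only detection below are my own simplification, not the codec Walter actually proposed:

```python
import codecs

class XMLAutoDetectDecoder:
    """Toy incremental decoder: detect the encoding from a BOM on the
    first bytes (BOM sniffing only, for brevity), then delegate to the
    real incremental decoder for that encoding."""

    _BOMS = [
        (codecs.BOM_UTF32_BE, 'utf-32-be'),
        (codecs.BOM_UTF32_LE, 'utf-32-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF8, 'utf-8'),
    ]

    def __init__(self):
        self.encoding = None   # exposed, like e.encoding above
        self._decoder = None
        self._buffer = b''

    def decode(self, data, final=False):
        if self._decoder is None:
            self._buffer += data
            if len(self._buffer) < 4 and not final:
                return ''      # need more bytes to sniff a BOM
            for bom, name in self._BOMS:
                if self._buffer.startswith(bom):
                    self.encoding = name
                    data = self._buffer[len(bom):]
                    break
            else:
                self.encoding = 'utf-8'   # the XML default
                data = self._buffer
            self._decoder = codecs.getincrementaldecoder(self.encoding)()
        return self._decoder.decode(data, final)
```

The point of the exercise: the caller never names an encoding, but the detected one is still available afterwards via the decoder's encoding attribute, so the parser can verify it against the XML declaration.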

It's clear to me that detecting an encoding is actually the simplest part of all this (so long as there's an API to do it!). Putting it inside a codec seems like the wrong subdivision of responsibility.

(An example using streams would end up closer, but it still seems wrong to me. Encoding detection is always one-way, while codecs are always two-way (even if lossy).)

-- Adam Olsen, aka Rhamphoryncus


