[Python-Dev] XML codec?
Walter Dörwald walter at livinglogic.de
Sat Nov 10 16:55:41 CET 2007
"Martin v. Löwis" wrote:
>>> So what if the unicode string doesn't start with an XML declaration? Will it add one?
>> No.
> Ok. So the XML document would be ill-formed then unless the encoding is UTF-8, right?
I don't know. Is an XML document ill-formed if it doesn't contain an XML declaration and is not in UTF-8 or UTF-16, but there's external encoding info? If it is, then yes, the document would be ill-formed.
>> The point of this code is not just to return whether the string starts with "<?xml" or not. There are actually three cases:
> Still, it's overly complex for that matter:
>> * The string does start with "<?xml"
>     if s.startswith("<?xml"):
>         return Yes
>> * The string starts with a prefix of "<?xml", i.e. we can only decide if it starts with "<?xml" if we have more input.
>     if "<?xml".startswith(s):
>         return Maybe
>> * The string definitely doesn't start with "<?xml".
>     return No
This looks good. Now we would have to extend the code to detect and replace the encoding in the XML declaration too.
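The three-way decision quoted above can be written as a small runnable function. This is only a sketch: the `Match` enum and the name `starts_with_xml_decl` are made up for illustration and are not part of the proposed patch.

```python
from enum import Enum


class Match(Enum):
    """Three possible answers for an incremental prefix test (hypothetical helper)."""
    YES = "yes"
    MAYBE = "maybe"
    NO = "no"


def starts_with_xml_decl(s, marker="<?xml"):
    """Decide whether the input seen so far starts with the marker."""
    if s.startswith(marker):
        return Match.YES
    # The input is a (possibly empty) prefix of the marker:
    # more data is needed before a decision is possible.
    if marker.startswith(s):
        return Match.MAYBE
    return Match.NO
```

Note that the empty string yields `Match.MAYBE`, which is what an incremental decoder would want when it has not seen any input yet.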
>>> What bit fiddling are you referring to specifically that you think is better done in C than in Python?
>> The code that checks the byte signature, i.e. the first part of detectxmlencodingstr().
> I can't see any bit fiddling there, except for the bit mask of candidates. For the candidate list, I cannot quite understand why you need a bit mask at all, since the candidates are rarely overlapping.
I tried many variants and that seemed to be the most straightforward one.
> I think there could be a much simpler routine to have the same effect.
> - if it's less than 4 bytes, answer "need more data".
Can there be an XML document that is less than 4 bytes? I guess not.
> - otherwise, implement annex F "literally". Make a dictionary of all prefixes that are exactly 4 bytes, i.e.
>
>     prefixes4 = {"\x00\x00\xFE\xFF": "utf-32be", ...
>                  ..., "\0\x3c\0\x3f": "utf-16be"}
>     try:
>         return prefixes4[s[:4]]
>     except KeyError:
>         pass
>     if s.startswith(codecs.BOM_UTF16_BE):
>         return "utf-16be"
>     ...
>     if s.startswith("<?xml"):
>         return get_encoding_from_declaration(s)
>     return "utf-8"
get_encoding_from_declaration() would have to do the same yes/no/maybe decision.
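To make the suggested routine concrete, here is a minimal sketch of it on byte strings. Several things are assumptions rather than anything settled in this thread: the table lists only a few of the Appendix F signatures, `None` stands for "need more data", the returned names are Python codec names, and the regular expression is a simplification of the real XML declaration grammar.

```python
import codecs
import re

# 4-byte signatures, loosely following Appendix F of the XML 1.0 spec.
# Illustrative only; the table is not exhaustive.
PREFIXES4 = {
    codecs.BOM_UTF32_BE: "utf-32-be",   # 00 00 FE FF
    codecs.BOM_UTF32_LE: "utf-32-le",   # FF FE 00 00
    b"\x00\x00\x00\x3c": "utf-32-be",   # "<" without BOM
    b"\x3c\x00\x00\x00": "utf-32-le",
    b"\x00\x3c\x00\x3f": "utf-16-be",   # "<?" without BOM
    b"\x3c\x00\x3f\x00": "utf-16-le",
}

# Simplified pattern for the encoding pseudo-attribute (an assumption).
_ENCODING_RE = re.compile(rb'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def get_encoding_from_declaration(data):
    """Return the declared encoding, "utf-8" if none, or None for "maybe"."""
    end = data.find(b"?>")
    if end < 0:
        return None  # declaration is not complete yet
    match = _ENCODING_RE.search(data, 0, end)
    if match is None:
        return "utf-8"
    return match.group(1).decode("ascii").lower()


def detect_encoding(data):
    """Guess the encoding of an XML byte string; None means "need more data"."""
    if len(data) < 4:
        return None
    try:
        return PREFIXES4[data[:4]]
    except KeyError:
        pass
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # Exactly b"<?xm": still a proper prefix of the marker, so "maybe".
    if len(data) < 5 and b"<?xml".startswith(data):
        return None
    if data.startswith(b"<?xml"):
        return get_encoding_from_declaration(data)
    return "utf-8"
```

Looking up the 4-byte table before the 2-byte BOM checks matters, because the UTF-32-LE BOM (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).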
But anyway: would a Python implementation of these two functions (detect_encoding()/fix_encoding()) be accepted?
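For the replacement half, a pure-Python `fix_encoding()` on unicode strings might look roughly like the following. The signature, the `None` return for "need more data", and the choice to leave a declaration without an encoding pseudo-attribute untouched are all assumptions made for this sketch, and the pattern is a simplification of the declaration grammar.

```python
import re

# Matches the encoding pseudo-attribute with either quote style (simplified).
_ENC_ATTR_RE = re.compile(r'(encoding\s*=\s*)(["\'])[^"\']*\2')


def fix_encoding(text, encoding):
    """Rewrite the encoding named in a leading XML declaration.

    Returns the (possibly unchanged) text, or None if more input is
    needed before a decision is possible.
    """
    if not text.startswith("<?xml"):
        if "<?xml".startswith(text):
            return None  # could still turn into a declaration
        return text      # no declaration: nothing is added
    end = text.find("?>")
    if end < 0:
        return None      # declaration is not complete yet
    declaration = text[:end]
    fixed = _ENC_ATTR_RE.sub(
        lambda m: m.group(1) + m.group(2) + encoding + m.group(2),
        declaration,
        count=1,
    )
    return fixed + text[end:]
```

For example, `fix_encoding("<?xml version='1.0' encoding='utf-8'?><a/>", "iso-8859-1")` rewrites only the quoted encoding name and keeps the rest of the document as it was.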
Servus, Walter