Message 62430 - Python tracker
The W3C posted an item at http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic describing how their DTDs are being fetched up to 130M times per day.
The Python parsers are part of the problem, as noted by Paul Boddie on the python-advocacy list:
There are two places which stand out:
xml/dom/xmlbuilder.py
xml/sax/saxutils.py
What gives them away as the cause of the described problem is that they both fetch things which are given as "system identifiers" - the things you get in the document type declaration at the top of an XML document which look like a URL.
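To make "system identifier" concrete, here is a minimal sketch that reads the two identifiers out of a document type declaration. The sample document and the helper name doctype_identifiers are made up for illustration; xml.dom.minidom is used only to read the declaration, and as far as I know it does not go and fetch the DTD itself.

```python
from xml.dom import minidom

# Hypothetical sample document; the DOCTYPE carries a public identifier
# followed by a system identifier (the part that looks like a URL).
SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    '<body><p>hello</p></body></html>'
)


def doctype_identifiers(xml_text):
    """Return (public_id, system_id) from the document type declaration."""
    doctype = minidom.parseString(xml_text).doctype
    if doctype is None:
        # No DOCTYPE at all, so there is nothing a parser could fetch.
        return (None, None)
    return (doctype.publicId, doctype.systemId)


public_id, system_id = doctype_identifiers(SAMPLE)
print("public identifier:", public_id)
print("system identifier:", system_id)
```

The system identifier is the string a parser will try to open if it decides to load the external DTD.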
If you then put some trace statements into the code and then try and parse something using, for example, the xml.sax API, it becomes evident that by default the parser attempts to fetch lots of DTD-related resources, not helped by the way that stuff like XHTML is now "modular" and thus employs lots of separate files in the DTD. This is obvious because you get something like this printed to the terminal:
saxutils: opened http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstyle-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-framework-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-datatypes-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-qname-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-events-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-attribs-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml11-model-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-charent-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-lat1.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-symbol.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-special.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-text-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstruct-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlphras-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkstruct-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkphras-1.mod
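A similar trace can be had without patching the standard library and without touching the network, by installing an entity resolver that records what the parser asks for. This is only a sketch: TracingResolver, trace_requests and the sample document are hypothetical names of mine, and the resolver answers every request with an empty in-memory source. Because the top-level DTD comes back empty, the parser never sees the references to the .mod and .ent files, so only the first request shows up. The external-entities feature is switched on explicitly, since CPython 3.7.1 and later leave it off by default.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical sample document that points at the XHTML 1.1 DTD.
SAMPLE = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    b'"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    b'<body><p>hello</p></body></html>'
)


class TracingResolver(EntityResolver):
    """Record each requested system identifier and never open a URL."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        print("resolver: asked for", systemId)
        # Hand back an empty byte stream so nothing is downloaded.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b""))
        return source


def trace_requests(xml_bytes):
    """Parse xml_bytes and return the system identifiers the parser wanted."""
    parser = xml.sax.make_parser()
    # Opt in to external entity handling so the resolver is consulted.
    parser.setFeature(feature_external_ges, True)
    resolver = TracingResolver()
    parser.setEntityResolver(resolver)
    parser.setContentHandler(ContentHandler())
    parser.parse(io.BytesIO(xml_bytes))
    return resolver.requested


requested = trace_requests(SAMPLE)
```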
Of course, the "best practice" with APIs like SAX is that you define your own resolver or handler classes which don't go and fetch DTDs from the W3C all the time, but this isn't the "out of the box" behaviour. Instead, implementers have chosen the most convenient behaviour, which arguably involves the least effort in telling people how to get hold of DTDs so that they may validate their documents, but which isn't necessarily the "right thing" in terms of network behaviour. Naturally, since defining specific resolvers/handlers involves a lot of boilerplate (and you should try it in Java!), a lot of developers just incur the penalty of the default behaviour instead of considering the finer points of the various W3C specifications (which is never really any fun).
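For applications that don't need the DTD at all, the boilerplate can be fairly small. The sketch below assumes the expat-based reader returned by xml.sax.make_parser() and simply turns off the external-entities feature, so the system identifier is never opened. ElementCollector, FailIfAsked and parse_offline are hypothetical names; the FailIfAsked resolver is there only to show that the parser never consults it.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges

# Hypothetical sample document that points at the XHTML 1.1 DTD.
SAMPLE = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    b'"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    b'<body><p>hello</p></body></html>'
)


class ElementCollector(ContentHandler):
    """Collect element names in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


class FailIfAsked(EntityResolver):
    """Resolver that complains loudly if the parser ever consults it."""

    def resolveEntity(self, publicId, systemId):
        raise AssertionError("unexpected request for %s" % systemId)


def parse_offline(xml_bytes):
    """Parse xml_bytes without loading the external DTD or entities."""
    parser = xml.sax.make_parser()
    # With this feature off, the reader skips external entities entirely.
    parser.setFeature(feature_external_ges, False)
    parser.setEntityResolver(FailIfAsked())
    handler = ElementCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names


element_names = parse_offline(SAMPLE)
print(element_names)
```

The trade-off is that entities declared only in the DTD are then reported as skipped rather than expanded, so this suits documents that don't rely on them.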
Anyway, I posted a comment saying much the same on the blog referenced at the start of this thread, but we should be aware that this is default standard library behaviour, not rogue application developer behaviour.