Issue 7637: Improve 19.5. xml.dom.minidom doc

  1. "When you are finished with a DOM, you should clean it up. This is necessary because some versions of Python do not support garbage collection of objects that refer to each other in a cycle. Until this restriction is removed from all versions of Python, it is safest to write your code as if cycles would not be cleaned up."

This appears to refer to early 2.x CPython versions without the gc module. Such (cryptic) back references are not appropriate for 3.x docs. Even in 3.x, an immediate unlink might be a good idea, especially for CPython (which would then clean up immediately). But none of these issues are specific to DOM objects. Suggested replacement for the above and for the sentence that currently follows it ("The way to clean up a DOM is to call its unlink() method:"):

"When you are finished with a DOM, you can call the unlink method to encourage early cleanup of unneeded objects:"

Anything more is redundant with the doc for the method.

'''
dom1.unlink()
dom2.unlink()
dom3.unlink()
'''

One example at most is quite sufficient.
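For illustration, a single example along these lines might be enough. This is only a sketch: the helper name `root_tag` and the sample XML are made up here and are not taken from the docs or the patch.

```python
from xml.dom import minidom


def root_tag(xml_text):
    """Return the root element's tag name, then release the DOM.

    `root_tag` is a hypothetical helper used only for this illustration.
    """
    dom = minidom.parseString(xml_text)
    try:
        return dom.documentElement.tagName
    finally:
        # Break the references between nodes so they can be reclaimed early
        dom.unlink()


print(root_tag("<slideshow><title>Demo</title></slideshow>"))
```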

  2. '''Node.toxml([encoding])
Return the XML that the DOM represents as a string.

With no argument, the XML header does not specify an encoding, and the result is Unicode string if the default encoding cannot represent all characters in the document. Encoding this string in an encoding other than UTF-8 is likely incorrect, since UTF-8 is the default encoding of XML.

With an explicit encoding [1] argument, the result is a byte string in the specified encoding. It is recommended that this argument is always specified. To avoid UnicodeError exceptions in case of unrepresentable text data, the encoding argument should be specified as “utf-8”.
'''

I find this API a bit confusing.

In 3.x, "Return ... a string." means str (unicode), but the rest implies that 'string' should be 'string or bytes'.

"default encoding": what is it? ASCII, UTF-8 (as almost implied), or something in the sys module? If the latter, please specify.

A cleaner API would have been to 1. always return str (unicode), 2. always return bytes, with an encoding='utf-8' default, or 3. return str if no encoding is given and bytes if one is given, with no default.
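A quick check of what a current CPython 3 interpreter actually returns may help pin down the wording. The sample document below is made up for illustration. As far as I can tell, the behavior matches option 3 above: str without an encoding, bytes with one.

```python
from xml.dom import minidom

# Sample document invented for this check; &#233; is a non-ASCII character
dom = minidom.parseString("<greeting>caf&#233;</greeting>")

as_text = dom.toxml()                    # no encoding argument
as_bytes = dom.toxml(encoding="utf-8")   # explicit encoding argument
dom.unlink()

print(type(as_text).__name__)    # type of the result without an encoding
print(type(as_bytes).__name__)   # type of the result with an encoding
print(as_bytes)                  # the XML declaration names the encoding
```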

  3. Revision of the following antipattern example would be for 2.x also:

'''
def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc
'''

should be (not tested, but pretty straightforward)

'''
def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)
'''
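Since the revision was posted untested, a small comparison such as the following could confirm that both versions agree. The names `get_text_concat` and `get_text_join` and the sample markup are hypothetical, chosen only to keep the two variants apart in one script.

```python
from xml.dom import minidom


def get_text_concat(nodelist):
    """Original version: repeated string concatenation."""
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc


def get_text_join(nodelist):
    """Suggested version: collect the pieces, then join once."""
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return "".join(rc)


dom = minidom.parseString("<p>Hello, <b>bold</b> world</p>")
children = dom.documentElement.childNodes

# Only the direct text children are collected; the text inside <b> is skipped
old_result = get_text_concat(children)
new_result = get_text_join(children)
dom.unlink()

print(old_result == new_result)
```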

Thank you for the patches, but I do not think this is quite done.

  1. "It is recommended that you always specify an encoding; you may use any encoding you like, but an argument of "utf-8" is the most common, avoid :exc:UnicodeError exceptions in case of unrepresentable text data." The phrase after the comma is garbled. I think it means something like "It avoids :exc:UnicodeError exceptions for unrepresentable text data."

  2. For Node.toprettyxml(indent="", newl="", encoding="") I think "There's also an encoding argument, that behaves like the corresponding argument of :meth:toxml." should simply say "The encoding argument behaves like the corresponding argument of :meth:toxml."

We already know there is one because it is there in the signature. I suspect saying so might date back to when there either was no signature or encoding was left out of it.
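As a sanity check on the claim that the encoding argument behaves like the one for toxml, something like this can be run on a current CPython 3. The sample document and the two-space indent are arbitrary choices for the sketch.

```python
from xml.dom import minidom

# Arbitrary sample document for the check
dom = minidom.parseString("<a><b/></a>")

pretty_text = dom.toprettyxml(indent="  ")                     # no encoding
pretty_bytes = dom.toprettyxml(indent="  ", encoding="utf-8")  # explicit encoding
dom.unlink()

print(type(pretty_text).__name__)
print(type(pretty_bytes).__name__)
```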