Issue 24079: xml.etree.ElementTree.Element.text does not conform to the documentation

Created on 2015-04-29 23:34 by jlaurens, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
etree-text.patch	martin.panter, 2015-05-30 01:34	
etree-text.v2.patch	martin.panter, 2015-06-03 13:37	

Messages (23)
msg242256 - (view)	Author: Jérôme Laurens (jlaurens)	Date: 2015-04-29 23:34
The documentation for xml.etree.ElementTree.Element.text reads "If the element is created from an XML file the attribute will contain any text found between the element tags."
    import xml.etree.ElementTree as ET
    root3 = ET.fromstring('<a><b/>TEXT</a>')
    print(root3.text)
CURRENT OUTPUT
    None
"TEXT" is between the element's tags but does not appear in the output.
BTW: this is well-formed XML and has nothing to do with tail.
msg242257 - (view)	Author: Ned Deily (ned.deily) *	Date: 2015-04-30 02:35
(This issue is a followup to your Issue24072.) Again, while the ElementTree documentation is certainly not nearly as complete as it should be, I don't think this is a documentation error per se. The key issue is: with which element is each text string associated? Perhaps this example will help:
>>> root4 = ET.fromstring('<a>ATEXT<b>BTEXT</b>BTAIL</a>')
>>> root4
<Element 'a' at 0x10224c228>
>>> root4.text
'ATEXT'
>>> root4.tail
>>> root4[0]
<Element 'b' at 0x1022ab278>
>>> root4[0].text
'BTEXT'
>>> root4[0].tail
'BTAIL'
As in your original example, any text following the element b is associated with b's tail attribute until a new tag is found, pushing or popping the tree stack. While the description of the "text" attribute does not explicitly state this, the "tail" attribute description immediately following it does. This is also explained in more detail in the ElementTree resources on effbot.org that are linked to from the Python Standard Library documentation. Nevertheless, it probably would be helpful to expand the documentation on this point if someone is willing to put together a documentation patch for review.
With regard to your comment about "well formed xml", I don't think there is anything in the documentation that implies (or should imply) that the distinction between the "text" attribute and the "tail" attribute has anything to do with whether it is well-formed XML. The tutorial for the third-party lxml package, which provides another implementation of ElementTree, goes into more detail about why, in general, both "text" and "tail" are necessary.
https://docs.python.org/3/library/xml.etree.elementtree.html#additional-resources
http://effbot.org/zone/element.htm#text-content
http://lxml.de/tutorial.html#elements-contain-text
msg242263 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-04-30 06:38
> this is well formed xml and has nothing to do with tail.
In fact, it does have something to do with tail. The 'TEXT' is captured as the tail of element b:
>>> root3 = ET.fromstring('<a><b/>TEXT</a>')
>>> root3[0].tail
'TEXT'
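Editorial sketch of the behaviour discussed so far. It assumes the reported document was `<a><b/>TEXT</a>`, which is consistent with the outputs quoted above; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumed form of the document from the original report.
root = ET.fromstring('<a><b/>TEXT</a>')
b = root[0]

# Nothing sits between <a> and its first child, so a.text is None.
print(root.text)  # None
# <b/> is empty, so b.text is None as well.
print(b.text)     # None
# 'TEXT' follows b's end tag, so it is stored as b's tail.
print(b.tail)     # TEXT
```

Under this reading, no text is lost during parsing. It is attached to the child element that precedes it rather than to the enclosing element.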
msg242264 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-04-30 07:04
I agree that the wording in the documentation isn't great:
"""
text
    The text attribute can be used to hold additional data associated with the element. As the name implies this attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found between the element tags.
tail
    The tail attribute can be used to hold additional data associated with the element. This attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found after the element’s end tag and before the next tag.
"""
Special cases that no-one uses (sticking non-string objects into text/tail) are given too much space and the difference isn't explained as needed. Since the distinction between text and tail is a (great but) rather special feature of ElementTree, it needs to be given more room in the docs.
Proposal:
"""
text
    The text attribute holds the immediate text content of the element. It contains any text found up to either the closing tag if the element has no children, or the next opening child tag within the element. For text following an element, see the `tail` attribute. To collect the entire text content of a subtree, see `tostring`. Applications may store arbitrary objects in this attribute.
tail
    The tail attribute holds any text that directly follows the element. For example, in a document like ``<a>Text<b/>BTail<c/>CTail</a>``, the `text` attribute of the ``a`` element holds the string "Text", and the tail attributes of ``b`` and ``c`` hold the strings "BTail" and "CTail" respectively. Applications may store arbitrary objects in this attribute.
"""
msg242268 - (view)	Author: Jérôme Laurens (jlaurens)	Date: 2015-04-30 11:35
Since the text and tail notions seem tightly coupled, I would vote for a more detailed explanation in the text doc and a forward link in the tail documentation.
"""
text
    The text attribute holds the text between the element's begin tag and the next tag or None. The tail attribute holds the text between the element's end tag and the next tag or None. For "<a><b>1<c>2<d/>3</c></b>4</a>" xml data, the a element has None for both text and tail attributes, the b element has text '1' and tail '4', the c element has text '2' and tail None, the d element has text None and tail '3'. To collect the inner text of an element, see `tostring` with method 'text'. Applications may store arbitrary objects in this attribute.
tail
    The tail attribute holds the text between the element's end tag and the next tag or None. See `text` for more details. Applications may store arbitrary objects in this attribute.
"""
It is very important to mention that the 'text' attribute does not always hold a string, contrary to what its name would suggest.
BTW, I was not aware of the tostring method with 'text' argument. The fact is that the documentation reads "Returns an (optionally) encoded string containing the XML data." which is misleading because the text is not xml data in general. This also needs to be rephrased or simply removed.
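Editorial sketch checking the four-element example from the proposed wording. It assumes the example document is `<a><b>1<c>2<d/>3</c></b>4</a>`, the only nesting consistent with the text and tail values listed; `pairs` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Assumed form of the example document from the proposed wording.
root = ET.fromstring('<a><b>1<c>2<d/>3</c></b>4</a>')

# Map each tag to its (text, tail) pair, walking the tree in document order.
pairs = {elem.tag: (elem.text, elem.tail) for elem in root.iter()}

for tag, (text, tail) in pairs.items():
    print(tag, repr(text), repr(tail))
# a None None
# b '1' '4'
# c '2' None
# d None '3'
```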
msg242279 - (view)	Author: Jérôme Laurens (jlaurens)	Date: 2015-04-30 17:56
The tostring(..., method='text') is not suitable for the inner text because it adds the tail of the top element. A proper implementation would be
    def innertext(elt):
        return (elt.text or '') +''.join(innertext(e)+e.tail for e in elt)
that can be included in the doc instead of the mention of the tostring trick
msg242280 - (view)	Author: Jérôme Laurens (jlaurens)	Date: 2015-04-30 18:03
Erratum
    def innertext(elt):
        return (elt.text or '') +''.join(innertext(e)+(e.tail or '') for e in elt)
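Editorial sketch comparing the corrected `innertext` helper with `tostring(..., method='text')`. It assumes the sample document `<a><b>1<c>2<d/>3</c></b>4</a>` and uses `encoding='unicode'` so that both results are `str`; the names `inner` and `serialized` are illustrative.

```python
import xml.etree.ElementTree as ET


def innertext(elt):
    """Return the text inside elt, excluding elt's own tail."""
    # The `or ''` guards are needed because text and tail may be None.
    return (elt.text or '') + ''.join(
        innertext(e) + (e.tail or '') for e in elt
    )


root = ET.fromstring('<a><b>1<c>2<d/>3</c></b>4</a>')
b = root[0]

inner = innertext(b)
serialized = ET.tostring(b, encoding='unicode', method='text')

print(inner)       # 123
print(serialized)  # 1234  (b's tail '4' is appended)
```

The two results differ only by the tail of the top element, which is the objection raised against the tostring approach.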
msg243032 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-05-13 04:06
Another problem with tostring() is that it seems you have to call it with encoding="unicode". Perhaps it would be better to suggest code like "".join(element.itertext())? I would also improve on Jérôme’s version by making the None case more explicit. And perhaps both attributes can be defined together, rather than giving a half-hearted definition linking between them:
.. attribute:: text
.. attribute:: tail
   The text attribute holds any text between the element's begin tag and the next tag. The tail attribute holds any text between the element's end tag and the next tag. These attributes are set to ``None`` if there is no text. For example, in the XML data ``<a><b>1<c>2<d/>3</c></b>4</a>``, the a element has ``None`` for both text and tail attributes, the b element has text ``"1"`` and tail ``"4"``, the c element has text ``"2"`` and tail ``None``, the d element has text ``None`` and tail ``"3"``.
   To collect the inner text of an element, use :meth:`itertext`, for example ``"".join(element.itertext())``.
   Applications may store arbitrary objects in these attributes.
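Editorial sketch of the two points in this message: `itertext()` yields only the text inside the element, and `tostring()` returns bytes unless `encoding="unicode"` is passed. The sample document is assumed to be `<a><b>1<c>2<d/>3</c></b>4</a>`; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<a><b>1<c>2<d/>3</c></b>4</a>')
b = root[0]

# itertext() walks b's subtree and does not include b's own tail.
joined = ''.join(b.itertext())

# With the default encoding, tostring() returns bytes ...
as_bytes = ET.tostring(b, method='text')
# ... and a str only when encoding='unicode' is requested.
as_str = ET.tostring(b, encoding='unicode', method='text')

print(joined)                  # 123
print(type(as_bytes).__name__) # bytes
print(as_str)                  # 1234
```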
msg244434 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-05-30 01:34
Here is a patch with my suggestion. Let me know what you think.
msg244445 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-05-30 05:35
IMHO less clear and less correct than the previous suggestions.
msg244446 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-05-30 05:40
Seems like a good idea to explain "text" and "tail" in one section, though. That makes "tail" easier to find for those who are not used to this kind of split (and that's basically everyone who needs to read the docs in the first place).
msg244744 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-06-03 13:37
Okay, here is a version with most of the wording reverted to Jérôme’s suggestion. I only left my itertext() example, and the grouping of text and tail together. If there are any more bits that are incorrect or unclear please identify them.
msg244869 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-06-05 15:08
Looks good to me.
msg247736 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-07-31 06:09
could we apply this patch, please?
msg247740 - (view)	Author: Ned Deily (ned.deily) *	Date: 2015-07-31 07:17
I note that the current wording for both "text" and "tail" is careful to allow for the most general use of the Element class, that is, that it may be used in non-XML contexts, for example: "The text attribute can be used to hold additional data associated with the element. As the name implies this attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found between the element tags." The proposed patch downplays that generality. How about modifying the original wording so that the description starts something like: "These attributes can be used to hold additional [...] application-specific object. If the element is created from an XML file, the text attribute holds either the text between the element's start tag and its first child or end tag, or ``None``, and the tail attribute holds either the text [...]."
msg247741 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-07-31 07:35
> The proposed patch downplays that generality. That is completely intentional. Almost all readers of the documentation will first need to understand the difference between text and tail before they can go and think about any more advanced use cases that will almost certainly fail on their first serialisation attempts. The most important aim of the new phrasing is therefore to make that difference clear. Everything else is secondary, although still worth mentioning.
msg247744 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-07-31 13:34
I think Ned’s version is an acceptable solution (modulo some punctuation) to the original problem, although I do agree with Stefan that downplaying the generality would be even better. Perhaps we could add a qualifier, like “The text attribute [normally] holds . . .”
msg247745 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-07-31 14:02
Personally, I would prefer getting the improved version applied over bikeshedding for another couple of months. But maybe that's just me.
msg248481 - (view)	Author: Robert Collins (rbcollins) *	Date: 2015-08-12 22:34
So it is downplayed but it is still documented as being application usable. I'll give this another week for Ned to reply, then commit it in the absence of a reply: I think it's ok as is. I'd be ok with a tweaked version along the lines Ned proposed too: both ways are better than what's in tree today.
msg248752 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-08-18 02:17
New changeset d3cda8cf4d42 by Ned Deily in branch '2.7':
Issue #24079: Improve description of the text and tail attributes for
https://hg.python.org/cpython/rev/d3cda8cf4d42
New changeset ad0491f85050 by Ned Deily in branch '3.4':
Issue #24079: Improve description of the text and tail attributes for
https://hg.python.org/cpython/rev/ad0491f85050
New changeset 17ce3486fd8f by Ned Deily in branch '3.5':
Issue #24079: merge from 3.4
https://hg.python.org/cpython/rev/17ce3486fd8f
New changeset 3c94ece57c43 by Ned Deily in branch 'default':
Issue #24079: merge from 3.5
https://hg.python.org/cpython/rev/3c94ece57c43
msg248753 - (view)	Author: Ned Deily (ned.deily) *	Date: 2015-08-18 02:20
Thanks for all of your contributions on this. I've committed a version along the lines I suggested along with Martin's example.
msg248760 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-08-18 06:20
The "can store arbitrary objects" sentence is now duplicated, and still way too visible. I have to read three sentences until it tells me what I need to know.
msg248762 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-08-18 06:29
I think the first two sentences can simply be removed to fix this, without loss of readability or information.

History
Date	User	Action	Args
2022-04-11 14:58:16	admin	set	github: 68267
2015-08-18 06:29:21	scoder	set	messages: +
2015-08-18 06:20:17	scoder	set	messages: +
2015-08-18 02:20:53	ned.deily	set	status: open -> closed; type: behavior ->; messages: +; resolution: fixed; stage: commit review -> resolved
2015-08-18 02:17:03	python-dev	set	nosy: + python-dev; messages: +
2015-08-12 22:34:53	rbcollins	set	nosy: + rbcollins; messages: +
2015-07-31 14:02:32	scoder	set	messages: +
2015-07-31 13:34:14	martin.panter	set	messages: +
2015-07-31 07:35:52	scoder	set	messages: +
2015-07-31 07:17:17	ned.deily	set	messages: +
2015-07-31 06:09:02	scoder	set	messages: +
2015-07-07 00:24:48	martin.panter	set	stage: patch review -> commit review
2015-06-05 15:08:48	scoder	set	messages: +
2015-06-03 13:37:14	martin.panter	set	files: + etree-text.v2.patch; messages: +
2015-05-30 05:40:18	scoder	set	messages: +
2015-05-30 05:35:57	scoder	set	messages: +
2015-05-30 01:34:13	martin.panter	set	files: + etree-text.patch; versions: + Python 3.6; messages: +; components: + XML; keywords: + patch; stage: needs patch -> patch review
2015-05-13 04:06:07	martin.panter	set	nosy: + martin.panter; messages: +
2015-04-30 18:03:21	jlaurens	set	messages: +
2015-04-30 17:56:16	jlaurens	set	messages: +
2015-04-30 11:35:53	jlaurens	set	messages: +
2015-04-30 07:04:40	scoder	set	messages: +
2015-04-30 06:38:16	rhettinger	set	nosy: + rhettinger, scoder, eli.bendersky; messages: +
2015-04-30 02:35:08	ned.deily	set	assignee: docs@python; components: + Documentation, - XML; versions: + Python 2.7, Python 3.5; nosy: + docs@python, ned.deily; messages: +; stage: needs patch
2015-04-29 23:34:55	jlaurens	create