Issue 17582: xml.etree.ElementTree does not preserve whitespaces in attributes

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	Emiliano Heyns, benjamin.peterson, duaneg, eli.bendersky, lwcolton, piro, python-dev, rhettinger, scoder, skrah
Priority:	normal	Keywords:	easy, patch

Created on 2013-03-30 16:26 by piro, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
17582-etree-whitespace.patch	lwcolton,2014-10-15 02:53	Patch for ElementTree	review
17582-etree-whitespace-test.patch	duaneg,2015-09-04 03:34	review

Messages (14)
msg185574 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2013-03-30 16:26
XML defines the following chars as whitespace [1]::

    S ::= (#x20 | #x9 | #xD | #xA)+

However the chars are not properly escaped into attributes, so they are converted into spaces as per attribute-value normalization [2]::

    >>> data = '\x09\x0a\x0d\x20'
    >>> data
    '\t\n\r '

    >>> import xml.etree.ElementTree as ET
    >>> e = ET.Element('x', attr=data)
    >>> s = ET.tostring(e)
    >>> s
    ''

    >>> e1 = ET.fromstring(s)
    >>> data1 = e1.attrib['attr']
    >>> data1 == data
    False
    >>> data1
    ' \n '

cElementTree suffers from the same bug::

    >>> import xml.etree.cElementTree as cET
    >>> cET.fromstring(cET.tostring(cET.Element('a', attr=data))).attrib['attr']
    ' \n '

but not the external library lxml.etree::

    >>> import lxml.etree as LET
    >>> LET.fromstring(LET.tostring(LET.Element('a', attr=data))).attrib['attr']
    '\t\n\r '

The bug is analogous to #5752 but it refers to a different and independent module.

Proper escaping should be added to the _escape_attrib() function in /xml/etree/ElementTree.py (and equivalent for cElementTree).

[1] http://www.w3.org/TR/REC-xml/#white
[2] http://www.w3.org/TR/REC-xml/#AVNormalize
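The effect of escaping can be sketched independently of the ElementTree serializer. This is only an illustration, not the code of `_escape_attrib()`: the helper name `escape_attr_whitespace` is hypothetical, and the documents are built by hand so the result does not depend on which escaping the installed Python applies. Whitespace written as numeric character references is not touched by attribute-value normalization, so the value round-trips.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_attr_whitespace(value):
    """Hypothetical helper: escape markup characters and whitespace controls."""
    # escape() handles &, < and > first; the extra mapping adds the double
    # quote and the whitespace characters that normalization would alter.
    return escape(value, {
        '"': "&quot;",
        "\t": "&#9;",
        "\n": "&#10;",
        "\r": "&#13;",
    })


data = "\t\n\r "

# Same attribute value, once written literally and once escaped.
raw_doc = '<x attr="%s" />' % data
escaped_doc = '<x attr="%s" />' % escape_attr_whitespace(data)

raw_value = ET.fromstring(raw_doc).attrib["attr"]
escaped_value = ET.fromstring(escaped_doc).attrib["attr"]

print(repr(raw_value))        # altered by the parser
print(escaped_value == data)  # character references survive parsing
```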
msg228569 - (view)	Author: Stefan Behnel (scoder) *	Date: 2014-10-05 14:19
> Proper escaping should be added to the _escape_attrib() function into /xml/etree/ElementTree.py (and equivalent for cElementTree).

Agreed. Can you provide a patch?
msg229272 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2014-10-13 22:02
No, I cannot. I take the fact there has been no answer for more than 18 months as an acknowledgement that the issue is not deemed important by Python maintainers: it's not important for me either. I'm not a heavy xml user: just knowing that the Python XML libraries are unreliable and that by default I should use lxml is a sufficient solution to my sporadic xml uses. Your mileage should vary.
msg229398 - (view)	Author: Colton Leekley-Winslow (lwcolton) *	Date: 2014-10-15 02:53
Here is a patch. Please note that in your example \r is replaced by \n per 2.11: http://www.w3.org/TR/REC-xml/#sec-line-ends Also, the patch is only for ElementTree, I will investigate cElementTree but no promises.
msg229399 - (view)	Author: Colton Leekley-Winslow (lwcolton) *	Date: 2014-10-15 02:56
I sort of realized, does this mean lxml.etree would now be the offender, for not following 2.11 and leaving the \r as-is?
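As a rough illustration of the section 2.11 rule under discussion, the sketch below uses element text rather than attributes. It assumes only the standard expat-based parser that ships with `xml.etree.ElementTree`: literal line breaks in the input are normalized to `\n`, while a carriage return written as a character reference is not a literal line break and is kept.

```python
import xml.etree.ElementTree as ET

# Literal CR LF and a lone CR in the input are normalized to LF by the parser.
literal = ET.fromstring(b"<a>one\r\ntwo\rthree</a>").text

# A carriage return written as a character reference is not a literal
# line break in the input, so it survives parsing.
referenced = ET.fromstring(b"<a>one&#13;two</a>").text

print(repr(literal))
print(repr(referenced))
```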
msg240525 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2015-04-12 02:54
The patch seems reasonable, though it needs a test.
msg249707 - (view)	Author: Duane Griffin (duaneg) *	Date: 2015-09-04 03:34
Here is a patch with a unit test for the new escaping functionality. I believe it covers all the new cases. Additional code is not required for cElementTree as the serialisation code is all Python.
msg249932 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-09-05 21:44
Stefan, can you opine on the patches and whether they should be backported?
msg250005 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-09-06 18:53
Patch and test look correct. They fix a bug that produces incorrect output, so I vote for backporting them.

Most code won't see the difference as whitespace control characters are rare in attribute values. And code that uses them will benefit from correctness. Obviously, there might also be breakage in the rare case that code puts control characters into attribute values and expects them to disappear magically, but then it's the user code that is wrong here.

Only issue is that serialisation is slow already and this change slows it down a bit more. Every attribute value will now be searched 8 times instead of 5 times. I added a minor review comment that would normally reduce it to 7. timeit suggests to me that the overall overhead is still tiny, though, and thus acceptable:

    $ python3.5 -m timeit -s "s = 'askhfalsdhfashldfsadf'" "'\n' in s"
    10000000 loops, best of 3: 0.0383 usec per loop

    $ python3.5 -m timeit -s "s = 'askhfalsdhfashldfsadf'" "s.replace('\n', 'y')"
    10000000 loops, best of 3: 0.151 usec per loop

    $ python3.5 -m timeit -s "s = 'askhfalsdhfashldfsadf'; rep=s.replace" "rep('\n', 'y')"
    10000000 loops, best of 3: 0.12 usec per loop
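The same comparison can be repeated from a script with the `timeit` module. This is a sketch for re-measuring on a current interpreter, not a reproduction of the figures above: the loop count is an arbitrary choice, and the lambdas add call overhead that the command-line form does not have, so only the relative size of the two numbers is meaningful.

```python
import timeit

s = "askhfalsdhfashldfsadf"
number = 100_000  # arbitrary loop count for this sketch

# Membership test alone: the cost paid per search when the character
# is absent and the replace is guarded by an "in" check.
t_in = timeit.timeit(lambda: "\n" in s, number=number)

# Unconditional replace on the same string.
t_replace = timeit.timeit(lambda: s.replace("\n", "&#10;"), number=number)

print("in:      %.4f usec per loop" % (t_in / number * 1e6))
print("replace: %.4f usec per loop" % (t_replace / number * 1e6))
```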
msg275969 - (view)	Author: Stefan Behnel (scoder) *	Date: 2016-09-12 06:06
Raymond, you might have meant me when assigning the ticket and not Stefan Krah, but since I'm actually not a core dev, I can't commit the patch myself. See my last comment, though, I reviewed the patch and it should get committed.
msg275972 - (view)	Author: Roundup Robot (python-dev)	Date: 2016-09-12 06:23
New changeset 0a5596315cf0 by Raymond Hettinger in branch '3.5': Issue #17582: xml.etree.ElementTree nows preserves whitespaces in attributes https://hg.python.org/cpython/rev/0a5596315cf0
msg275973 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2016-09-12 06:24
Done.
msg357113 - (view)	Author: Emiliano Heyns (Emiliano Heyns)	Date: 2019-11-20 23:17
I don't see newlines currently preserved in attributes:

    elem = ET.parse(StringIO('')).getroot()
    print(ET.tostring(elem))
msg366245 - (view)	Author: Stefan Behnel (scoder) *	Date: 2020-04-12 12:55
Also see the later fix in issue 39011, where the EOL normalisation in attribute text was removed again. This change was applied in Py3.9.
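Because the behaviour differs between versions, a quick round-trip check shows what the installed interpreter does. The sketch assumes Python 3, where `ET.tostring()` returns bytes by default. Going by the comments in this issue and in issue 39011, the comparison would be expected to print True on Python 3.9 and later, while versions that still normalise `\r` to `\n` on output would print False.

```python
import xml.etree.ElementTree as ET

data = "\t\n\r "

# Serialize with whatever escaping this Python version applies,
# then parse the result back and compare the attribute value.
serialized = ET.tostring(ET.Element("x", attr=data))
round_tripped = ET.fromstring(serialized).attrib["attr"]

print(serialized)
print(round_tripped == data)
```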

History
Date	User	Action	Args
2022-04-11 14:57:43	admin	set	github: 61782
2020-04-12 12:55:44	scoder	set	messages: +
2019-11-20 23:17:11	Emiliano Heyns	set	versions: - Python 3.4, Python 3.5, Python 3.6; nosy: + Emiliano Heyns; messages: +; components: - XML
2016-09-12 06:24:55	rhettinger	set	status: open -> closed; resolution: fixed; messages: +
2016-09-12 06:23:33	python-dev	set	nosy: + python-dev; messages: +
2016-09-12 06:07:56	rhettinger	set	assignee: skrah -> rhettinger
2016-09-12 06:06:23	scoder	set	messages: +
2016-09-12 00:42:12	rhettinger	set	assignee: rhettinger -> skrah; nosy: + skrah
2016-09-12 00:40:34	rhettinger	set	assignee: rhettinger
2016-09-12 00:01:38	kesara	set	versions: + Python 3.6, Python 3.7
2015-09-06 18:53:12	scoder	set	messages: +
2015-09-05 21:44:42	rhettinger	set	messages: +
2015-09-04 03:34:55	duaneg	set	files: + 17582-etree-whitespace-test.patch; nosy: + duaneg; messages: +
2015-04-14 06:18:50	rhettinger	set	nosy: + rhettinger
2015-04-12 02:54:13	benjamin.peterson	set	nosy: + benjamin.peterson; messages: +
2014-10-15 02:56:30	lwcolton	set	messages: +
2014-10-15 02:53:09	lwcolton	set	files: + 17582-etree-whitespace.patch; nosy: + lwcolton; messages: +; keywords: + patch
2014-10-13 22:09:03	ezio.melotti	set	keywords: + easy; stage: needs patch
2014-10-13 22:02:24	piro	set	messages: +
2014-10-05 14:19:38	scoder	set	messages: +; versions: - Python 2.7
2014-10-01 21:11:58	BreamoreBoy	set	nosy: + scoder, eli.bendersky; type: behavior; versions: + Python 3.4, Python 3.5, - Python 3.2
2013-03-30 16:26:34	piro	create