Issue 10149: [doc] Data truncation in expat parser

Created on 2010-10-20 01:43 by Maciek.J, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description
pyxml_error.zip Maciek.J, 2010-10-20 01:43
xml-parse-revisions.py r.david.murray, 2010-10-20 11:54
pyexpat.rst.patch Maciek.J, 2010-10-22 00:45 Patch for docs
pyexpat.rst.patch Maciek.J, 2010-11-13 23:31
pyexpat.rst.patch eric.araujo, 2011-08-19 12:27
Pull Requests
URL Status Linked
PR 31629 open slateny, 2022-03-01 07:55
Messages (11)
msg119184 - Author: Maciek J (Maciek.J) Date: 2010-10-20 01:43
I'm not sure whether this is a Python problem or an expat problem, but I get truncated data while parsing XML documents. This particular project parses an XML file from a Wikipedia dump. The attached files are:
* xml-parse-revisions.py - parser script
* revision-test.xml - input XML
* revision-test.xml.sql - output XML
* revision_create.sql - not really needed for this test case, but attached for completeness
You can see that the output file sometimes contains values for "timestamp" that are too short. Also note that if you add whitespace to the input XML, different timestamps are truncated. My Python is 2.6.6.
msg119202 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-20 11:54
For other reviewers, I'm reposting just his Python program as a text file. Maciek: I don't know enough about expat to comment myself, but is it possible you have an issue similar to issue 10026?
msg119229 - Author: Maciek J (Maciek.J) Date: 2010-10-20 18:05
Hm... It turns out that there is a "buffer_text" attribute: http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.buffer_text
Setting this attribute to "True" seems to solve the problem. It solves my problem, but the docs are still very confusing. I see two things that should be fixed:
1. The CharacterDataHandler description should explicitly note that data may be chunked even if it is short(!).
2. The description of the buffer_text attribute should note that data may also be arbitrarily chunked if this is set to False.
My data _was_not_ chunked at newline characters (as the description suggests). It was chunked in the middle of a sentence (there was no whitespace in it!).
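For readers who want to see the effect described in this message, here is a minimal sketch, not taken from the attached script. The sample document and the helper name `collect_chunks` are assumptions made up for illustration. It parses the same bytes twice, once with buffer_text left at False and once with it set to True, and records every call made to CharacterDataHandler.

```python
import xml.parsers.expat

# Hypothetical sample input: one element whose text contains an entity
# reference and a newline, two places where expat is known to split data.
DOC = b"<timestamp>2010-10-20T01:43:00Z &amp; more\nnext line</timestamp>"


def collect_chunks(xml_bytes, buffer_text):
    """Return the list of strings passed to CharacterDataHandler."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = buffer_text
    # Each call to the handler appends one chunk to the list.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return chunks


unbuffered = collect_chunks(DOC, False)
buffered = collect_chunks(DOC, True)

print("buffer_text=False:", unbuffered)
print("buffer_text=True: ", buffered)
```

With buffer_text off, the text arrives in several pieces whose exact boundaries are up to expat; with it on, this short text arrives in a single call. In both cases, joining the pieces gives the full text.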
msg119342 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-21 22:02
Would you like to turn your suggestions (+ hinting at buffer_text someplace) into a patch for Doc/library/pyexpat.rst?
msg119357 - Author: Maciek J (Maciek.J) Date: 2010-10-22 00:45
I'm not familiar with the rst format, but I hope this works.
msg121005 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-12 01:42
Thanks for the patch. There are a few typos (pices, recive) and markup glitches, which you can fix if you’d like to learn more about the markup, or else leave to someone else. Those glitches are: bad indentation, missing blank line to make a new paragraph, text in backquotes without a :role: (or double backquotes for False). From a checkout, run “make html” to see the result.
msg121006 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-12 01:43
Also, s/receive few calls/receive more than one call/ (clearer IMO).
msg121161 - Author: Maciek J (Maciek.J) Date: 2010-11-13 23:31
Couldn't compile to HTML at the moment, but it should be fine anyway. Note that I didn't want to start a new paragraph (I'm guessing you meant the sentence at line 13 of the patch), as there was no new paragraph in the previous version.
msg142429 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-08-19 12:27
I was about to commit an edited version of your patch (attached) but then I thought we should check whether this isn’t really a bug. I just don’t see why expat would chunk without paying heed to the newlines if it is supposed to chunk at newlines.
msg142439 - Author: Fred Drake (fdrake) (Python committer) Date: 2011-08-19 12:58
Chunking of the data is expected with Expat. There are no promises about *where* chunks are broken; the underlying behavior will break at line endings, but is not limited to that. Setting buffer_text informs the Python wrapper that it's allowed to combine the chunks reported by the Expat library; this was made optional since it could affect working applications (changing the default with the move to Python 3 may have been acceptable, though).
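Since no promise is made about where chunks break, a handler can avoid depending on chunk boundaries by collecting pieces itself and joining them when the element closes. The following is a minimal sketch of that pattern, not code from this issue: the class name `TimestampCollector`, the sample document, and the 7-byte feed size are all assumptions chosen for illustration.

```python
import xml.parsers.expat


class TimestampCollector:
    """Collect the full text of every <timestamp> element."""

    def __init__(self):
        self.timestamps = []
        self._parts = None  # list of chunks while inside <timestamp>
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = self._start
        parser.EndElementHandler = self._end
        parser.CharacterDataHandler = self._chars
        self._parser = parser

    def _start(self, name, attrs):
        if name == "timestamp":
            self._parts = []

    def _chars(self, data):
        # May be called several times for one text node; just accumulate.
        if self._parts is not None:
            self._parts.append(data)

    def _end(self, name):
        if name == "timestamp":
            self.timestamps.append("".join(self._parts))
            self._parts = None

    def feed(self, data, final=False):
        self._parser.Parse(data, final)


# Hypothetical input, fed in small pieces the way a large dump would be
# read block by block from disk.
DOC = (b"<page><revision><timestamp>2010-10-20T01:43:19Z</timestamp></revision>"
       b"<revision><timestamp>2010-10-20T18:05:27Z</timestamp></revision></page>")

collector = TimestampCollector()
for i in range(0, len(DOC), 7):
    collector.feed(DOC[i:i + 7])
collector.feed(b"", final=True)

print(collector.timestamps)
```

This approach works whether or not buffer_text is set, because the values are only assembled once the end tag has been seen.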
msg407204 - Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-11-28 12:59
Eric's patch needs to be converted to a GitHub PR.
History
Date User Action Args
2022-04-11 14:57:07 admin set github: 54358
2022-03-01 07:55:34 slateny set keywords: + patch; nosy: + slateny; pull_requests: + pull_request29752; stage: needs patch -> patch review
2021-11-28 12:59:47 iritkatriel set title: Data truncation in expat parser -> [doc] Data truncation in expat parser; components: + Library (Lib); keywords: + easy, - patch; nosy: + iritkatriel; versions: + Python 3.11, - Python 3.1, Python 2.7, Python 3.2; messages: +
2011-08-19 12:58:55 fdrake set nosy: + fdrake; messages: +
2011-08-19 12:27:36 eric.araujo set files: + pyexpat.rst.patch; messages: +
2010-11-13 23:31:27 Maciek.J set files: + pyexpat.rst.patch; messages: +
2010-11-12 01:43:00 eric.araujo set messages: +
2010-11-12 01:42:08 eric.araujo set messages: +
2010-10-22 00:45:25 Maciek.J set files: + pyexpat.rst.patch; keywords: + patch; messages: +
2010-10-21 22:02:20 eric.araujo set nosy: + eric.araujo; messages: +
2010-10-20 18:45:40 r.david.murray set nosy: + docs@python; versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6; assignee: docs@python; components: + Documentation, - XML; type: behavior; stage: needs patch
2010-10-20 18:05:27 Maciek.J set messages: +
2010-10-20 11:54:32 r.david.murray set files: + xml-parse-revisions.py; nosy: + r.david.murray; messages: +
2010-10-20 01:43:19 Maciek.J create