msg119184 - (view) |
Author: Maciek J (Maciek.J) |
Date: 2010-10-20 01:43 |
Not sure if this is a Python problem or an expat problem, but I get truncated data while parsing XML documents. This particular project parses an XML file from a Wikipedia dump. The attached files are:
* xml-parse-revisions.py - parser script
* revision-test.xml - input XML
* revision-test.xml.sql - output file
* revision_create.sql - not really needed for this test case, but attached for completeness
Note that the output file sometimes contains values that are too short for "timestamp". Also note that if you add whitespace to the input XML, different timestamps are truncated. My Python is 2.6.6. |
|
|
msg119202 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2010-10-20 11:54 |
For other reviewers, I'm reposting just his Python program as a text file. Maciek: I don't know enough about expat to comment myself, but is it possible you have an issue similar to issue 10026? |
|
|
msg119229 - (view) |
Author: Maciek J (Maciek.J) |
Date: 2010-10-20 18:05 |
Hm... It turns out that there is a "buffer_text" attribute: http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.buffer_text Setting this attribute to True seems to solve the problem. It solves my problem, but the docs are still very confusing. I see two things that should be fixed:
1. The CharacterDataHandler description should explicitly note that data may be chunked even if it is short(!).
2. The description of the buffer_text attribute should note that data may also be arbitrarily chunked if it is set to False.
My data _was_not_ chunked at newline characters (as the description suggests). It was chunked in the middle of a sentence (there was no whitespace in it!). |
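For readers who want to see the effect of buffer_text, here is a minimal sketch. It is written for Python 3 rather than the 2.6.6 mentioned above, and the helper name `character_data_calls` and the sample document are made up for illustration; they are not taken from the attached script. It records every call made to CharacterDataHandler with buffering off and on. Where the unbuffered run splits the text is up to expat, so the sketch only prints what it received and makes no claim about the exact pieces.

```python
import xml.parsers.expat


def character_data_calls(document, buffer_text):
    """Parse `document` and return the list of strings passed to
    CharacterDataHandler, one entry per call."""
    parser = xml.parsers.expat.ParserCreate()
    # Allow (or forbid) the Python wrapper to merge adjacent chunks.
    parser.buffer_text = buffer_text
    calls = []
    parser.CharacterDataHandler = calls.append
    # Hand over the whole document in one final Parse() call.
    parser.Parse(document, True)
    return calls


# Hypothetical sample input: a newline and an entity reference in the text.
SAMPLE = "<text>first line\nsecond line &amp; more</text>"

unbuffered = character_data_calls(SAMPLE, buffer_text=False)
buffered = character_data_calls(SAMPLE, buffer_text=True)

print(len(unbuffered), unbuffered)  # may be several pieces
print(len(buffered), buffered)      # contiguous text merged into one call
```

In both runs, joining the pieces gives back the same text; only the number of handler calls differs. Note that the buffer is only as large as the parser's buffer_size, so very long text can still arrive in more than one call.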
|
|
msg119342 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-10-21 22:02 |
Would you like to turn your suggestions (+ hinting at buffer_text someplace) into a patch for Doc/library/pyexpat.rst? |
|
|
msg119357 - (view) |
Author: Maciek J (Maciek.J) |
Date: 2010-10-22 00:45 |
I'm not familiar with the rst format, but I hope this works. |
|
|
msg121005 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-11-12 01:42 |
Thanks for the patch. There are a few typos (pices, recive) and markup glitches, which you can fix if you’d like to learn more about the markup, or else leave to someone else. Those glitches are: bad indentation, missing blank line to make a new paragraph, text in backquotes without a :role: (or double backquotes for False). From a checkout, run “make html” to see the result. |
|
|
msg121006 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-11-12 01:43 |
Also, s/receive few calls/receive more than one call/ (clearer IMO). |
|
|
msg121161 - (view) |
Author: Maciek J (Maciek.J) |
Date: 2010-11-13 23:31 |
Couldn't compile to HTML at the moment, but it should be fine anyway. Note that I didn't want to start a new paragraph (I'm guessing you meant the sentence at line 13 of the patch), as there was no new paragraph in the previous version. |
|
|
msg142429 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2011-08-19 12:27 |
I was about to commit an edited version of your patch (attached) but then I thought we should check whether this isn’t really a bug. I just don’t see why expat would chunk without paying heed to the newlines if it is supposed to chunk at newlines. |
|
|
msg142439 - (view) |
Author: Fred Drake (fdrake)  |
Date: 2011-08-19 12:58 |
Chunking of the data is expected with Expat. There are no promises about *where* chunks are broken; the underlying behavior will break at line endings, but is not limited to that. Setting buffer_text informs the Python wrapper that it's allowed to combine the chunks reported by the Expat library; this was made optional since it could affect working applications (changing the default with the move to Python 3 may have been acceptable, though). |
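Since there are no promises about where chunks break, a handler can also be written so that it does not care. The following is a minimal sketch of that approach, assuming Python 3; the function name `extract_timestamps` and the sample input are hypothetical and not taken from the attached script. Character data is collected in a list while inside a "timestamp" element and joined only when the end tag is reported.

```python
import xml.parsers.expat


def extract_timestamps(pieces):
    """Feed the document piece by piece and return the full text of
    every <timestamp> element, however the character data is chunked."""
    parser = xml.parsers.expat.ParserCreate()
    timestamps = []
    parts = None  # a list while inside <timestamp>, otherwise None

    def start_element(name, attrs):
        nonlocal parts
        if name == "timestamp":
            parts = []

    def character_data(data):
        # May be called several times for one element; just collect.
        if parts is not None:
            parts.append(data)

    def end_element(name):
        nonlocal parts
        if name == "timestamp":
            # Only now is the element's text known to be complete.
            timestamps.append("".join(parts))
            parts = None

    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = character_data
    parser.EndElementHandler = end_element

    for piece in pieces:
        parser.Parse(piece, False)
    parser.Parse("", True)  # signal the end of the document
    return timestamps


# Hypothetical input, deliberately split in the middle of a timestamp.
result = extract_timestamps([
    "<revision><timestamp>2010-10",
    "-20T01:43:00Z</timestamp></revision>",
])
print(result)
```

This works with buffer_text set to either value, because the text is treated as complete only at the end tag and never after a single CharacterDataHandler call.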
|
|
msg407204 - (view) |
Author: Irit Katriel (iritkatriel) *  |
Date: 2021-11-28 12:59 |
Eric's patch needs to be converted to a GitHub PR. |
|
|