Issue 8319: HTMLparser does not handle call to handle_data when a tag contains no data.

Created on 2010-04-05 18:08 by wplappert, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
shannon_data.py wplappert, 2010-04-06 03:01 The test program
Shannon-2010.0.02-extract.html wplappert, 2010-04-06 03:01 The sample data
correct.out wplappert, 2010-04-06 03:02 Expected output, fix applied
wrong.out wplappert, 2010-04-06 03:03 Output with current version of HTMLParser.py
shannon_data-v2.py wplappert, 2010-04-24 20:45 Modified test program
Messages (7)
msg102392 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-05 18:08
When parsing HTML that contains an empty element (a start tag immediately followed by its end tag), a call to handle_data is not issued between handle_starttag and handle_endtag, but afterwards. The problem is in HTMLParser.goahead, where the positions i and j are calculated. The code reads "if i < j: self.handle_data(rawdata[i:j])" but it should be "if i <= j: self.handle_data(rawdata[i:j])". If there is data between the tags, everything works fine. I just checked the trunk of 2.6; this occurs in line 142 of Lib/HTMLParser.py. The size of HTMLParser.py is 13407 bytes, and it is dated 'Feb 26 19:25'.
msg102414 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-05 21:19
The same code can be found in the 3.1 distribution.
msg102430 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-06 03:01
Here is a test program (shannon_data.py), some sample data (Shannon-2010.0.02-extract.html) and two output files (correct.out and wrong.out).
msg102433 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-06 03:21
In short, the correct output should be 2/4/2010;6.3;11.1;0.8;6.5;;7.8;-5 rather than 2/4/2010;6.3;11.1;0.8;6.5;7.8;-5, which means that one (empty) element is missing from the output stream :)
msg102436 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-04-06 04:29
But changing the way HTMLParser.goahead treats tags from "if i < j: self.handle_data(rawdata[i:j])" to "if i <= j: self.handle_data(rawdata[i:j])" is not the correct way to deal with this problem. In principle, the current behaviour seems correct: as there is no data, handle_data is not called. I understand your test case, and I think there is some other way to handle the situation you describe. If you change the above line, many of the existing tests may fail, so that *may not be* the way to go.
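To illustrate the behaviour discussed here, the following is a minimal sketch that records the parser callbacks in order. It assumes Python 3's `html.parser` module rather than the 2.6 `HTMLParser` module from the report; `EventLogger` and `log_events` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every callback the parser issues."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # An empty cell followed by a cell that contains text.
    for event in log_events("<td></td><td>7.8</td>"):
        print(event)
```

For the empty cell only the start and end events are recorded; a data event appears only for the cell that contains text, so a handler that relies on handle_data alone never sees the empty cell.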
msg104128 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-24 20:45
I have modified my program so that it checks for data/no-data at the end of a td element (td_end). Now it produces the correct result. I think you can close this issue.
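The attached shannon_data-v2.py is not reproduced in this thread, so the following is only a sketch of the general idea, assuming Python 3's `html.parser`: text is buffered per cell and the cell is emitted when its end tag arrives, whether or not handle_data was ever called. `RowExtractor` and `extract_rows` are hypothetical names, and the semicolon-joined output merely mirrors the format quoted earlier.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Hypothetical sketch: emit one field per <td>, even when it is empty."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a ';'-joined string
        self._row = None    # fields of the row being parsed, or None
        self._cell = None   # text chunks of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only called when there is text; empty cells never reach this point.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Decide data/no-data here: an empty buffer yields an empty field.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(";".join(self._row))
            self._row = None
            self._cell = None


def extract_rows(markup):
    """Parse markup and return one ';'-joined string per table row."""
    parser = RowExtractor()
    parser.feed(markup)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    sample = "<tr><td>6.5</td><td></td><td>7.8</td><td>-5</td></tr>"
    print(extract_rows(sample))
```

Because the decision is made in handle_endtag, the parser itself does not need to change, which is consistent with the resolution reached in this issue.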
msg104136 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-04-25 00:30
Thanks. Closing on submitter's note.
History
Date User Action Args
2022-04-11 14:56:59 admin set github: 52566
2010-04-25 00:30:27 orsenthil set status: open -> closed; resolution: not a bug; messages: +; stage: test needed -> resolved
2010-04-24 20:45:08 wplappert set files: + shannon_data-v2.py; messages: +
2010-04-21 22:09:31 eric.araujo set nosy: + eric.araujo
2010-04-12 13:57:42 pythonhacker set nosy: + pythonhacker
2010-04-06 04:29:51 orsenthil set messages: +
2010-04-06 03:21:34 wplappert set messages: +
2010-04-06 03:03:24 wplappert set files: + wrong.out
2010-04-06 03:02:33 wplappert set files: + correct.out
2010-04-06 03:01:47 wplappert set files: + Shannon-2010.0.02-extract.html
2010-04-06 03:01:11 wplappert set files: + shannon_data.py; messages: +
2010-04-05 21:19:58 wplappert set messages: + versions: + Python 3.1
2010-04-05 19:24:27 r.david.murray set priority: normal; nosy: + orsenthil; keywords: + easy; stage: test needed
2010-04-05 18:28:49 wplappert set title: HTMLparser does not handle call to handle_data when a tag contains nor data. -> HTMLparser does not handle call to handle_data when a tag contains no data.
2010-04-05 18:08:54 wplappert create