Issue 8319: HTMLparser does not handle call to handle_data when a tag contains no data.

Created on 2010-04-05 18:08 by wplappert, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
shannon_data.py	wplappert,2010-04-06 03:01	The test program
Shannon-2010.0.02-extract.html	wplappert,2010-04-06 03:01	the sample data
correct.out	wplappert,2010-04-06 03:02	expected output, fix applied
wrong.out	wplappert,2010-04-06 03:03	output with current version of HTMLParser.py
shannon_data-v2.py	wplappert,2010-04-24 20:45	modified test program

Messages (7)
msg102392 - (view)	Author: Winfried Plappert (wplappert)	Date: 2010-04-05 18:08
When parsing HTML and having a string along the lines of an empty element (a start tag immediately followed by its end tag), a call to handle_data is not issued between handle_starttag and handle_endtag, but afterwards. The problem is in HTMLParser.goahead, where the positions i and j are calculated. The code reads
    if i < j: self.handle_data(rawdata[i:j])
but it should be
    if i <= j: self.handle_data(rawdata[i:j])
If there is data between the start tag and the end tag, everything works fine. I just checked the trunk of 2.6; this occurs in line 142 of Lib/HTMLParser.py. The size of HTMLParser.py is 13407 bytes, and it is dated 'Feb 26 19:25'.
msg102414 - (view)	Author: Winfried Plappert (wplappert)	Date: 2010-04-05 21:19
The same code can be found in the 3.1 distribution.
msg102430 - (view)	Author: Winfried Plappert (wplappert)	Date: 2010-04-06 03:01
Here is a test program (shannon_data.py), some sample data (Shannon-2010.0.02-extract.html) and two output files (correct.out and wrong.out).
msg102433 - (view)	Author: Winfried Plappert (wplappert)	Date: 2010-04-06 03:21
In short, the correct output should be
    2/4/2010;6.3;11.1;0.8;6.5;;7.8;-5
versus
    2/4/2010;6.3;11.1;0.8;6.5;7.8;-5
which implies that one element is missing in the output stream :)
msg102436 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2010-04-06 04:29
But changing the way HTMLParser.goahead treats tags from
    if i < j: self.handle_data(rawdata[i:j])
to
    if i <= j: self.handle_data(rawdata[i:j])
is not the correct way to deal with this problem. Theoretically, what it is currently doing seems correct: as there is no data, don't call handle_data. I can understand your test case, and I think there is some other way to handle the case you are mentioning. If you change the above line, many of the existing tests may fail, so that may not be the way to go.
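As an illustration of the behaviour discussed here, the following is a minimal sketch that records the order of parser callbacks. It assumes Python 3's html.parser module (the issue itself concerns the Python 2 HTMLParser module); the class name EventLogger, the function trace and the sample markup are hypothetical and not taken from the attached files.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every parser callback in order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        # Only called when there is text between two pieces of markup.
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def trace(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Sample markup is an assumption: one filled cell, one empty cell.
    for event in trace("<td>6.5</td><td></td>"):
        print(event)
```

For the empty cell, the trace contains a start event directly followed by an end event, with no data event in between.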
msg104128 - (view)	Author: Winfried Plappert (wplappert)	Date: 2010-04-24 20:45
I have modified my program so that it checks for data/no-data at the end of a td element (td_end). Now it produces the correct result. I think you can close this issue.
msg104136 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2010-04-25 00:30
Thanks. Closing on submitter's note.
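
The workaround described in msg104128 (deciding data/no-data when the end tag arrives, rather than relying on handle_data being called) can be sketched roughly as follows. This is a minimal sketch assuming Python 3's html.parser module; the class RowExtractor, the function extract_row and the sample row are hypothetical and are not taken from shannon_data-v2.py or the attached sample data.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect one string per <td>, using "" for empty cells (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        # List of text chunks while inside a <td>, otherwise None.
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current = []

    def handle_data(self, data):
        # May be called zero, one, or several times per cell.
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._current is not None:
            # Decide data/no-data here: an empty cell yields "".
            self.cells.append("".join(self._current).strip())
            self._current = None


def extract_row(markup):
    """Return the cell contents of the markup joined by semicolons."""
    parser = RowExtractor()
    parser.feed(markup)
    parser.close()
    return ";".join(parser.cells)


if __name__ == "__main__":
    # Assumed sample row with one empty cell between 6.5 and 7.8.
    row = "<tr><td>6.5</td><td></td><td>7.8</td><td>-5</td></tr>"
    print(extract_row(row))
```

Because the cell is appended in handle_endtag, an empty cell still contributes an empty field to the output, which keeps the columns aligned without changing the parser itself.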

History
Date	User	Action	Args
2022-04-11 14:56:59	admin	set	github: 52566
2010-04-25 00:30:27	orsenthil	set	status: open -> closed; resolution: not a bug; messages: +; stage: test needed -> resolved
2010-04-24 20:45:08	wplappert	set	files: + shannon_data-v2.py; messages: +
2010-04-21 22:09:31	eric.araujo	set	nosy: + eric.araujo
2010-04-12 13:57:42	pythonhacker	set	nosy: + pythonhacker
2010-04-06 04:29:51	orsenthil	set	messages: +
2010-04-06 03:21:34	wplappert	set	messages: +
2010-04-06 03:03:24	wplappert	set	files: + wrong.out
2010-04-06 03:02:33	wplappert	set	files: + correct.out
2010-04-06 03:01:47	wplappert	set	files: + Shannon-2010.0.02-extract.html
2010-04-06 03:01:11	wplappert	set	files: + shannon_data.py; messages: +
2010-04-05 21:19:58	wplappert	set	messages: +; versions: + Python 3.1
2010-04-05 19:24:27	r.david.murray	set	priority: normal; nosy: + orsenthil; keywords: + easy; stage: test needed
2010-04-05 18:28:49	wplappert	set	title: HTMLparser does not handle call to handle_data when a tag contains nor data. -> HTMLparser does not handle call to handle_data when a tag contains no data.
2010-04-05 18:08:54	wplappert	create