When parsing HTML that contains an empty element (a start tag immediately followed by its end tag, with nothing in between), no call to handle_data is issued between handle_starttag and handle_endtag; the next call only comes afterwards. The problem is in HTMLParser.goahead, where the positions i and j are calculated. The code reads
    if i < j: self.handle_data(rawdata[i:j])
but it should be
    if i <= j: self.handle_data(rawdata[i:j])
If there is data between the start tag and the end tag, everything works fine. I just checked the 2.6 trunk; this occurs in line 142 of Lib/HTMLParser.py. The size of HTMLParser.py is 13407 bytes, and it is dated 'Feb 26 19:25'.
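The reported behaviour can be reproduced with a small recorder, sketched here under a few assumptions: it uses the Python 3 module name html.parser (the module is HTMLParser in 2.6), the EventRecorder class and record function are hypothetical helpers, and the td markup is only a guess at the kind of input involved.

```python
from html.parser import HTMLParser  # named HTMLParser in Python 2.6


class EventRecorder(HTMLParser):
    """Hypothetical helper: records every callback in the order it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def record(html):
    """Parse html and return the list of recorded callbacks."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


# Assumed inputs: an empty cell and a cell that contains text.
empty_events = record("<td></td>")
filled_events = record("<td>6.5</td>")
print(empty_events)
print(filled_events)
```

For the empty cell, no data event appears between the start and end events; for the filled cell, one does.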
In short, the correct output should be
    2/4/2010;6.3;11.1;0.8;6.5;;7.8;-5
rather than
    2/4/2010;6.3;11.1;0.8;6.5;7.8;-5
which means that one (empty) element is missing from the output stream :)
But changing the way HTMLParser.goahead treats tags, from
    if i < j: self.handle_data(rawdata[i:j])
to
    if i <= j: self.handle_data(rawdata[i:j])
is not the correct way to deal with this problem. What it is currently doing seems correct: as there is no data, handle_data is not called. I understand your test case, and I think there is some other way to handle the case you are mentioning. If you change the above line, many of the existing tests may fail, so that *may not be* the way to go.
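To see why the `<=` variant is likely to disturb existing behaviour, here is a toy loop that only imitates the i/j bookkeeping. It is not the real goahead: the TAG regex and the data_calls function are assumptions made purely for illustration.

```python
import re

# Simplistic stand-in for tag detection; the real parser is far more careful.
TAG = re.compile(r"<[^>]*>")


def data_calls(rawdata, inclusive):
    """Return the strings a handle_data-style callback would receive.

    i is the end of the previous tag, j the start of the next one.
    inclusive=False models `if i < j`, inclusive=True models `if i <= j`.
    """
    calls = []
    i = 0
    for match in TAG.finditer(rawdata):
        j = match.start()
        if (i <= j) if inclusive else (i < j):
            calls.append(rawdata[i:j])  # empty string whenever i == j
        i = match.end()
    return calls


print(data_calls("<tr><td></td></tr>", inclusive=False))
print(data_calls("<tr><td></td></tr>", inclusive=True))
```

In this toy model the inclusive test fires with an empty string at every boundary between adjacent tags, not only inside the empty cell, which is the kind of side effect that could break existing tests.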
I have modified my program so that it checks for data/no-data at the end of a td element (td_end). Now it produces the correct result. I think you can close this issue.
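A workaround along these lines might look like the sketch below. The actual program was not posted, so the RowCollector class, the row_to_line function and the sample row are all hypothetical; the sketch again uses the Python 3 module name html.parser.

```python
from html.parser import HTMLParser  # named HTMLParser in Python 2.6


class RowCollector(HTMLParser):
    """Hypothetical collector: one field per td element, empty or not."""

    def __init__(self):
        super().__init__()
        self.fields = []
        self._cell = None  # list of text chunks while inside a td, else None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        # Decide data/no-data here: an empty cell still yields a field.
        if tag == "td" and self._cell is not None:
            self.fields.append("".join(self._cell))
            self._cell = None


def row_to_line(html):
    """Parse one table row and join its cells with semicolons."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return ";".join(parser.fields)


# Assumed sample row with an empty third data cell.
line = row_to_line("<tr><td>2/4/2010</td><td>6.5</td><td></td><td>7.8</td></tr>")
print(line)
```

Because the field is appended at the end tag rather than in handle_data, the empty cell shows up as an empty field between two semicolons without any change to the parser itself.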
priority: normal
nosy: + orsenthil
keywords: + easy
stage: test needed
2010-04-05 18:28:49
wplappert
set
title: HTMLparser does not handle call to handle_data when a tag contains nor data. -> HTMLparser does not handle call to handle_data when a tag contains no data.