Issue 7114: HTMLParser doesn't handle CDATA sections


Created on 2009-10-12 21:32 by ggbaker, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg93905 - Author: Greg Baker (ggbaker) Date: 2009-10-12 21:32
I believe what I'm seeing here is somewhat related to issue 670664, but is easier to handle because of the CDATA structure. Basically, HTMLParser doesn't recognize CDATA sections at all, so their content is incorrectly parsed like normal data. The following is an attempt to parse (a snippet of) valid XHTML, but it raises an HTMLParseError.

    data = """"""
    from HTMLParser import HTMLParser
    parser = HTMLParser()
    parser.feed(data)
msg96164 - Author: Denis (Denis) Date: 2009-12-09 01:16
CDATA sections are part of the XML specification: http://www.w3.org/TR/REC-xml/#sec-cdata-sect. HTML is not XML, so HTMLParser does the right thing here.
msg99604 - Author: Florent Xicluna (flox) * (Python committer) Date: 2010-02-19 23:50
There's no bug here, afaict.
msg100852 - Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-11 13:56
And to clarify, XHTML is a reformulation of HTML4 using XML syntax, so you should use an XML parser to parse it, not an HTML parser. The formats are related, but not identical.
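For readers landing here, a minimal sketch of the suggestion above, using the standard library's xml.etree.ElementTree. The XHTML snippet is hypothetical (the reporter's original data did not survive in this archive), and the variable names are illustrative only. It shows an XML parser treating the CDATA section as plain character data.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML snippet, NOT the reporter's original data.
# The CDATA section lets the script contain '<' and '&' unescaped.
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><script type="text/javascript"><![CDATA[
if (a < b && b > c) { run(); }
]]></script></head>
<body><p>Hello</p></body>
</html>"""

# An XML parser understands CDATA, so this parses without error.
root = ET.fromstring(xhtml)

# XHTML elements live in the XHTML namespace; map it to a prefix for lookups.
ns = {"x": "http://www.w3.org/1999/xhtml"}

# The CDATA content comes back as ordinary element text.
script_text = root.find("x:head/x:script", ns).text
paragraph_text = root.find("x:body/x:p", ns).text

print(script_text.strip())
print(paragraph_text)
```

This only works for well-formed XHTML; an XML parser will reject ordinary tag-soup HTML, which is the case HTMLParser is meant for.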
History
Date User Action Args
2022-04-11 14:56:53 admin set github: 51363
2010-03-11 13:56:58 effbot set messages: +
2010-02-26 11:23:41 flox set status: pending -> closed
2010-02-20 00:57:36 flox set status: open -> pending
2010-02-20 00:57:13 flox set status: pending -> open; assignee: effbot ->
2010-02-19 23:50:06 flox set status: open -> pending; priority: normal; components: + XML, - Library (Lib); assignee: effbot; nosy: + effbot, flox; messages: +; resolution: not a bug; stage: resolved
2009-12-09 01:16:11 Denis set nosy: + Denis; messages: +
2009-10-12 21:32:50 ggbaker create