Issue 7114: HTMLParser doesn't handle <![CDATA[ ... ]]>

Created on 2009-10-12 21:32 by ggbaker, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg93905	Author: Greg Baker (ggbaker)	Date: 2009-10-12 21:32
I believe what I'm seeing here is somewhat related to issue 670664, but is easier to handle because of the CDATA structure. Basically, HTMLParser doesn't recognize CDATA sections at all, so their content is incorrectly parsed like normal data. The following is an attempt to parse (a snippet of) valid XHTML, but it raises an HTMLParseError.

    data = """"""
    from HTMLParser import HTMLParser
    parser = HTMLParser()
    parser.feed(data)
msg96164	Author: Denis (Denis)	Date: 2009-12-09 01:16
CDATA sections are part of the XML specification: http://www.w3.org/TR/REC-xml/#sec-cdata-sect. HTML is not XML, so HTMLParser does the right thing here.
msg99604	Author: Florent Xicluna (flox) *	Date: 2010-02-19 23:50
There's no bug here, afaict.
msg100852	Author: Fredrik Lundh (effbot) *	Date: 2010-03-11 13:56
And to clarify, XHTML is a reformulation of HTML4 using XML syntax, so you should use an XML parser to parse it, not an HTML parser. The formats are related, but not identical.
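A minimal sketch of the advice above, assuming well-formed XHTML and Python 3's standard xml.etree.ElementTree module (the original report used the Python 2 HTMLParser module). The sample document, SAMPLE_XHTML, and the helper extract_script_text are hypothetical names made up for illustration; they are not taken from the report. An XML parser treats the CDATA section as plain character data, so the unescaped < and && inside it do not cause a parse error.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML sample; not the snippet from the original report.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>CDATA example</title>
    <script type="text/javascript"><![CDATA[
      if (a < b && b > 0) { run(); }
    ]]></script>
  </head>
  <body><p>Hello</p></body>
</html>"""

# XHTML elements live in this namespace, so lookups must be qualified.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_script_text(xhtml):
    """Return the stripped text of the first <script> in <head>."""
    root = ET.fromstring(xhtml)
    script = root.find("x:head/x:script", XHTML_NS)
    # The CDATA content arrives as ordinary element text.
    return script.text.strip()


if __name__ == "__main__":
    print(extract_script_text(SAMPLE_XHTML))
```

Real-world XHTML that relies on named entities defined in the XHTML DTD may need extra handling, since this sketch does not load any DTD.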

History
Date	User	Action	Args
2022-04-11 14:56:53	admin	set	github: 51363
2010-03-11 13:56:58	effbot	set	messages: +
2010-02-26 11:23:41	flox	set	status: pending -> closed
2010-02-20 00:57:36	flox	set	status: open -> pending
2010-02-20 00:57:13	flox	set	status: pending -> open; assignee: effbot ->
2010-02-19 23:50:06	flox	set	status: open -> pending; priority: normal; components: + XML, - Library (Lib); assignee: effbot; nosy: + effbot, flox; messages: +; resolution: not a bug; stage: resolved
2009-12-09 01:16:11	Denis	set	nosy: + Denis; messages: +
2009-10-12 21:32:50	ggbaker	create