Issue 13358: HTMLParser incorrectly handles cdata elements.

The HTML tag at the bottom of this page is correctly identified as having cdata-like properties and triggers set_cdata_mode(). Because of those properties, the only way to end the data segment is with the element's own closing tag; NO OTHER tag can close this data segment. Currently, in cdata mode, HTMLParser uses this regular expression to close the script tag: re.compile(r'<(/|\Z)'). That expression matches any "</", not just the script element's closing tag. The script tag here sets a variable whose data contains "", which terminates the script tag prematurely.
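The premature match can be reproduced with the regular expression alone. The following is a minimal sketch, not the parser's actual code path: the `sample` string is a hypothetical stand-in for the script body (the original markup is not reproduced in this report), and the variable names are illustrative only.

```python
import re

# The cdata-mode terminator quoted in the report: "</", or "<" at end of input.
generic_end = re.compile(r'<(/|\Z)')

# Hypothetical script body: a string literal containing some other closing tag,
# followed by the real closing tag of the script element.
sample = 'var s = "</b>";</script>'

# Where the generic terminator stops versus where the script really ends.
premature_at = generic_end.search(sample).start()
real_end_at = sample.index('</script>')

print("generic terminator stops at offset", premature_at)
print("script element really ends at offset", real_end_at)
```

The generic pattern stops at the first "</" inside the string literal, so everything after that point would be treated as markup rather than script data.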

I have written and tested the following patch on my system:

    import re
    import HTMLParser

    # used to terminate cdata elements
    endtagfind_script = re.compile(r'(?i)</\s*script\s*>')
    endtagfind_style = re.compile(r'(?i)</\s*style\s*>')

    class html_patch(HTMLParser.HTMLParser):
        # Internal -- sets the proper tag terminator based on cdata element type
        def set_cdata_mode(self, tag):
            # We check if the element is either a style or a script,
            # based on self.CDATA_CONTENT_ELEMENTS
            if tag == "style":
                self.interesting = endtagfind_style
            elif tag == "script":
                self.interesting = endtagfind_script
            else:
                self.error("Unknown cdata type:" + tag)  # should never happen
            self.cdata_tag = tag
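The idea behind the patch, choosing the terminator from the element that opened the cdata segment, can be sketched independently of HTMLParser. This is an illustration under stated assumptions, not the patch itself: `CDATA_TERMINATORS` and `cdata_end` are hypothetical names, the sample input is made up, and the code is written for Python 3 rather than the Python 2 module the patch targets.

```python
import re

# Per-element terminators, mirroring the two patterns in the patch above.
CDATA_TERMINATORS = {
    "script": re.compile(r"</\s*script\s*>", re.IGNORECASE),
    "style": re.compile(r"</\s*style\s*>", re.IGNORECASE),
}


def cdata_end(tag, text, start=0):
    """Return the offset where the cdata content of `tag` ends, or -1 if unterminated."""
    try:
        pattern = CDATA_TERMINATORS[tag.lower()]
    except KeyError:
        raise ValueError("Unknown cdata type: " + tag) from None
    match = pattern.search(text, start)
    return match.start() if match else -1


# Hypothetical script body containing another element's closing tag.
body = 'var s = "</b>";</SCRIPT >'
end = cdata_end("script", body)
content = body[:end]
print(repr(content))
```

With a tag-specific terminator, the "</b>" inside the string literal stays part of the script data, and only the matching closing tag (in any letter case, with optional whitespace) ends the segment.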

This cdata tag isn't parsed properly by HTMLParser, but it works fine in a browser: