cpython: 7052eb923fb8
--- a/Doc/library/htmlparser.rst
+++ b/Doc/library/htmlparser.rst
@@ -22,7 +22,7 @@
--------------
-This module defines a class :class:`HTMLParser` which serves as the basis for
+This module defines a class :class:`.HTMLParser` which serves as the basis for
 parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
 Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
 in :mod:`sgmllib`.
@@ -30,11 +30,12 @@ in :mod:`sgmllib`.
 .. class:: HTMLParser()
 
-   An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags
-   begin and end.  The :class:`HTMLParser` class is meant to be overridden by the
-   user to provide a desired behavior.
+   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
+   when start tags, end tags, text, comments, and other markup elements are
+   encountered.  The user should subclass :class:`.HTMLParser` and override its
+   methods to implement the desired behavior.
+
+   The :class:`.HTMLParser` class is instantiated without arguments.
 
    Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
    match start tags or call the end-tag handler for elements which are closed
@@ -42,22 +43,59 @@ in :mod:`sgmllib`.
 An exception is defined as well:
 
 .. exception:: HTMLParseError
 
-   Exception raised by the :class:`HTMLParser` class when it encounters an error
-   while parsing.  This exception provides three attributes: :attr:`msg` is a brief
-   message explaining the error, :attr:`lineno` is the number of the line on which
-   the broken construct was detected, and :attr:`offset` is the number of
+   :class:`.HTMLParser` is able to handle broken markup, but in some cases it
+   might raise this exception when it encounters an error while parsing.
+   This exception provides three attributes: :attr:`msg` is a brief
+   message explaining the error, :attr:`lineno` is the number of the line on
+   which the broken construct was detected, and :attr:`offset` is the number of
    characters into the line at which the construct starts.
-:class:`HTMLParser` instances have the following methods:
+
+Example HTML Parser Application
+-------------------------------
+
+As a basic example, below is a simple HTML parser that uses the
+:class:`.HTMLParser` class to print out start tags, end tags and data
+as they are encountered::
+
+   from HTMLParser import HTMLParser
+
+   # create a subclass and override the handler methods
+   class MyHTMLParser(HTMLParser):
+       def handle_starttag(self, tag, attrs):
+           print "Encountered a start tag:", tag
+       def handle_endtag(self, tag):
+           print "Encountered an end tag :", tag
+       def handle_data(self, data):
+           print "Encountered some data  :", data
+
+   # instantiate the parser and feed it some HTML
+   parser = MyHTMLParser()
+   parser.feed('<html><head><title>Test</title></head>'
+               '<body><h1>Parse me!</h1></body></html>')
+
+The output will then be::
+
+   Encountered a start tag: html
+   Encountered a start tag: head
+   Encountered a start tag: title
+   Encountered some data  : Test
+   Encountered an end tag : title
+   Encountered an end tag : head
+   Encountered a start tag: body
+   Encountered a start tag: h1
+   Encountered some data  : Parse me!
+   Encountered an end tag : h1
+   Encountered an end tag : body
+   Encountered an end tag : html
-.. method:: HTMLParser.reset()
+:class:`.HTMLParser` Methods
+----------------------------
 
-   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
-   instantiation time.
+:class:`.HTMLParser` instances have the following methods:
 
 .. method:: HTMLParser.feed(data)
@@ -73,7 +111,13 @@ An exception is defined as well:
    Force processing of all buffered data as if it were followed by an end-of-file
    mark.  This method may be redefined by a derived class to define additional
    processing at the end of the input, but the redefined version should always call
+
+.. method:: HTMLParser.reset()
+
 .. method:: HTMLParser.getpos()
@@ -89,22 +133,34 @@ An exception is defined as well:
    attributes can be preserved, etc.).
+The following methods are called when data or markup elements are encountered
+and they are meant to be overridden in a subclass. The base class
+implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
+
+
 .. method:: HTMLParser.handle_starttag(tag, attrs)
 
-   This method is called to handle the start of a tag.  It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
+   This method is called to handle the start of a tag (e.g. ``<div id="main">``).
 
    The *tag* argument is the name of the tag converted to lower case. The *attrs*
    argument is a list of ``(name, value)`` pairs containing the attributes found
    inside the tag's ``<>`` brackets.  The *name* will be translated to lower case,
    and quotes in the *value* have been removed, and character and entity references
-   have been replaced.  For instance, for the tag ``<A
-   HREF="http://www.cwi.nl/">``, this method would be called as
-   ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
+   have been replaced.
+
+   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
+   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
 
    .. versionchanged:: 2.6
-      All entity references from :mod:`htmlentitydefs` are now replaced in the attribute
-      values.
+      All entity references from :mod:`htmlentitydefs` are now replaced in the
+      attribute values.
+
+
+.. method:: HTMLParser.handle_endtag(tag)
+
+   This method is called to handle the end tag of an element (e.g. ``</div>``).
+
+   The *tag* argument is the name of the tag converted to lower case.
 .. method:: HTMLParser.handle_startendtag(tag, attrs)
@@ -115,94 +171,175 @@ An exception is defined as well:
    implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
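The start-tag and start-end-tag behaviour described above can be checked with a small sketch. It assumes Python 3, where the class lives in ``html.parser`` rather than the Python 2 ``HTMLParser`` module this patch documents; ``TagRecorder`` and its ``events`` list are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser  # Python 3 location (assumption; the patch targets Python 2)


class TagRecorder(HTMLParser):
    """Hypothetical helper: records tag events instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased; quotes are removed from values
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


recorder = TagRecorder()
recorder.feed('<A HREF="http://www.cwi.nl/">')
# handle_startendtag is not overridden, so the base class forwards the
# XHTML-style empty tag to handle_starttag and then handle_endtag.
recorder.feed('<br/>')
recorder.close()

for event in recorder.events:
    print(event)
```

Overriding ``handle_startendtag`` directly would be the way to treat empty elements differently from an ordinary start tag followed by an end tag.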
-.. method:: HTMLParser.handle_endtag(tag)
+.. method:: HTMLParser.handle_data(data)
 
-   This method is called to handle the end tag of an element.  It is intended to be
-   overridden by a derived class; the base class implementation does nothing.  The
-   *tag* argument is the name of the tag converted to lower case.
+   This method is called to process arbitrary data (e.g. text nodes and the
+   content of ``<script>...</script>`` and ``<style>...</style>``).
 
-.. method:: HTMLParser.handle_data(data)
+.. method:: HTMLParser.handle_entityref(name)
 
-   This method is called to process arbitrary data (e.g. the content of
-   ``<script>...</script>`` and ``<style>...</style>``).  It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
+   This method is called to process a named character reference of the form
+   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
+   (e.g. ``'gt'``).
 .. method:: HTMLParser.handle_charref(name)
 
-   This method is called to process a character reference of the form ``&#ref;``.
-   It is intended to be overridden by a derived class; the base class
-   implementation does nothing.
-
-
-.. method:: HTMLParser.handle_entityref(name)
-
-   This method is called to process a general entity reference of the form
-   ``&name;`` where *name* is an general entity reference.  It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
+   This method is called to process decimal and hexadecimal numeric character
+   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
+   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
+   in this case the method will receive ``'62'`` or ``'x3E'``.
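As a rough illustration of the two reference handlers, the sketch below decodes ``&gt;``, ``&#62;`` and ``&#x3E;`` to the same character. It assumes Python 3 (``html.parser`` and ``html.entities`` instead of ``HTMLParser`` and ``htmlentitydefs``) and passes ``convert_charrefs=False``, a Python 3.4+ constructor argument, because otherwise the Python 3 parser converts references itself and does not call these handlers for text. ``RefDecoder`` is a hypothetical name.

```python
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs (assumption)
from html.parser import HTMLParser


class RefDecoder(HTMLParser):
    """Hypothetical helper: collects the character each reference stands for."""

    def __init__(self):
        # Assumption: Python 3.4+; with the default convert_charrefs=True the
        # handlers below are not called for references found in text.
        super().__init__(convert_charrefs=False)
        self.chars = []

    def handle_entityref(self, name):
        # name is the text between '&' and ';', e.g. 'gt'
        self.chars.append(chr(name2codepoint[name]))

    def handle_charref(self, name):
        # name is '62' for &#62; and 'x3E' for &#x3E;
        if name.startswith(("x", "X")):
            self.chars.append(chr(int(name[1:], 16)))
        else:
            self.chars.append(chr(int(name)))


decoder = RefDecoder()
decoder.feed("&gt;&#62;&#x3E;")
decoder.close()
print(decoder.chars)
```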
 .. method:: HTMLParser.handle_comment(data)
 
-   This method is called when a comment is encountered.  The *comment* argument is
-   a string containing the text between the ``--`` and ``--`` delimiters, but not
-   the delimiters themselves.  For example, the comment ``<!--text-->`` will cause
-   this method to be called with the argument ``'text'``.  It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
+   This method is called when a comment is encountered (e.g. ``<!--comment-->``).
+
+   For example, the comment ``<!-- comment -->`` will cause this method to be
+   called with the argument ``' comment '``.
+
+   The content of Internet Explorer conditional comments (condcoms) will also be
+   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
+   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
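The comment handling described above can be sketched as follows, assuming Python 3's ``html.parser`` module (the patch itself documents the Python 2 ``HTMLParser`` module). ``CommentCollector`` is a hypothetical name.

```python
from html.parser import HTMLParser  # Python 3 location (assumption)


class CommentCollector(HTMLParser):
    """Hypothetical helper: stores the text of every comment it sees."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data excludes the '<!--' and '-->' delimiters but keeps inner whitespace
        self.comments.append(data)


collector = CommentCollector()
collector.feed("<!-- comment -->")
collector.feed("<!--[if IE 9]>IE9-specific content<![endif]-->")
collector.close()
print(collector.comments)
```

The conditional comment is not interpreted in any way; its whole body is handed over as a single comment string.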
 .. method:: HTMLParser.handle_decl(decl)
 
-   Method called when an SGML ``doctype`` declaration is read by the parser.
-   The *decl* parameter will be the entire contents of the declaration inside
-   the ``<!...>`` markup.  It is intended to be overridden by a derived class;
-   the base class implementation does nothing.
-
-
-.. method:: HTMLParser.unknown_decl(data)
-
-   Method called when an unrecognized SGML declaration is read by the parser.
-   The *data* parameter will be the entire contents of the declaration inside
-   the ``<!...>`` markup.  It is sometimes useful to be overridden by a
-   derived class; the base class implementation throws an :exc:`HTMLParseError`.
+   The *decl* parameter will be the entire contents of the declaration inside
+   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
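A minimal sketch of the declaration handler, again assuming Python 3's ``html.parser``; ``DeclCollector`` is a hypothetical name.

```python
from html.parser import HTMLParser  # Python 3 location (assumption)


class DeclCollector(HTMLParser):
    """Hypothetical helper: stores the contents of doctype declarations."""

    def __init__(self):
        super().__init__()
        self.decls = []

    def handle_decl(self, decl):
        # decl is the text between '<!' and '>'
        self.decls.append(decl)


decl_collector = DeclCollector()
decl_collector.feed("<!DOCTYPE html>")
decl_collector.close()
print(decl_collector.decls)
```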
 .. method:: HTMLParser.handle_pi(data)
 
-   Method called when a processing instruction is encountered.  The *data*
-   parameter will contain the entire processing instruction. For example, for the
+   This method is called when a processing instruction is encountered.  The *data*
+   parameter will contain the entire processing instruction. For example, for the
    processing instruction ``<?proc color='red'>``, this method would be called as
-   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
-   class; the base class implementation does nothing.
+   ``handle_pi("proc color='red'")``.
 
-      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
+      The :class:`.HTMLParser` class uses the SGML syntactic rules for processing
       instructions.  An XHTML processing instruction using the trailing ``'?'`` will
       cause the ``'?'`` to be included in *data*.
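The note about the trailing ``'?'`` can be illustrated with a short sketch. It assumes Python 3's ``html.parser``; ``PICollector`` is a hypothetical name, and the ``<?xml ...?>`` input is only an illustrative XHTML-style instruction.

```python
from html.parser import HTMLParser  # Python 3 location (assumption)


class PICollector(HTMLParser):
    """Hypothetical helper: stores the body of each processing instruction."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        # data is everything between '<?' and the closing '>'
        self.instructions.append(data)


pi_collector = PICollector()
pi_collector.feed("<?proc color='red'>")
# XHTML-style instruction: the trailing '?' ends up inside data
pi_collector.feed('<?xml version="1.0"?>')
pi_collector.close()
print(pi_collector.instructions)
```

A subclass that only cares about the XHTML form could strip a trailing ``'?'`` itself before using the value.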
-.. _htmlparser-example:
+.. method:: HTMLParser.unknown_decl(data)
+
+   This method is called when an unrecognized declaration is read by the parser.
+
+   The *data* parameter will be the entire contents of the declaration inside
+   the ``<![...]>`` markup.  It is sometimes useful to be overridden by a
+   derived class.
 
-Example HTML Parser Application
--------------------------------
+
+.. _htmlparser-examples:
 
-As a basic example, below is a simple HTML parser that uses the
-:class:`HTMLParser` class to print out start tags, end tags and data
-as they are encountered::
+Examples
+--------
+
+The following class implements a parser that will be used to illustrate more
+examples::
    from HTMLParser import HTMLParser
+   from htmlentitydefs import name2codepoint
 
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
-           print "Encountered a start tag:", tag
+           print "Start tag:", tag
+           for attr in attrs:
+               print "     attr:", attr
        def handle_endtag(self, tag):
-           print "Encountered an end tag:", tag
+           print "End tag  :", tag
        def handle_data(self, data):
-           print "Encountered some data:", data
+           print "Data     :", data
+       def handle_comment(self, data):
+           print "Comment  :", data
+       def handle_entityref(self, name):
+           c = unichr(name2codepoint[name])
+           print "Named ent:", c
+       def handle_charref(self, name):
+           if name.startswith('x'):
+               c = unichr(int(name[1:], 16))
+           else:
+               c = unichr(int(name))
+           print "Num ent  :", c
+       def handle_decl(self, data):
+           print "Decl     :", data
+   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
+   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
+   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
+
+Parsing an element with a few attributes and a title::
+
+   Start tag: img
+        attr: ('src', 'python-logo.png')
+        attr: ('alt', 'The Python logo')
+   >>> parser.feed('<h1>Python</h1>')
+   Start tag: h1
+   Data     : Python
+   End tag  : h1
+
+The content of ``script`` and ``style`` elements is returned as is, without
+further parsing::
+
+   Start tag: style
+        attr: ('type', 'text/css')
+   Data     : #python { color: green }
+   End tag  : style
+   Start tag: script
+        attr: ('type', 'text/javascript')
+   Data     : alert("hello!");
+   End tag  : script
+
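The inputs that produce the output above were lost from this view of the patch. The sketch below uses assumed inputs that are consistent with that output, and assumes Python 3's ``html.parser``; ``RawTextCollector`` is a hypothetical name.

```python
from html.parser import HTMLParser  # Python 3 location (assumption)


class RawTextCollector(HTMLParser):
    """Hypothetical helper: records tag names and the raw data between them."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.data = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # content of <style> and <script> arrives here unparsed
        self.data.append(data)


raw = RawTextCollector()
# Assumed inputs, chosen to match the output shown in the patch
raw.feed('<style type="text/css">#python { color: green }</style>')
raw.feed('<script type="text/javascript">alert("hello!");</script>')
raw.close()
print(raw.tags)
print("".join(raw.data))
```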
+Parsing named and numeric character references and converting them to the
+correct char (note: these 3 references are all equivalent to ``'>'``)::
+
+Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
+:meth:`~HTMLParser.handle_data` might be called more than once::
+
+   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
+   ...     parser.feed(chunk)
+   ...
+   Start tag: span
+   Data     : buff
+   Data     : ered
+   Data     : text
+   End tag  : span
+
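Because ``handle_data`` may be called several times for one run of text, a subclass that needs the whole text usually buffers the pieces and joins them at the end tag. The following is one possible sketch of that pattern, assuming Python 3's ``html.parser``; ``TextJoiner`` and its attributes are hypothetical names.

```python
from html.parser import HTMLParser  # Python 3 location (assumption)


class TextJoiner(HTMLParser):
    """Hypothetical helper: joins data pieces that belong to one element."""

    def __init__(self):
        super().__init__()
        self._pieces = []  # data received since the last start tag
        self.texts = []    # (tag, complete text) pairs

    def handle_starttag(self, tag, attrs):
        self._pieces = []

    def handle_data(self, data):
        # may be called with partial text when input arrives in chunks
        self._pieces.append(data)

    def handle_endtag(self, tag):
        self.texts.append((tag, "".join(self._pieces)))
        self._pieces = []


joiner = TextJoiner()
for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    joiner.feed(chunk)
joiner.close()
print(joiner.texts)
```

This simple version resets the buffer at every start tag, so it only suits flat markup like the example; nested elements would need a stack of buffers.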
+Parsing invalid HTML (e.g. unquoted attributes) also works::
+