cpython: 1850f45f6169
--- a/Doc/library/html.parser.rst
+++ b/Doc/library/html.parser.rst
@@ -19,14 +19,15 @@ parsing text files formatted in HTML (Hy
.. class:: HTMLParser(strict=True)

    Create a parser instance.  If *strict* is ``True`` (the default), invalid
    HTML results in :exc:`~html.parser.HTMLParseError` exceptions [#]_.  If
    *strict* is ``False``, the parser uses heuristics to make a best guess at
    the intention of any invalid HTML it encounters, similar to the way most
    browsers do.  Using ``strict=False`` is advised.

-   An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags
-   begin and end.  The :class:`HTMLParser` class is meant to be overridden by the
-   user to provide a desired behavior.
+   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
+   when start tags, end tags, text, comments, and other markup elements are
+   encountered.  The user should subclass :class:`.HTMLParser` and override its
+   methods to implement the desired behavior.

    This parser does not check that end tags match start tags or call the end-tag
    handler for elements which are closed implicitly by closing an outer element.
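The note about unmatched end tags can be illustrated with a small sketch. It assumes a recent Python 3, where the ``strict`` argument shown in this diff no longer exists, so the parser is created without it. ``TagLogger`` is a hypothetical name, not part of the module.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical subclass that records tag events in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = TagLogger()
# <i> is never closed explicitly; the parser does not synthesize an end tag.
logger.feed("<b><i>text</b>")
logger.close()
print(logger.events)
```

Only the end tag that actually appears in the input is reported. Any tag balancing has to be done by the subclass.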
@@ -39,25 +40,61 @@ An exception is defined as well:
.. exception:: HTMLParseError

    Exception raised by the :class:`HTMLParser` class when it encounters an error
-   while parsing.  This exception provides three attributes: :attr:`msg` is a brief
-   message explaining the error, :attr:`lineno` is the number of the line on which
-   the broken construct was detected, and :attr:`offset` is the number of
-   characters into the line at which the construct starts.
+   while parsing and *strict* is ``True``.  This exception provides three
+   attributes: :attr:`msg` is a brief message explaining the error,
+   :attr:`lineno` is the number of the line on which the broken construct was
+   detected, and :attr:`offset` is the number of characters into the line at
+   which the construct starts.
+
+
+Example HTML Parser Application
+-------------------------------
+
+As a basic example, below is a simple HTML parser that uses the
+:class:`HTMLParser` class to print out start tags, end tags, and data
+as they are encountered::
+
+   from html.parser import HTMLParser
+
+   class MyHTMLParser(HTMLParser):
+       def handle_starttag(self, tag, attrs):
+           print("Encountered a start tag:", tag)
+       def handle_endtag(self, tag):
+           print("Encountered an end tag :", tag)
+       def handle_data(self, data):
+           print("Encountered some data  :", data)
+
+   parser = MyHTMLParser(strict=False)
+   parser.feed('<html><head><title>Test</title></head>'
+               '<body><h1>Parse me!</h1></body></html>')
+
+The output will then be::
+
+   Encountered a start tag: html
+   Encountered a start tag: head
+   Encountered a start tag: title
+   Encountered some data  : Test
+   Encountered an end tag : title
+   Encountered an end tag : head
+   Encountered a start tag: body
+   Encountered a start tag: h1
+   Encountered some data  : Parse me!
+   Encountered an end tag : h1
+   Encountered an end tag : body
+   Encountered an end tag : html
+
+
+:class:`.HTMLParser` Methods
+----------------------------
 :class:`HTMLParser` instances have the following methods:
-.. method:: HTMLParser.reset()
-
 .. method:: HTMLParser.feed(data)

    Feed some text to the parser.  It is processed insofar as it consists of
    complete elements; incomplete data is buffered until more data is fed or
.. method:: HTMLParser.close()
@@ -68,6 +105,12 @@ An exception is defined as well:
    the :class:`HTMLParser` base class method :meth:`close`.
+.. method:: HTMLParser.reset()
+
+
.. method:: HTMLParser.getpos()

    Return current line number and offset.
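As a rough illustration of ``getpos()``, the sketch below records the position reported while each start tag is being handled. It assumes a recent Python 3 (no ``strict`` argument). ``PositionTracker`` is a hypothetical name.

```python
from html.parser import HTMLParser


class PositionTracker(HTMLParser):
    """Hypothetical subclass that stores (tag, (line, offset)) pairs."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) tuple for the current construct.
        self.positions.append((tag, self.getpos()))


tracker = PositionTracker()
tracker.feed("<html>\n  <body>")
tracker.close()
print(tracker.positions)
```

In this sketch line numbers start at 1 and offsets at 0, so the ``body`` tag on the second line, after two spaces, is reported at ``(2, 2)``.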
@@ -81,23 +124,35 @@ An exception is defined as well:
attributes can be preserved, etc.).
+The following methods are called when data or markup elements are encountered
+and they are meant to be overridden in a subclass. The base class
+implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
+
+
.. method:: HTMLParser.handle_starttag(tag, attrs)

-   This method is called to handle the start of a tag.  It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
+   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

    The *tag* argument is the name of the tag converted to lower case. The *attrs*
    argument is a list of ``(name, value)`` pairs containing the attributes found
    inside the tag's ``<>`` brackets.  The *name* will be translated to lower case,
    and quotes in the *value* have been removed, and character and entity references
-   have been replaced.  For instance, for the tag ``<A
-   HREF="http://www.cwi.nl/">``, this method would be called as
-   ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
+   have been replaced.
+
+   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
+   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

    All entity references from :mod:`html.entities` are replaced in the attribute
    values.
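A minimal sketch of overriding ``handle_starttag`` to pull one attribute out of the ``(name, value)`` list, using the tag from the description above. ``LinkCollector`` is a hypothetical name, and the parser is created without ``strict`` on the assumption of a recent Python 3.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased; attrs is a list of pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A>')
collector.close()
print(collector.links)
```

Because the names are already lower-cased, the comparison against ``"a"`` and ``"href"`` works even though the input uses upper case.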
+.. method:: HTMLParser.handle_endtag(tag)
+
+   This method is called to handle the end tag of an element (e.g. ``</div>``).
+
+   The *tag* argument is the name of the tag converted to lower case.
+
+
.. method:: HTMLParser.handle_startendtag(tag, attrs)

    Similar to :meth:`handle_starttag`, but called when the parser encounters an
@@ -106,57 +161,46 @@ An exception is defined as well:
    implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
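The default behaviour of ``handle_startendtag`` can be observed with a sketch like the following. It assumes that a self-closing tag such as ``<br/>`` is what triggers the method in current CPython, and that the parser is created without ``strict``. ``EventRecorder`` is a hypothetical name.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical subclass that does not override handle_startendtag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


recorder = EventRecorder()
# The base handle_startendtag forwards to the two handlers above.
recorder.feed("<br/>")
recorder.close()
print(recorder.events)
```

A subclass that needs to tell ``<br/>`` apart from ``<br></br>`` would override ``handle_startendtag`` itself.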
-.. method:: HTMLParser.handle_endtag(tag)
-
-   This method is called to handle the end tag of an element.  It is intended to be
-   overridden by a derived class; the base class implementation does nothing.  The
-   *tag* argument is the name of the tag converted to lower case.
-
-
 .. method:: HTMLParser.handle_data(data)

-   This method is called to process arbitrary data (e.g. the content of
-   ``<script>...</script>`` and ``<style>...</style>``).  It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
-
-
-.. method:: HTMLParser.handle_charref(name)
-
-   This method is called to process a character reference of the form ``&#ref;``.
-   It is intended to be overridden by a derived class; the base class
-   implementation does nothing.
+   This method is called to process arbitrary data (e.g. text nodes and the
+   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

-   This method is called to process a general entity reference of the form
-   ``&name;`` where *name* is an general entity reference.  It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
+   This method is called to process a named character reference of the form
+   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
+   (e.g. ``'gt'``).
+
+
+.. method:: HTMLParser.handle_charref(name)
+
+   This method is called to process decimal and hexadecimal numeric character
+   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
+   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
+   in this case the method will receive ``'62'`` or ``'x3E'``.


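The two reference handlers can be exercised together with a sketch like this one. It assumes Python 3.4 or later, where the constructor accepts ``convert_charrefs``; in recent versions that argument defaults to ``True`` and the parser then converts references itself instead of calling these handlers, so the sketch turns it off. ``RefDecoder`` is a hypothetical name.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefDecoder(HTMLParser):
    """Hypothetical subclass that decodes references into characters."""

    def __init__(self):
        # Assumption: convert_charrefs=False keeps the reference handlers active.
        super().__init__(convert_charrefs=False)
        self.chars = []

    def handle_entityref(self, name):
        # name is the text between '&' and ';', e.g. 'gt'.
        self.chars.append(chr(name2codepoint[name]))

    def handle_charref(self, name):
        # name is '62' for &#62; and 'x3E' for &#x3E;.
        if name.startswith(("x", "X")):
            self.chars.append(chr(int(name[1:], 16)))
        else:
            self.chars.append(chr(int(name)))


decoder = RefDecoder()
decoder.feed("&gt;&#62;&#x3E;")
decoder.close()
print(decoder.chars)
```

All three references decode to the same character, ``'>'``.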
.. method:: HTMLParser.handle_comment(data)

-   This method is called when a comment is encountered.  The *comment* argument is
-   a string containing the text between the ``--`` and ``--`` delimiters, but not
-   the delimiters themselves.  For example, the comment ``<!--text-->`` will cause
-   this method to be called with the argument ``'text'``.  It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
+   This method is called when a comment is encountered (e.g. ``<!--comment-->``).
+
+   For example, the comment ``<!-- comment -->`` will cause this method to be
+   called with the argument ``' comment '``.
+
+   The content of Internet Explorer conditional comments (condcoms) will also be
+   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
+   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


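A short sketch of what ``handle_comment`` receives for the two comment forms described above, assuming a recent Python 3 where the parser takes no ``strict`` argument. ``CommentCollector`` is a hypothetical name.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical subclass that stores the text of every comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data excludes the <!-- and --> delimiters but keeps inner whitespace.
        self.comments.append(data)


comments = CommentCollector()
comments.feed("<!-- comment -->")
comments.feed("<!--[if IE 9]>IE9-specific content<![endif]-->")
comments.close()
print(comments.comments)
```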
.. method:: HTMLParser.handle_decl(decl)

-   Method called when an SGML ``doctype`` declaration is read by the parser.
-   The *decl* parameter will be the entire contents of the declaration inside
-   the ``<!...>`` markup.  It is intended to be overridden by a derived class;
-   the base class implementation does nothing.
-
-
-.. method:: HTMLParser.unknown_decl(data)
-
-   Method called when an unrecognized SGML declaration is read by the parser.
-   The *data* parameter will be the entire contents of the declaration inside
-   the ``<!...>`` markup.  It is sometimes useful to be overridden by a
-   derived class; the base class implementation raises an :exc:`HTMLParseError`.
+   The *decl* parameter will be the entire contents of the declaration inside
+   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


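A minimal sketch of ``handle_decl`` receiving the contents of a doctype declaration, again assuming a recent Python 3 without the ``strict`` argument. ``DeclCollector`` is a hypothetical name.

```python
from html.parser import HTMLParser


class DeclCollector(HTMLParser):
    """Hypothetical subclass that stores declaration contents."""

    def __init__(self):
        super().__init__()
        self.decls = []

    def handle_decl(self, decl):
        # decl is the text between '<!' and '>'.
        self.decls.append(decl)


declarations = DeclCollector()
declarations.feed("<!DOCTYPE html>")
declarations.close()
print(declarations.decls)
```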
.. method:: HTMLParser.handle_pi(data)
@@ -174,29 +218,123 @@ An exception is defined as well:
    cause the ``'?'`` to be included in *data*.
-.. _htmlparser-example:
+.. method:: HTMLParser.unknown_decl(data)
+
+   This method is called when an unrecognized declaration is read by the parser.
+
+   The *data* parameter will be the entire contents of the declaration inside
+   the ``<![...]>`` markup.  It is sometimes useful to be overridden by a
+   derived class.  The base class implementation raises an :exc:`HTMLParseError`
+   when *strict* is ``True``.
-Example HTML Parser Application
--------------------------------
+
+.. _htmlparser-examples:
-As a basic example, below is a simple HTML parser that uses the
-:class:`HTMLParser` class to print out start tags, end tags, and data
-as they are encountered::
+Examples
+--------
+
+The following class implements a parser that will be used to illustrate more
+examples::
from html.parser import HTMLParser
+   from html.entities import name2codepoint

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
-           print("Encountered a start tag:", tag)
+           print("Start tag:", tag)
+           for attr in attrs:
+               print("     attr:", attr)
        def handle_endtag(self, tag):
-           print("Encountered an end tag:", tag)
+           print("End tag  :", tag)
        def handle_data(self, data):
-           print("Encountered some data:", data)
+           print("Data     :", data)
+       def handle_comment(self, data):
+           print("Comment  :", data)
+       def handle_entityref(self, name):
+           c = chr(name2codepoint[name])
+           print("Named ent:", c)
+       def handle_charref(self, name):
+           if name.startswith('x'):
+               c = chr(int(name[1:], 16))
+           else:
+               c = chr(int(name))
+           print("Num ent  :", c)
+       def handle_decl(self, data):
+           print("Decl     :", data)
+   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
+   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
+   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
+
+Parsing an element with a few attributes and a title::
+
+   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
+   Start tag: img
+        attr: ('src', 'python-logo.png')
+        attr: ('alt', 'The Python logo')
+   >>>
+   >>> parser.feed('<h1>Python</h1>')
+   Start tag: h1
+   Data     : Python
+   End tag  : h1
+
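+The content of ``script`` and ``style`` elements is returned as is, without
+further parsing::
+
+   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
+   Start tag: style
+        attr: ('type', 'text/css')
+   Data     : #python { color: green }
+   End tag  : style
+   >>>
+   >>> parser.feed('<script type="text/javascript">'
+   ...             'alert("<strong>hello!</strong>");</script>')
+   Start tag: script
+        attr: ('type', 'text/javascript')
+   Data     : alert("<strong>hello!</strong>");
+   End tag  : script
+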
+Parsing named and numeric character references and converting them to the
+correct char (note: these 3 references are all equivalent to ``'>'``)::
+Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
+:meth:`~HTMLParser.handle_data` might be called more than once::
+
+   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
+   ...     parser.feed(chunk)
+   ...
+   Start tag: span
+   Data     : buff
+   Data     : ered
+   Data     : text
+   End tag  : span
+
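Since ``handle_data`` may fire several times for one text node, one possible approach is to collect the pieces and join them once parsing is finished. The sketch below assumes a recent Python 3 (no ``strict`` argument). ``TextJoiner`` is a hypothetical name.

```python
from html.parser import HTMLParser


class TextJoiner(HTMLParser):
    """Hypothetical subclass that accumulates text fragments."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # May be called once per fed chunk, so store fragments for later joining.
        self.parts.append(data)


joiner = TextJoiner()
for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    joiner.feed(chunk)
joiner.close()  # flush anything still buffered

text = ''.join(joiner.parts)
print(text)
```

Joining at the end keeps the result independent of how the input happened to be split into chunks.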
+Parsing invalid HTML (e.g. unquoted attributes) also works::
+