html package - golang.org/x/net/html - Go Packages

Package html implements an HTML5-compliant tokenizer and parser.

Tokenization is done by creating a Tokenizer for an io.Reader r. It is the caller's responsibility to ensure that r provides UTF-8 encoded HTML.

z := html.NewTokenizer(r)

Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(), which parses the next token and returns its type, or an error:

for {
	tt := z.Next()
	if tt == html.ErrorToken {
		// ...
		return ...
	}
	// Process the current token.
}

There are two APIs for retrieving the current token. The high-level API is to call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs allow optionally calling Raw after Next but before Token, Text, TagName, or TagAttr. In EBNF notation, the valid call sequence per token is:

Next {Raw} [ Token | Text | TagName {TagAttr} ]

Token returns an independent data structure that completely describes a token. Entities (such as "&lt;") are unescaped, tag names and attribute keys are lower-cased, and attributes are collected into a []Attribute. For example:

for {
	if z.Next() == html.ErrorToken {
		// Returning io.EOF indicates success.
		return z.Err()
	}
	emitToken(z.Token())
}

The low-level API performs fewer allocations and copies, but the contents of the []byte values returned by Text, TagName and TagAttr may change on the next call to Next. For example, to extract an HTML page's anchor text:

depth := 0
for {
	tt := z.Next()
	switch tt {
	case html.ErrorToken:
		return z.Err()
	case html.TextToken:
		if depth > 0 {
			// emitBytes should copy the []byte it receives,
			// if it doesn't process it immediately.
			emitBytes(z.Text())
		}
	case html.StartTagToken, html.EndTagToken:
		tn, _ := z.TagName()
		if len(tn) == 1 && tn[0] == 'a' {
			if tt == html.StartTagToken {
				depth++
			} else {
				depth--
			}
		}
	}
}

Parsing is done by calling Parse with an io.Reader, which returns the root of the parse tree (the document element) as a *Node. It is the caller's responsibility to ensure that the Reader provides UTF-8 encoded HTML. For example, to process each anchor node in depth-first order:

doc, err := html.Parse(r)
if err != nil {
	// ...
}
for n := range doc.Descendants() {
	if n.Type == html.ElementNode && n.Data == "a" {
		// Do something with n...
	}
}

The relevant specifications include: https://html.spec.whatwg.org/multipage/syntax.html and https://html.spec.whatwg.org/multipage/syntax.html#tokenization

Security Considerations

Care should be taken when parsing and interpreting HTML, whether full documents or fragments, within the framework of the HTML specification, especially with regard to untrusted inputs.

This package provides both a tokenizer and a parser. The tokenizer implements the tokenization stage of the WHATWG HTML parsing specification; the parser implements both the tokenization and the tree construction stages. While the tokenizer parses and normalizes individual HTML tokens, only the parser constructs the DOM tree from the tokenized HTML, as described in the tree construction stage of the specification, dynamically modifying or extending the document's DOM tree.

If your use case requires semantically well-formed HTML documents, as defined by the WHATWG specification, the parser should be used rather than the tokenizer.

In security contexts, if trust decisions are being made using the tokenized or parsed content, the input must be re-serialized (for instance by using Render or Token.String) in order for those trust decisions to hold, as the process of tokenization or parsing may alter the content.
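A minimal sketch of that pattern, assuming the goal is simply to hand downstream consumers the normalized form of an untrusted input rather than the raw bytes:

package main

import (
	"bytes"
	"fmt"
	"log"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// The raw input can differ from what the parser actually understood,
	// e.g. mis-nested tags are silently repaired during tree construction.
	raw := `<a href="x"><table><a>nested</a></table></a>`

	doc, err := html.Parse(strings.NewReader(raw))
	if err != nil {
		log.Fatal(err)
	}

	// Re-serialize the parse tree; trust decisions (and any downstream
	// consumers) should use this normalized form, not the raw input.
	var buf bytes.Buffer
	if err := html.Render(&buf, doc); err != nil {
		log.Fatal(err)
	}
	fmt.Println(buf.String())
}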


ErrBufferExceeded means that the buffering limit was exceeded.

EscapeString escapes special characters like "<" to become "&lt;". It escapes only five such characters: <, >, &, ' and ". UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.
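For instance, a small sketch of the five-character escaping and the round-trip guarantee:

package main

import (
	"fmt"

	"golang.org/x/net/html"
)

func main() {
	s := `Fish & Chips, 5 < 7`
	e := html.EscapeString(s)
	fmt.Println(e)                           // Fish &amp; Chips, 5 &lt; 7
	fmt.Println(html.UnescapeString(e) == s) // true
}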

Render renders the parse tree n to the given writer.

Rendering is done on a 'best effort' basis: calling Parse on the output of Render will always result in something similar to the original tree, but it is not necessarily an exact clone unless the original tree was 'well-formed'. 'Well-formed' is not easily specified; the HTML5 specification is complicated.

Calling Parse on arbitrary input typically results in a 'well-formed' parse tree. However, it is possible for Parse to yield a 'badly-formed' parse tree. For example, in a 'well-formed' parse tree, no <a> element is a child of another <a> element: parsing "<a><a>" results in two sibling elements. Similarly, in a 'well-formed' parse tree, no <a> element is a child of a <table> element: parsing "<p><table><a>" results in a <p> with two sibling children; the <a> is reparented to the <table>'s parent. However, calling Parse on "<a><table><a>" does not return an error, but the result has an <a> element with an <a> child, and is therefore not 'well-formed'.

Programmatically constructed trees are typically also 'well-formed', but it is possible to construct a tree that looks innocuous but, when rendered and re-parsed, results in a different tree. A simple example is that a solitary text node would become a tree containing <html>, <head> and <body> elements. Another example is that the programmatic equivalent of "a<head>b</head>c" becomes "<html><head></head><body>abc</body></html>".
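A short sketch of that behaviour, parsing a fragment and rendering the repaired tree (the commented output is what the parsing rules above imply, not an authoritative transcript):

package main

import (
	"fmt"
	"log"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// Parse implicitly creates the missing <html>, <head> and <body>
	// elements and closes the unclosed <b> and <p>, so the rendered
	// output is similar to, but not a byte-for-byte copy of, the input.
	doc, err := html.Parse(strings.NewReader("<p>Hello, <b>world"))
	if err != nil {
		log.Fatal(err)
	}
	if err := html.Render(os.Stdout, doc); err != nil {
		log.Fatal(err)
	}
	fmt.Println()
	// Expected: <html><head></head><body><p>Hello, <b>world</b></p></body></html>
}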

UnescapeString unescapes entities like "&lt;" to become "<". It unescapes a larger range of entities than EscapeString escapes. For example, "&aacute;" unescapes to "á", as does "&#225;" and "&#xE1;". UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.
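For instance, a small sketch of the wider entity range, and of why the converse of the round-trip guarantee fails:

package main

import (
	"fmt"

	"golang.org/x/net/html"
)

func main() {
	// Named, decimal and hexadecimal references all decode to "á".
	fmt.Println(html.UnescapeString("&aacute;")) // á
	fmt.Println(html.UnescapeString("&#225;"))   // á
	fmt.Println(html.UnescapeString("&#xE1;"))   // á

	// EscapeString does not re-create "&aacute;": it only escapes
	// <, >, &, ' and ", so escaping the unescaped form yields "á",
	// not the original "&aacute;".
	fmt.Println(html.EscapeString("á")) // á
}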

type Attribute struct { Namespace, Key, Val string }

An Attribute is an attribute namespace-key-value triple. Namespace is non-empty for foreign attributes like xlink, Key is alphabetic (and hence does not contain escapable characters like '&', '<' or '>'), and Val is unescaped (it looks like "a<b" rather than "a&lt;b").

Namespace is only used by the parser, not the tokenizer.

type Node struct {
	Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node

	Type      NodeType
	DataAtom  atom.Atom
	Data      string
	Namespace string
	Attr      []Attribute
}

A Node consists of a NodeType and some Data (tag name for element nodes, content for text) and is part of a tree of Nodes. Element nodes may also have a Namespace and contain a slice of Attributes. Data is unescaped, so that it looks like "a<b" rather than "a&lt;b". For element nodes, DataAtom is the atom for Data, or zero if Data is not a known tag name.

Node trees may be navigated using the link fields (Parent, FirstChild, and so on) or a range loop over iterators such as Node.Descendants.

An empty Namespace implies a "http://www.w3.org/1999/xhtml" namespace. Similarly, "math" is short for "http://www.w3.org/1998/Math/MathML", and "svg" is short for "http://www.w3.org/2000/svg".

Parse returns the parse tree for the HTML from the given Reader.

It implements the HTML5 parsing algorithm (https://html.spec.whatwg.org/multipage/syntax.html#tree-construction), which is very complicated. The resultant tree can contain implicitly created nodes that have no explicit <tag> listed in r's data, and nodes' parents can differ from the nesting implied by a naive processing of start and end <tag>s. Conversely, explicit <tag>s in r's data can be silently dropped, with no corresponding node in the resulting tree.

Parse will reject HTML that is nested deeper than 512 elements.

The input is assumed to be UTF-8 encoded.

package main

import ( "fmt" "log" "strings"

"golang.org/x/net/html"
"golang.org/x/net/html/atom"

)

func main() {
	s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>`
	doc, err := html.Parse(strings.NewReader(s))
	if err != nil {
		log.Fatal(err)
	}
	for n := range doc.Descendants() {
		if n.Type == html.ElementNode && n.DataAtom == atom.A {
			for _, a := range n.Attr {
				if a.Key == "href" {
					fmt.Println(a.Val)
					break
				}
			}
		}
	}
}

Output:

foo
/bar/baz

ParseFragment parses a fragment of HTML and returns the nodes that were found. If the fragment is the InnerHTML for an existing element, pass that element in context.

It has the same intricacies as Parse.
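A minimal sketch of a fragment parse, assuming the fragment is the InnerHTML of a <ul> element (the context node below is constructed by hand for illustration):

package main

import (
	"fmt"
	"log"
	"strings"

	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	// Stand-in for the existing element whose InnerHTML the fragment is.
	ctx := &html.Node{
		Type:     html.ElementNode,
		DataAtom: atom.Ul,
		Data:     "ul",
	}
	nodes, err := html.ParseFragment(strings.NewReader("<li>One<li>Two"), ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range nodes {
		fmt.Println(n.Data) // li, then li
	}
}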

func ParseFragmentWithOptions(r io.Reader, context *Node, opts ...ParseOption) ([]*Node, error)

ParseFragmentWithOptions is like ParseFragment, with options.

ParseWithOptions is like Parse, with options.
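A minimal sketch, assuming ParseOptionEnableScripting (which controls whether <noscript> content is parsed as markup or treated as raw text) is the option being applied:

package main

import (
	"log"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// With scripting disabled, the contents of <noscript> are parsed as
	// ordinary markup instead of being treated as raw text.
	doc, err := html.ParseWithOptions(
		strings.NewReader("<noscript><p>no js</p></noscript>"),
		html.ParseOptionEnableScripting(false),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = doc // walk the tree as with html.Parse
}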

Ancestors returns an iterator over the ancestors of n, starting with n.Parent.

Mutating a Node or its parents while iterating may have unexpected results.
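For example, a short sketch that finds a text node and walks back up towards the document root:

package main

import (
	"fmt"
	"log"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	doc, err := html.Parse(strings.NewReader("<p><b>bold</b></p>"))
	if err != nil {
		log.Fatal(err)
	}
	var text *html.Node
	for n := range doc.Descendants() {
		if n.Type == html.TextNode {
			text = n
			break
		}
	}
	// Prints b, p, body, html and finally the (empty Data) document node.
	for a := range text.Ancestors() {
		fmt.Println(a.Data)
	}
}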

func (n *Node) AppendChild(c *Node)

AppendChild adds a node c as a child of n.

It will panic if c already has a parent or siblings.
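For instance, a small sketch that builds a tree by hand and renders it:

package main

import (
	"log"
	"os"

	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	// Construct <p>hello</p> programmatically.
	p := &html.Node{Type: html.ElementNode, DataAtom: atom.P, Data: "p"}
	p.AppendChild(&html.Node{Type: html.TextNode, Data: "hello"})

	if err := html.Render(os.Stdout, p); err != nil {
		log.Fatal(err)
	}
	// Output: <p>hello</p>
}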

ChildNodes returns an iterator over the immediate children of n, starting with n.FirstChild.

Mutating a Node or its children while iterating may have unexpected results.
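For example, a short sketch that visits only the immediate children of a <ul>:

package main

import (
	"fmt"
	"log"
	"strings"

	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	doc, err := html.Parse(strings.NewReader("<ul><li>a<li>b<li>c</ul>"))
	if err != nil {
		log.Fatal(err)
	}
	for n := range doc.Descendants() {
		if n.DataAtom == atom.Ul {
			for c := range n.ChildNodes() {
				fmt.Println(c.Data) // li, li, li
			}
		}
	}
}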

Descendants returns an iterator over all nodes recursively beneath n, excluding n itself. Nodes are visited in depth-first preorder.

Mutating a Node or its descendants while iterating may have unexpected results.

func (n *Node) InsertBefore(newChild, oldChild *Node)

InsertBefore inserts newChild as a child of n, immediately before oldChild in the sequence of n's children. oldChild may be nil, in which case newChild is appended to the end of n's children.

It will panic if newChild already has a parent or siblings.
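A small sketch of inserting at the front of a child list (passing nil as oldChild would append instead):

package main

import (
	"log"
	"os"

	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	ul := &html.Node{Type: html.ElementNode, DataAtom: atom.Ul, Data: "ul"}
	ul.AppendChild(&html.Node{Type: html.ElementNode, DataAtom: atom.Li, Data: "li"})

	// Insert a new <li> before the existing first child.
	first := &html.Node{Type: html.ElementNode, DataAtom: atom.Li, Data: "li"}
	ul.InsertBefore(first, ul.FirstChild)

	if err := html.Render(os.Stdout, ul); err != nil {
		log.Fatal(err)
	}
	// Output: <ul><li></li><li></li></ul>
}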

func (n *Node) RemoveChild(c *Node)

RemoveChild removes a node c that is a child of n. Afterwards, c will have no parent and no siblings.

It will panic if c's parent is not n.

A NodeType is the type of a Node.

const (
	ErrorNode NodeType = iota
	TextNode
	DocumentNode
	ElementNode
	CommentNode
	DoctypeNode
	RawNode
)

type ParseOption func(p *parser)

ParseOption configures a parser.

A Token consists of a TokenType and some Data (tag name for start and end tags, content for text, comments and doctypes). A tag Token may also contain a slice of Attributes. Data is unescaped for all Tokens (it looks like "a<b" rather than "a&lt;b"). For tag Tokens, DataAtom is the atom for Data, or zero if Data is not a known tag name.

String returns a string representation of the Token.

A TokenType is the type of a Token.

const (
	ErrorToken TokenType = iota
	TextToken
	StartTagToken
	EndTagToken
	SelfClosingTagToken
	CommentToken
	DoctypeToken
)

String returns a string representation of the TokenType.

type Tokenizer struct {
	// contains filtered or unexported fields
}

A Tokenizer returns a stream of HTML Tokens.

NewTokenizer returns a new HTML Tokenizer for the given Reader. The input is assumed to be UTF-8 encoded.

NewTokenizerFragment returns a new HTML Tokenizer for the given Reader, for tokenizing an existing element's InnerHTML fragment. contextTag is that element's tag, such as "div" or "iframe".

For example, how the InnerHTML "a<b" is tokenized depends on whether it is for a <p> tag or a <script> tag.
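A minimal sketch of that difference, printing each token's type and data for the two context tags (inside a <script> element the whole string stays raw text, while in a <p> element the "<" begins markup):

package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	for _, contextTag := range []string{"p", "script"} {
		z := html.NewTokenizerFragment(strings.NewReader("a<b"), contextTag)
		fmt.Printf("%s:", contextTag)
		for {
			tt := z.Next()
			if tt == html.ErrorToken {
				break
			}
			fmt.Printf(" %v %q", tt, z.Token().Data)
		}
		fmt.Println()
	}
}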