proxy.golang.org : golang.org/x/net/html
Package html implements an HTML5-compliant tokenizer and parser. Tokenization is done by creating a Tokenizer for an io.Reader r. It is the caller's responsibility to ensure that r provides UTF-8 encoded HTML. Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(), which parses the next token and returns its type, or an error: There are two APIs for retrieving the current token. The high-level API is to call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs allow optionally calling Raw after Next but before Token, Text, TagName, or TagAttr. In EBNF notation, the valid call sequence per token is: Token returns an independent data structure that completely describes a token. Entities (such as "<") are unescaped, tag names and attribute keys are lower-cased, and attributes are collected into a []Attribute. For example: The low-level API performs fewer allocations and copies, but the contents of the []byte values returned by Text, TagName and TagAttr may change on the next call to Next. For example, to extract an HTML page's anchor text: Parsing is done by calling Parse with an io.Reader, which returns the root of the parse tree (the document element) as a *Node. It is the caller's responsibility to ensure that the Reader provides UTF-8 encoded HTML. For example, to process each anchor node in depth-first order: The relevant specifications include: https://html.spec.whatwg.org/multipage/syntax.html and https://html.spec.whatwg.org/multipage/syntax.html#tokenization Care should be taken when parsing and interpreting HTML, whether full documents or fragments, within the framework of the HTML specification, especially with regard to untrusted inputs. This package provides both a tokenizer and a parser, which implement the tokenization, and tokenization and tree construction stages of the WHATWG HTML parsing specification respectively. While the tokenizer parses and normalizes individual HTML tokens, only the parser constructs the DOM tree from the tokenized HTML, as described in the tree construction stage of the specification, dynamically modifying or extending the document's DOM tree. If your use case requires semantically well-formed HTML documents, as defined by the WHATWG specification, the parser should be used rather than the tokenizer. In security contexts, if trust decisions are being made using the tokenized or parsed content, the input must be re-serialized (for instance by using Render or Token.String) in order for those trust decisions to hold, as the process of tokenization or parsing may alter the content.
Registry
-
Source
- Documentation
- JSON
purl: pkg:golang/golang.org/x/net/html
License: BSD-3-Clause
Latest release: 12 days ago
Namespace: golang.org/x/net
Last synced: 12 days ago