jsoup release 1.18.1 (2024-Jul-10)
jsoup Java HTML Parser release 1.18.1
Jul 10, 2024
jsoup 1.18.1 is out now, with a new streaming parser that provides a hybrid DOM + SAX event-driven parsing interface, request progress tracking, and many other improvements.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Stream Parser: A [StreamParser](/apidocs/org/jsoup/parser/StreamParser "A StreamParser provides a progressive parse of its input.") provides a progressive parse of its input. For URL requests, it is available via [Connection.Response.streamParser()](/apidocs/org/jsoup/Connection.Response#streamParser%28%29 "Returns a StreamParser that will parse the Response progressively."). As each [Element](/apidocs/org/jsoup/nodes/Element "An HTML Element consists of a tag name, attributes, and child nodes (including text nodes and other elements).") is completed, it is emitted via a `Stream` or `Iterator` interface. Elements returned will be complete with all their children, and an (empty) next sibling, if applicable. Elements (or their children) may be removed from the DOM during the parse, e.g. to conserve memory. This provides a mechanism to parse an input document that would otherwise be too large to fit into memory, while still providing a DOM interface to the document and its elements. Additionally, the parser provides [selectFirst(String query)](/apidocs/org/jsoup/nodes/Element#selectFirst%28java.lang.String%29 "Find the first Element that matches the Selector CSS query, with this element as the starting context.") / [selectNext(String query)](/apidocs/org/jsoup/parser/StreamParser#selectNext%28java.lang.String%29 "Finds the next Element that matches the provided query.") methods, which run the parser until a hit is found, at which point the parse is suspended. It can be resumed via another `select()` call, or via the [stream()](/apidocs/org/jsoup/nodes/Element#stream%28%29 "Returns a Stream of this Element and all of its descendant Elements.") or [iterator()](/apidocs/org/jsoup/nodes/Element#iterator%28%29 "Returns an Iterator that iterates this Element and each of its descendant Elements, in document order.") methods. #2096 (with examples)
- Download Progress: added a Response [Progress](/apidocs/org/jsoup/Progress) event interface, which reports progress as URLs are downloaded (and parsed). Set via [Connection.onResponseProgress()](/apidocs/org/jsoup/Connection#onResponseProgress%28org.jsoup.Progress%29 "Set the response progress handler, which will be called periodically as the response body is downloaded."). Supported at both the session level and the single connection level. #2164, #656
- Added `Path` accepting parse methods: [Jsoup.parse(Path)](/apidocs/org/jsoup/Jsoup#parse%28java.nio.file.Path%29 "Parse the contents of a file as HTML."), `Jsoup.parse(path, charsetName, baseUri, parser)`, etc. #2055
- Updated the `button` tag configuration to include a space between multiple button elements in the [Element.text()](/apidocs/org/jsoup/nodes/Element#text%28%29 "Gets the normalized, combined text of this element and all its children.") method. #2105
- Added support for the `ns|*` all elements in namespace Selector. #1811
- When normalising attribute names during serialization, invalid characters are now replaced with `_`, vs being stripped. This should make the process clearer, and generally prevent an invalid attribute name being coerced unexpectedly. #2143
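The suspend-and-resume behaviour of the stream parser described above can be illustrated with a minimal sketch. It assumes jsoup 1.18.1 or later is on the classpath; the class name `StreamParserSketch`, the helper `itemTexts`, and the sample HTML are hypothetical and only serve the illustration. It parses from a String rather than a URL so that it runs without network access.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; not part of jsoup.
public class StreamParserSketch {

    // Reads the text of each <li> as soon as it is completed, then removes
    // the element from the DOM so it does not accumulate in memory.
    static List<String> itemTexts(String html) throws IOException {
        List<String> texts = new ArrayList<>();
        // An empty base URI is an assumption that suits this in-memory input.
        try (StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            Element li;
            // selectNext() runs the parse until the next match, then suspends;
            // it returns null once the input is exhausted.
            while ((li = streamer.selectNext("li")) != null) {
                texts.add(li.text());
                li.remove();
            }
        }
        return texts;
    }

    public static void main(String[] args) throws IOException {
        String html = "<ul><li>One</li><li>Two</li><li>Three</li></ul>";
        for (String text : itemTexts(html)) {
            System.out.println(text);
        }
    }
}
```

For a URL request, the same loop would presumably start from `Jsoup.connect(url).execute().streamParser()` instead of constructing the parser directly.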
Changes
- Removed previously deprecated internal classes and methods. #2094
- Build change: the built jar’s OSGi manifest no longer imports itself. #2158
Bug Fixes
- When tracking source positions, if the first node was a TextNode, its position was incorrectly set to `-1`. #2106
- When connecting (or redirecting) to URLs with characters such as `{`, `}` in the path, a Malformed URL exception would be thrown (if in development), or the URL might otherwise not be escaped correctly (if in production). The URL encoding process has been improved to handle these characters correctly. #2142
- When using [W3CDom](/apidocs/org/jsoup/helper/W3CDom "Helper class to transform a Document to a org.w3c.dom.Document, for integration with toolsets that use the W3C DOM.") with a custom output Document, a Null Pointer Exception would be thrown. #2114
- The `:has()` selector did not match correctly when using sibling combinators (e.g. `h1:has(+h2)`). #2137
- The `:empty` selector incorrectly matched elements that started with a blank text node and were followed by non-empty nodes, due to an incorrect short-circuit. #2130
- [Element.cssSelector()](/apidocs/org/jsoup/nodes/Element#cssSelector%28%29 "Get a CSS selector that will uniquely select this element.") would fail with “Did not find balanced marker” when building a selector for elements that had a `(` or `[` in their class names. And selectors with those characters escaped would not match as expected. #2146
- Updated `Entities.escape(string)` to make the escaped text suitable for both text nodes and attributes (previously was only for text nodes). This does not impact the output of [Element.html()](/apidocs/org/jsoup/nodes/Element#html%28%29 "Retrieves the element's inner HTML.") which correctly applies a minimal escape depending on if the use will be for text data or in a quoted attribute. #1278
- Fuzz: a Stack Overflow exception could occur when resolving a crafted `<base href>` URL, in the normalizing regex. #2165
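Two of the fixes above are easy to observe directly. The following is a small sketch, assuming jsoup 1.18.1 or later on the classpath; the class name `FixesSketch`, its helper methods, and the sample inputs are hypothetical. It selects headings with a sibling combinator inside `:has()`, and escapes a string that contains both markup characters and double quotes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.select.Elements;

// Hypothetical example class; not part of jsoup.
public class FixesSketch {

    // Finds each h1 that is immediately followed by an h2 sibling.
    static Elements headingsFollowedBySubheading(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("h1:has(+h2)");
    }

    // Escapes a raw string; per the note above, the result is intended to be
    // usable in both text nodes and quoted attribute values.
    static String escapeForTextOrAttribute(String raw) {
        return Entities.escape(raw);
    }

    public static void main(String[] args) {
        String html = "<h1>One</h1><h2>Sub</h2><h1>Two</h1><p>Body</p>";
        for (String text : headingsFollowedBySubheading(html).eachText()) {
            System.out.println(text);
        }
        System.out.println(escapeForTextOrAttribute("a < b & \"c\""));
    }
}
```

On earlier versions, the selector may not match as expected and the double quotes may be left unescaped, which is the behaviour these fixes address.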
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via jsoup discussions, or with me directly.
You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.