Progressive parsing with StreamParser #2096

Merged 14 commits from stream-parser into master on Jan 5, 2024

Conversation

@jhy (Owner) commented Jan 4, 2024

A StreamParser provides a progressive parse of its input. As each Element is completed, it is emitted via a Stream or Iterator interface. Elements returned will be complete with all their children, and an (empty) next sibling, if applicable.

Elements (or their children) may be removed from the DOM during the parse, e.g. to conserve memory, providing a mechanism to parse an input document that would otherwise be too large to fit into memory, while still providing a DOM interface to the document and its elements.

Additionally, the parser provides selectFirst(String query) and selectNext(String query) methods, which run the parser until a hit is found, at which point the parse is suspended. It can be resumed via another select() call, or via the stream() or iterator() methods.
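For illustration, a minimal sketch of that suspend-and-resume flow (the input HTML here is made up):

static void suspendResume() {
    StreamParser streamer = new StreamParser(Parser.htmlParser())
        .parse("<title>One</title><h1>Two</h1><p>Three</p><p>Four</p>", "");
    Element h1 = streamer.selectFirst("h1"); // runs the parser only until an <h1> is completed
    if (h1 != null)
        System.out.println(h1.text());
    streamer.stream() // resumes the parse from where it was suspended
        .forEach(el -> System.out.println(el.tagName()));
}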

Once the input has been fully read, the input Reader will be closed. Alternatively, if the whole document does not need to be read, call stop() and close().

The document() method will return the Document being parsed into, which will be only partially complete until the input is fully consumed.

A StreamParser can be reused via a new parse(Reader, String), but is not thread-safe for concurrent inputs. New parsers should be used in each thread.
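A brief sketch of that reuse (the inputs are made up):

static void reuseParser() {
    StreamParser streamer = new StreamParser(Parser.htmlParser());
    for (String html : new String[] {"<p>One</p>", "<p>Two</p>"}) {
        streamer.parse(html, "https://example.com/"); // reset the parser for the new input
        streamer.stream().forEach(el -> System.out.println(el.tagName()));
    }
}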

If created via Connection.Response#streamParser(), or another Reader that is I/O backed, the iterator and stream consumers will throw a java.io.UncheckedIOException if the underlying Reader errors during read.
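A consumer that wants a checked exception again can unwrap it, for example (a sketch; the url parameter is a placeholder):

static void guardedRead(String url) throws IOException {
    try (StreamParser streamer = Jsoup.connect(url).execute().streamParser()) {
        try {
            streamer.stream().forEach(el -> System.out.println(el.tagName()));
        } catch (UncheckedIOException e) {
            throw e.getCause(); // rethrow the underlying read failure as a checked IOException
        }
    }
}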

The StreamParser interface is currently in beta and may change in subsequent releases. Feedback on the feature and how you're using it is very welcome via the jsoup discussions.

Examples

Process a file in chunks

Assume we have a file containing a series of <book> chunks, each with many <chapter> elements; loading it all into the DOM at once might run out of memory. Instead, we can process the file in chunks by iterating on selectNext(cssquery):

static void streamChunks(Path path) throws IOException {
    try (StreamParser streamer = DataUtil.streamParser(
        path, StandardCharsets.UTF_8, "https://example.com", Parser.xmlParser())) {

        Element el;
        var seenChunks = 0;
        while ((el = streamer.selectNext("book")) != null) {
            // do something more useful! The element will have all its children elements
            Elements chapters = el.select("chapter");
            el.remove(); // remove this chunk once used to keep DOM light and not run out of memory
            seenChunks++;
        }

        Document doc = streamer.document(); // the completed doc, will just be a shell
        log("Title", doc.expectFirst("title"));
        log("Seen chunks", seenChunks);
    }
}

Parse just the meta data of a website

Assume we are building a link preview tool. All the data we need is in the head section of a page, and so there's no need to fetch and parse the complete page. This example will fetch a given URL, parse only the <head> contents and use those, and then cleanly close the request:

static void selectMeta(String url) throws IOException {
    try (StreamParser streamer = Jsoup.connect(url).execute().streamParser()) {
        Element head = streamer.selectFirst("head");
        if (head == null) return;

        log("Title", head.select("title").text());
        log("Description", head.select("meta[name=description]").attr("content"));
        log("Image", head.select("meta[name=twitter:image]").attr("content"));
    }
}

Minify the loaded DOM by removing empty text nodes

This example shows a way to progressively parse an input and remove redundant blank text nodes during the parse, resulting in a (slightly) minified DOM:

static void minifyDocument() {
    String html = "<table><tr> <td>a</td> <td>a</td> <td>a</td> <td>a</td> </tr>";
    StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "");

    streamer.stream()
        .filter(Element::isBlock)
        .forEach(el -> {
            List<TextNode> textNodes = el.textNodes();
            for (TextNode textNode : textNodes) {
                if (textNode.isBlank())
                    textNode.remove();
            }
        });

    Document minified = streamer.document();
    System.out.println(minified.body());
}

@jhy added this to the 1.18.1 milestone Jan 4, 2024
From the commit notes: most users of the StreamParser will be parsing from an InputStream (disk I/O or network access), and so reads are liable to throw. The StreamParser is AutoCloseable and will be used in a try-with-resources block, so there is no extra burden to catch these.
@jhy added the improvement label Jan 5, 2024
@jhy self-assigned this Jan 5, 2024
@jhy merged commit 2b443df into master Jan 5, 2024 (12 checks passed)
@jhy deleted the stream-parser branch January 5, 2024 00:14
@jhy (Owner, Author) commented Jan 5, 2024

All: if you're interested in this feature, it would be great if you could try it out before the next release by installing a snapshot version and bashing on it. Please comment here with any feedback (what works / doesn't work) and any suggestions on the API.

See details for building a snapshot on the download page.

@821938089 (Contributor) commented:
Missing parseBodyFragment.

Is it possible to implement the removal of unmatched nodes during the select process?

@jhy (Owner, Author) commented Jan 7, 2024

@821938089 can you give a step-by-step example of what you mean by removing unmatched nodes? Or a sample implementation of selectNext(Evaluator eval) that would do it? It seems like it would be easy to trash the document.

@821938089 (Contributor) commented:
There seems to be no way to immediately remove unmatched nodes.
I have an imperfect way to do this: keep a bounded queue of parsed nodes, add each newly parsed node to it, and remove the oldest node once the queue exceeds its limit.
If a matching node is found, remove it from the queue (so it is kept).
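One reading of that suggestion, as a sketch (the helper, query, and cap are hypothetical, not part of the PR):

static void boundedParse(StreamParser streamer, String keepQuery, int cap) {
    Deque<Element> recent = new ArrayDeque<>();
    Iterator<Element> it = streamer.iterator();
    while (it.hasNext()) {
        Element el = it.next();
        if (el.is(keepQuery)) continue; // matched nodes stay out of the eviction queue
        recent.addLast(el);
        if (recent.size() > cap) {
            // evict the oldest unmatched node; note that an evicted ancestor takes its
            // descendants with it, i.e. the "collateral damage" discussed below
            recent.removeFirst().remove();
        }
    }
}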

@jhy (Owner, Author) commented Jan 8, 2024

It's not clear to me what you are trying to get. Can you give me a step-by-step of the input -> selector -> parsed DOM -> removal?

If we just removed every non-matching node after a select, that's going to cause a lot of collateral damage.

@821938089 (Contributor) commented:
If you have a very large document and the things you're interested in are scattered throughout it, or sit at its end, parsing will cause an OOM because the entire document ends up in memory.
Your implementation of select looks like it's streaming, but it uses more and more memory.
So I was wondering if there is a way to remove the unmatched nodes to prevent OOM.

@jhy (Owner, Author) commented Jan 9, 2024

Yes, I am very aware of the risk of OOM; hence the el.remove() call in the selectNext example above (Process a file in chunks). The document() is also available for other removals. But I am not sure of a safe way to include an "auto-remove" kind of option in a generic way, which is why I am asking for a stepped example of what kind of input you have, your selector, and which elements you want removed.

I don't think auto-removing the unmatched nodes makes sense in all circumstances, as that could remove elements above and below the selected element, or there may be siblings in the tree that are required later.

One option might be to remove the previously selected & returned elements. That would help in the chunk-processing example above. But these options don't seem very general.
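That last option could look something like this caller-side sketch (a hypothetical helper, not part of the PR):

static void selectAndDiscard(StreamParser streamer, String query) {
    Element prev = null;
    Element el;
    while ((el = streamer.selectNext(query)) != null) {
        if (prev != null)
            prev.remove(); // discard the previous hit before moving on, to bound memory
        // use el here; it is still attached, with all its children
        prev = el;
    }
    if (prev != null)
        prev.remove(); // discard the final hit once done
}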

jhy added a commit that referenced this pull request Jan 10, 2024
@jhy (Owner, Author) commented Jan 10, 2024

I've added fragment parse options (and a completeFragment to pick up the relevant nodes) with 1f1f72d
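A sketch of using those fragment options (method shapes as of the linked commit; the API is in beta and may change):

static void fragmentExample() throws IOException {
    Element context = Document.createShell("").body(); // context element for the fragment
    StreamParser streamer = new StreamParser(Parser.htmlParser())
        .parseFragment("<p>One</p><p>Two</p>", context, "");
    streamer.stream().forEach(el -> System.out.println(el.tagName()));
    List<Node> nodes = streamer.completeFragment(); // the fragment's top-level nodes
    System.out.println(nodes.size());
}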
