Progressive parsing with StreamParser #2096

Merged 14 commits from stream-parser into master on Jan 5, 2024

Conversation

@jhy (Owner) commented Jan 4, 2024

A StreamParser provides a progressive parse of its input. As each Element is completed, it is emitted via a Stream or Iterator interface. Elements returned will be complete with all their children, and an (empty) next sibling, if applicable.

Elements (or their children) may be removed from the DOM during the parse, e.g. to conserve memory, providing a mechanism to parse an input document that would otherwise be too large to fit into memory, while still providing a DOM interface to the document and its elements.

Additionally, the parser provides selectFirst(String query) and selectNext(String query) methods, which run the parser until a hit is found, at which point the parse is suspended. It can be resumed via another select() call, or via the stream() or iterator() methods.
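For illustration, a minimal sketch of that suspend-and-resume flow (the input HTML here is made up):

static void suspendResume() {
    StreamParser streamer = new StreamParser(Parser.htmlParser())
        .parse("<title>One</title><h1>Two</h1><p>Three</p><p>Four</p>", "");
    Element h1 = streamer.selectFirst("h1"); // runs the parser only until an <h1> is completed
    if (h1 != null)
        System.out.println(h1.text());
    streamer.stream() // resumes the parse from where it was suspended
        .forEach(el -> System.out.println(el.tagName()));
}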

Once the input has been fully read, the input Reader will be closed. Alternatively, if the whole document does not need to be read, call stop() and close().

The document() method will return the Document being parsed into, which will be only partially complete until the input is fully consumed.

A StreamParser can be reused via a new parse(Reader, String), but is not thread-safe for concurrent inputs. New parsers should be used in each thread.
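A brief sketch of that reuse (the inputs are made up):

static void reuseParser() {
    StreamParser streamer = new StreamParser(Parser.htmlParser());
    for (String html : new String[] {"<p>One</p>", "<p>Two</p>"}) {
        streamer.parse(html, "https://example.com/"); // reset the parser for the new input
        streamer.stream().forEach(el -> System.out.println(el.tagName()));
    }
}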

If created via Connection.Response#streamParser(), or another Reader that is I/O backed, the iterator and stream consumers will throw a java.io.UncheckedIOException if the underlying Reader errors during read.
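A consumer that wants a checked exception again can unwrap it, for example (a sketch; the url parameter is a placeholder):

static void guardedRead(String url) throws IOException {
    try (StreamParser streamer = Jsoup.connect(url).execute().streamParser()) {
        try {
            streamer.stream().forEach(el -> System.out.println(el.tagName()));
        } catch (UncheckedIOException e) {
            throw e.getCause(); // rethrow the underlying read failure as a checked IOException
        }
    }
}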

The StreamParser interface is currently in beta and may change in subsequent releases. Feedback on the feature and how you're using it is very welcome via the jsoup discussions.

Examples

Process a file in chunks

Assume we have a file containing a series of <book> chunks, each with many <chapter> elements; loading it all into the DOM at once might run out of memory. Instead, we can process the file in chunks by iterating on selectNext(cssquery):

static void streamChunks(Path path) throws IOException {
    try (StreamParser streamer = DataUtil.streamParser(
        path, StandardCharsets.UTF_8, "https://example.com", Parser.xmlParser())) {

        Element el;
        var seenChunks = 0;
        while ((el = streamer.selectNext("book")) != null) {
            // do something more useful! The element will have all its children elements
            Elements chapters = el.select("chapter");
            el.remove(); // remove this chunk once used to keep DOM light and not run out of memory
            seenChunks++;
        }

        Document doc = streamer.document(); // the completed doc, will just be a shell
        log("Title", doc.expectFirst("title"));
        log("Seen chunks", seenChunks);
    }
}

Parse just the meta data of a website

Assume we are building a link preview tool. All the data we need is in the head section of a page, and so there's no need to fetch and parse the complete page. This example will fetch a given URL, parse only the <head> contents and use those, and then cleanly close the request:

static void selectMeta(String url) throws IOException {
    try (StreamParser streamer = Jsoup.connect(url).execute().streamParser()) {
        Element head = streamer.selectFirst("head");
        if (head == null) return;

        log("Title", head.select("title").text());
        log("Description", head.select("meta[name=description]").attr("content"));
        log("Image", head.select("meta[name=twitter:image]").attr("content"));
    }
}

Minify the loaded DOM by removing empty text nodes

This example shows a way to progressively parse an input and remove redundant blank text nodes during the parse, resulting in a (slightly) minified DOM:

static void minifyDocument() {
    String html = "<table><tr> <td>a</td> <td>a</td> <td>a</td> <td>a</td> </tr>";
    StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "");

    streamer.stream()
        .filter(Element::isBlock)
        .forEach(el -> {
            List<TextNode> textNodes = el.textNodes();
            for (TextNode textNode : textNodes) {
                if (textNode.isBlank())
                    textNode.remove();
            }
        });

    Document minified = streamer.document();
    System.out.println(minified.body());
}

@jhy added this to the 1.18.1 milestone Jan 4, 2024
From the commit notes: most users of the StreamParser will be parsing from an InputStream (disk I/O or network access), and so reads are liable to throw. The StreamParser is AutoCloseable and will be used in a try-with-resources block, so there is no extra burden to catch these.
@jhy added the improvement label Jan 5, 2024
@jhy self-assigned this Jan 5, 2024
@jhy merged commit 2b443df into master Jan 5, 2024 (12 checks passed)
@jhy deleted the stream-parser branch January 5, 2024 00:14
@jhy (Owner, Author) commented Jan 5, 2024

All: if you're interested in this feature, it would be great if you could try it out before the next release by installing a snapshot version and bashing on it. Please comment here with any feedback (what works / doesn't work) and any suggestions on the API.

See details for building a snapshot on the download page.

@821938089 (Contributor) commented:
Missing parseBodyFragment.

Is it possible to implement the removal of unmatched nodes during the select process?

@jhy (Owner, Author) commented Jan 7, 2024

@821938089 can you give a step-by-step example of what you mean by removing unmatched nodes? Or a sample implementation of selectNext(Evaluator eval) that would do it? It seems like it would be easy to trash the document.

@821938089 (Contributor) commented:
There seems to be no way to immediately remove unmatched nodes.
I have an imperfect way to do this: keep a bounded queue of parsed nodes, add each newly parsed node to it, and remove the oldest node once the queue exceeds its limit.
If a matching node is found, remove it from the queue (so it is kept).
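One reading of that suggestion, as a sketch (the helper, query, and cap are hypothetical, not part of the PR):

static void boundedParse(StreamParser streamer, String keepQuery, int cap) {
    Deque<Element> recent = new ArrayDeque<>();
    Iterator<Element> it = streamer.iterator();
    while (it.hasNext()) {
        Element el = it.next();
        if (el.is(keepQuery)) continue; // matched nodes stay out of the eviction queue
        recent.addLast(el);
        if (recent.size() > cap) {
            // evict the oldest unmatched node; note that an evicted ancestor takes its
            // descendants with it, i.e. the "collateral damage" discussed below
            recent.removeFirst().remove();
        }
    }
}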

@jhy (Owner, Author) commented Jan 8, 2024

It's not clear to me what you are trying to get. Can you give me a step-by-step of the input -> selector -> parsed DOM -> removal?

If we just removed every non-matching node after a select, that's going to cause a lot of collateral damage.

@821938089 (Contributor) commented:
If you have a very large document and the things you're interested in are scattered throughout it, or sit at its end, parsing will cause an OOM because the entire document ends up in memory.
Your implementation of select looks like it's streaming, but it uses more and more memory.
So I was wondering if there is a way to remove the unmatched nodes to prevent OOM.

@jhy (Owner, Author) commented Jan 9, 2024

Yes, I am very aware of the risk of OOM; hence the el.remove() call in the selectNext example above (Process a file in chunks). The document() is also available for other removals. But I am not sure of a safe way to include an "auto-remove" kind of option in a generic way, which is why I am asking for a stepped example of what kind of input you have, your selector, and which elements you want removed.

I don't think auto-removing the unmatched nodes makes sense in all circumstances, as that could remove elements above and below the selected element, or there may be siblings in the tree that are required later.

One option might be to remove the previously selected & returned elements. That would help in the chunk-processing example above. But these options don't seem very general.
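That last option could look something like this caller-side sketch (a hypothetical helper, not part of the PR):

static void selectAndDiscard(StreamParser streamer, String query) {
    Element prev = null;
    Element el;
    while ((el = streamer.selectNext(query)) != null) {
        if (prev != null)
            prev.remove(); // discard the previous hit before moving on, to bound memory
        // use el here; it is still attached, with all its children
        prev = el;
    }
    if (prev != null)
        prev.remove(); // discard the final hit once done
}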

jhy added a commit that referenced this pull request Jan 10, 2024
@jhy (Owner, Author) commented Jan 10, 2024

I've added fragment parse options (and a completeFragment to pick up the relevant nodes) with 1f1f72d
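A sketch of using those fragment options (method shapes as of the linked commit; the API is in beta and may change):

static void fragmentExample() throws IOException {
    Element context = Document.createShell("").body(); // context element for the fragment
    StreamParser streamer = new StreamParser(Parser.htmlParser())
        .parseFragment("<p>One</p><p>Two</p>", context, "");
    streamer.stream().forEach(el -> System.out.println(el.tagName()));
    List<Node> nodes = streamer.completeFragment(); // the fragment's top-level nodes
    System.out.println(nodes.size());
}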
