Progressive parsing with StreamParser #2096
Conversation
This was failing on the CI build for Mac.
Vs. an UncheckedIOException: most users of the StreamParser will be parsing from an InputStream (disk I/O or network access), and so these are liable to throw. The StreamParser is AutoCloseable, so it will be used in a try-with-resources block; there's no extra burden to catch these.
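To make that concrete, a minimal sketch of the pattern under discussion (the URL and class name are illustrative, not from the PR):

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.UncheckedIOException;

public class StreamReadErrors {
    public static void main(String[] args) throws IOException {
        // illustrative URL; execute() and streamParser() throw checked IOException
        try (Connection.Response res = Jsoup.connect("https://example.com/").execute();
             StreamParser streamer = res.streamParser()) {
            streamer.stream().forEach(el -> System.out.println(el.tagName()));
        } catch (UncheckedIOException e) {
            // surfaced by the iterator/stream if the underlying Reader errors mid-parse
            System.err.println("Read failed during parse: " + e.getCause());
        }
    }
}
```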
All: if you're interested in this feature, it would be great if you could try it out before the next release by installing a snapshot version and bashing on it. Please comment here with any feedback (what works / doesn't work) and any suggestions on the API. See details for building a snapshot on the download page.
parseBodyFragment is missing. Is it possible to implement the removal of unmatched nodes during the select process?
@821938089 can you give a step-by-step example of what you mean by removing unmatched nodes, or a sample implementation?
There seems to be no way to do immediate removal of mismatched nodes.
It's not clear to me what you are trying to get; can you give me a step-by-step of the input -> selector -> parsed DOM -> removal? If we just removed every non-matching node after a select, that's going to cause a lot of collateral damage.
If you have a very large document and the things you're interested in are scattered throughout it, or at the end, you're going to hit an OOM during the parse, because the entire document will be held in memory.
Yes, I am very aware of the risk of OOM; hence in the selectNext example above (Process a file in chunks), each selected element is removed after it has been processed. I don't think auto-removing the unmatched nodes makes sense in all circumstances, as that could remove elements above and below the selected element, or there may be siblings in the tree that are required later. One option might be to remove the previously selected & returned elements; that would help in the example above as a chunked process. But these don't seem very general.
I've added fragment parse options (and a completeFragment() to pick up the relevant nodes) with 1f1f72d.
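If I'm reading that commit right, a rough sketch of the fragment flow, assuming a parseFragment(Reader, Element, String) entry point and a completeFragment() that returns the fragment's nodes (the table-row input here is illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.StringReader;
import java.util.List;

public class FragmentSketch {
    public static void main(String[] args) throws IOException {
        // context element, so the fragment parses as it would inside a <table>
        Element context = Jsoup.parse("<table></table>").selectFirst("table");
        try (StreamParser streamer = new StreamParser(Parser.htmlParser())
                .parseFragment(new StringReader("<tr><td>One</td></tr>"), context, "")) {
            // run to completion and collect the fragment's top-level nodes
            List<Node> nodes = streamer.completeFragment();
            for (Node n : nodes) System.out.println(n.outerHtml());
        }
    }
}
```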
A StreamParser provides a progressive parse of its input. As each Element is completed, it is emitted via a Stream or Iterator interface. Elements returned will be complete with all their children, and an (empty) next sibling, if applicable.
Elements (or their children) may be removed from the DOM during the parse, e.g. to conserve memory, providing a mechanism to parse an input document that would otherwise be too large to fit into memory, yet still providing a DOM interface to the document and its elements.
Additionally, the parser provides selectFirst(String query) / selectNext(String query) methods, which will run the parser until a hit is found, at which point the parse is suspended. It can be resumed via another select() call, or via the stream() or iterator() methods.

Once the input has been fully read, the input Reader will be closed. Or, if the whole document does not need to be read, call stop() and close().

The document() method will return the Document being parsed into, which will be only partially complete until the input is fully consumed.

A StreamParser can be reused via a new parse(Reader, String), but it is not thread-safe for concurrent inputs; new parsers should be used in each thread.

If created via Connection.Response#streamParser(), or another Reader that is I/O backed, the iterator and stream consumers will throw a java.io.UncheckedIOException if the underlying Reader errors during read.

The StreamParser interface is currently in beta and may change in subsequent releases. Feedback on the feature and how you're using it is very welcome via the jsoup discussions.
Examples
Process a file in chunks
Assume we have a file with many <book> chunks, each containing many <chapter> elements; loading it all into the DOM at once might run out of memory. Instead, process the file in chunks by iterating on selectNext(query):
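Something along these lines should work; a sketch in which the file name and the process step are illustrative placeholders:

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BookChunks {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("books.html"); // illustrative input file
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
             StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(reader, "")) {
            Element book;
            while ((book = streamer.selectNext("book")) != null) {
                process(book);  // work on the completed chunk, with all its children
                book.remove();  // then drop it from the DOM to keep memory bounded
            }
        }
    }

    static void process(Element book) { // placeholder for real per-chunk work
        System.out.println(book.select("chapter").size() + " chapters");
    }
}
```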
Parse just the meta data of a website
Assume we are building a link preview tool. All the data we need is in the head section of a page, so there's no need to fetch and parse the complete page. This example will fetch a given URL, parse only the <head> contents and use those, and then cleanly close the request:
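A sketch of that flow, assuming an illustrative URL and the Connection.Response#streamParser() entry point described above:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.StreamParser;

import java.io.IOException;

public class LinkPreview {
    public static void main(String[] args) throws IOException {
        String url = "https://example.com/"; // illustrative URL
        try (Connection.Response res = Jsoup.connect(url).execute();
             StreamParser streamer = res.streamParser()) {
            // the parse suspends as soon as the <head> element is complete
            Element head = streamer.selectFirst("head");
            if (head != null) {
                Element title = head.selectFirst("title");
                Element desc = head.selectFirst("meta[name=description]");
                System.out.println(title != null ? title.text() : "(no title)");
                System.out.println(desc != null ? desc.attr("content") : "(no description)");
            }
        } // closing the parser stops the parse and releases the connection
    }
}
```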
Minify the loaded DOM by removing empty text nodes
This example shows a way to progressively parse an input and remove redundant empty text nodes during the parse, resulting in a (slightly) minified DOM:
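A sketch of one way to do that, using stream() and dropping blank text-node children as each element completes (the input HTML is illustrative):

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.StringReader;

public class MinifyExample {
    public static void main(String[] args) throws IOException {
        String html = "<div>\n  <p> Text </p>\n  <p></p>\n</div>";
        try (StreamParser streamer = new StreamParser(Parser.htmlParser())
                .parse(new StringReader(html), "")) {
            // as each element completes, remove its whitespace-only text node children;
            // textNodes() returns a snapshot list, so removal during iteration is safe
            streamer.stream().forEach(el ->
                el.textNodes().stream()
                  .filter(TextNode::isBlank)
                  .forEach(TextNode::remove));
            Document doc = streamer.document(); // fully parsed once the stream is consumed
            System.out.println(doc.html());
        }
    }
}
```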