This is mostly to keep track of my investigation of this oss-fuzz-found bug.

Minimal test case with `max_tree_depth: 1`:

```html
<html><body>
```
The issue seems to occur when the tree depth exceeds the limit because of a parser-inserted node, at a point where the current token still needs to be reprocessed in a new insertion mode.
In more detail, the gumbo parser works on tokens produced by the tokenizer. The standard requires tokens to be reprocessed in a different insertion mode in a variety of circumstances. In some of these cases, the parser will insert a node into the tree before reprocessing. If this happens and the depth increases beyond the limit, then the memory allocated for the token is leaked. (I think; I'm still debugging.)
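To make the shape of the problem concrete, here's a minimal sketch of the control flow as I understand it. All names below are hypothetical stand-ins, not gumbo's actual internals:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Parser Parser;
typedef struct Token Token;   /* may own heap-allocated buffers */

/* Hypothetical helpers, for illustration only: */
bool lex_next_token(Parser* p, Token* t);
bool handle_token(Parser* p, Token* t);   /* true => reprocess token */
unsigned tree_depth(const Parser* p);
unsigned depth_limit(const Parser* p);
void finish_parsing(Parser* p);

void parse(Parser* parser) {
  Token token;
  while (lex_next_token(parser, &token)) {
    bool reprocess;
    do {
      reprocess = handle_token(parser, &token);
      if (tree_depth(parser) > depth_limit(parser)) {
        /* BUG: if reprocess is still true here, ownership of the
         * token's buffers was never transferred to the DOM, so
         * bailing out of the loop leaks them. */
        goto done;
      }
    } while (reprocess);
    /* Once fully handled, the token's memory belongs to the tree
     * (or has been freed by the handler). */
  }
done:
  finish_parsing(parser);
}
```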
Running this with gumbo debugging turned on (and some additional, uncommitted print statements) gives the following output:
```
Parsing <html><body>.
Lexing character '<' (60) in state 0.
Lexing character 'h' (104) in state 5.
Starting new tag.
Lexing character 'h' (104) in state 7.
Lexing character 't' (116) in state 7.
Lexing character 'm' (109) in state 7.
Lexing character 'l' (108) in state 7.
Lexing character '>' (62) in state 7.
Emitted start tag html.
Original text = <html>.
Handling html token @1:1 in state 0.
Adding parse error.
Handling html token @1:1 in state 1.
Inserting <html> element (@0x6000021141e0) from token.
Lexing character '<' (60) in state 0.
Lexing character 'b' (98) in state 5.
Starting new tag.
Lexing character 'b' (98) in state 7.
Lexing character 'o' (111) in state 7.
Lexing character 'd' (100) in state 7.
Lexing character 'y' (121) in state 7.
Lexing character '>' (62) in state 7.
Emitted start tag body.
Original text = <body>.
Handling body token @1:7 in state 2.
Current node: <html>.
Inserting head element (@0x600002114e60) from tag type.
start tag: 0x0
Tree depth limit exceeded.
Finishing parsing
Popping head node.
Popping html node.
```
The key portion showing what's happening is:

```
Handling body token @1:7 in state 2.
Current node: <html>.
Inserting head element (@0x600002114e60) from tag type.
```
`state 2` here should be insertion mode 2, namely `GUMBO_INSERTION_MODE_BEFORE_HEAD`. The "before head" insertion mode says a `<body>` token should be handled by inserting a `head` element and then reprocessing the token in the "in head" insertion mode.
Inserting that `head` element pushes the tree depth to 2, and we stop parsing at that point. The memory allocated for the `<body>` token is never transferred to the DOM, so it leaks.
If the example is changed to `<html><head>`, then the parser handles `<head>` by creating a `head` element directly from the token, and no memory is leaked even though the max tree depth is exceeded.
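A sketch of the "before head" handler, per the HTML standard, shows why the two cases differ. Again, every name here is a hypothetical stand-in (though the two insertion helpers deliberately echo the "from token" / "from tag type" lines in the debug output):

```c
#include <stdbool.h>

typedef struct Parser Parser;
typedef struct Token Token;

/* Hypothetical helpers: */
bool is_start_tag(const Token* t, const char* name);
void insert_element_from_token(Parser* p, Token* t);   /* "from token" */
void insert_element_of_tag_type(Parser* p, int tag);   /* "from tag type" */
void set_mode(Parser* p, int mode);
enum { TAG_HEAD, MODE_IN_HEAD };

bool handle_before_head(Parser* parser, Token* token) {
  if (is_start_tag(token, "head")) {
    /* The head element is built *from* the token, so the token's
     * memory ends up owned by the DOM: no leak, even if this
     * insertion exceeds the depth limit. */
    insert_element_from_token(parser, token);
    set_mode(parser, MODE_IN_HEAD);
    return false;   /* token fully consumed */
  }
  /* Anything else (e.g. <body>): synthesize a head element, then ask
   * the outer loop to reprocess the same token in "in head". */
  insert_element_of_tag_type(parser, TAG_HEAD);   /* depth grows here */
  set_mode(parser, MODE_IN_HEAD);
  return true;      /* reprocess; token not yet consumed */
}
```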
The solution to this seems to be to delay checking the max tree depth until the token has been completely processed. Unfortunately, this fix doesn't actually make the test case from oss-fuzz pass: it causes an assert to fire, which I think is indicative of another bug, but I haven't tracked that one down yet.
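Reusing the hypothetical names from the first sketch, the proposed fix amounts to moving the depth check out of the reprocess loop, so the token has always been fully handled before we bail out:

```c
/* Inner loop only; a fragment of the parse() sketch above. */
do {
  reprocess = handle_token(parser, &token);
} while (reprocess);
if (tree_depth(parser) > depth_limit(parser)) {
  goto done;   /* token already consumed or freed: no leak */
}
```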
The oss-fuzz test case contains `</tr$<a>` as the final token.
```
Lexing character '<' (60) in state 0.
Lexing character '/' (47) in state 5.
Lexing character 't' (116) in state 6.
Starting new tag.
Lexing character 't' (116) in state 7.
Lexing character 'r' (114) in state 7.
Lexing character '$' (36) in state 7.
Lexing character '<' (60) in state 7.
Lexing character 'a' (97) in state 7.
Lexing character '>' (62) in state 7.
Emitted end tag .
Original text = </tr$<a>.
Handling token @1:2888 in state 9.
Current node: <tr>.
Adding parse error.
Adding parse error.
Adding parse error.
Adding parse error.
Adding parse error.
Adding parse error.
Setting frameset_ok to false.
Reconstructing elements from 333 on tr parent.
Reconstructed a element at 333.
Flushing text node buffer of s>R>o .
end token: 0x600003106020 151
Tree depth limit exceeded.
Finishing parsing
Popping a node.
...
```
I haven't managed to minimize this test case yet, but the `Reconstructing elements...` and `Reconstructed a element...` steps involve inserting new nodes into the tree, and I think that's what's pushing it over the limit.
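For context, "reconstruct the active formatting elements" in the HTML standard clones elements off the active formatting elements list and inserts them one by one, so a single character token can grow the tree by several levels at once. A rough sketch, again with hypothetical names:

```c
#include <stddef.h>

typedef struct Parser Parser;
typedef struct Element Element;

/* Hypothetical helpers over the active formatting elements list: */
size_t first_unopened_entry(const Parser* p);
size_t afe_count(const Parser* p);
Element* afe_entry(const Parser* p, size_t i);
Element* clone_element(const Element* e);
void insert_element(Parser* p, Element* e);
void afe_replace_entry(Parser* p, size_t i, Element* e);

void reconstruct_active_formatting_elements(Parser* parser) {
  for (size_t i = first_unopened_entry(parser); i < afe_count(parser); i++) {
    Element* clone = clone_element(afe_entry(parser, i));
    insert_element(parser, clone);   /* tree depth increases here */
    afe_replace_entry(parser, i, clone);
  }
}
```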
I'd like to understand this a bit better before I'm confident that delaying the stop of the parse until the token is fully handled fixes this issue (and I also need to figure out what is causing the assertion to fail on the oss-fuzz test case).
When a token causes a node to be added to the DOM which increases the depth of the DOM beyond `max_tree_depth` _and_ the token needs to be reprocessed, memory is leaked. Delaying breaking out of the loop until after the token has been completely handled appears to fix the leak.

Fixes sparklemotion#3098