Skip to content

Commit

Permalink
Merge branch 'jhy#1294'
Browse files Browse the repository at this point in the history
final merge for the progress report
  • Loading branch information
WinstonHuTiger committed May 8, 2020
2 parents b18df3a + 5a995d6 commit fcd2872
Show file tree
Hide file tree
Showing 2 changed files with 57 additions and 5 deletions.
50 changes: 50 additions & 0 deletions Issues/IssueFixing1294.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
## Issue 1294

### User Story

Jsoup, as a powerful third-part library to download and parse html segment from remote servers, had been widely used in various Java-write crawlers. When crawlers work, it parse various html tags by multi levels. Each level represent a tag, and a corresponding query syntax. If Jsoup cannot deliver right level logic, crawler will not work normally as what developers expected.

### Problem Description

Given a ruby html segment, Jsoup cannot parse it correctly.

![](https://imgur.com/irw8ZVa.png)

![](https://imgur.com/7nM7ckt.png)

The results in the second picture shows that Jsoup terminate `<rtc>`tag ahead of schedule. What we respect here is

```html
<html>
<head></head>
<body>
<ruby>
<rtc>
<rt>2</rt>
</rtc>
</ruby>
</body>
</html>
```

### Problem Reasons

This problem is caused by the implementation that how Jsoup handle tags. It create an enumeration that lists all possible tags it would meet.(more details could be seen in `src.main.java.org.jsoup.parser.HtmlTreeBuildState.java`) For those unknown tags, Jsoup will close it directly. However, it add a special judgment for ruby tags. More detailed code could be seen in the following code segment which is cut from original project.

![](https://imgur.com/CUX6qXL.png)

It will firstly check whether `<ruby>` tag had been pushed into stack since tag `<rp>` and `<rt>` must appeared as a child node of ruby node. Besides, it will generate end tag and pop out all node until its current node is ruby node. This logic implies that only `<rb>` and `<rt>` must be direct child of Jsoup.

### Problem Solution

There will be not more than `rb` and `rt` tag in `ruby` tag. Besides, someone could add `div` or `span` to which is meaningless but to divide up some area which maybe used to apply cascading style sheet. For instance, if we add an div tag in it, it is definitely wrong to close it in advance.

![](https://imgur.com/hJ0zwPu.png)

So the best way to deal with it is to deal it as a normal tag and just use default way to handle it.

![](https://imgur.com/X9CXqko.png)

Now Jsoup can process it correctly.

![](https://imgur.com/aNawR6S.png)
12 changes: 7 additions & 5 deletions src/main/java/org/jsoup/parser/HtmlTreeBuilderState.java
Original file line number Diff line number Diff line change
Expand Up @@ -604,12 +604,14 @@ private boolean inBodyStartTag(Token t, HtmlTreeBuilder tb) {
case "rp":
case "rt":
if (tb.inScope("ruby")) {
tb.generateImpliedEndTags();
if (!tb.currentElement().normalName().equals("ruby")) {
tb.error(this);
tb.popStackToBefore("ruby"); // i.e. close up to but not include name
}
tb.reconstructFormattingElements();
tb.insert(startTag);
// tb.generateImpliedEndTags();
// if (!tb.currentElement().normalName().equals("ruby")) {
// tb.error(this);
// tb.popStackToBefore("ruby"); // i.e. close up to but not include name
// }
// tb.insert(startTag);
}
// todo - is this right? drops rp, rt if ruby not in scope?
break;
Expand Down

0 comments on commit fcd2872

Please sign in to comment.