Bookmarks are not working with JSoup #408

Closed
Milchreis opened this issue Nov 15, 2019 · 2 comments

Milchreis commented Nov 15, 2019

When a PDF is generated from a document that was parsed with JSoup, the bookmarks are not included in the output and no error is thrown.

Example

String html = "<html>\n" +
        "<head>\n" +
        "<style>\n" +
        "div {\n" +
        "\tpage-break-after: always;\n" +
        "}\n" +
        "#toc {\n" +
        "    width: 100%;\n" +
        "    border-collapse: collapse;\n" +
        "}\n" +
        "#toc .page-number::after {\n" +
        "  /* SPECIAL STUFF HERE! */\n" +
        "  content: target-counter(attr(href), page);\n" +
        "  width: 30px;\n" +
        "}\n" +
        "</style>\n" +
        "<bookmarks>\n" +
        "  <bookmark name=\"Title of element on page 1\" href=\"#page-1\"/>\n" +
        "  <bookmark name=\"Title of element on page 2\" href=\"#page-2\"/>\n" +
        "  <bookmark name=\"Title of element on page 3\" href=\"#page-3\"/>\n" +
        "  <bookmark name=\"Title of element on page 4\" href=\"#page-4\"/>\n" +
        "</bookmarks>\n" +
        "</head>\n" +
        "<body>\n" +
        "<h1>Bookmarks and TOC example</h1>\n" +
        "\n" +
        "<h2>TOC</h2>\n" +
        "<table id=\"toc\">\n" +
        "  <tr><td><a href=\"#page-1\">Title of element on page</a></td><td class=\"page-number\" href=\"#page-1\"></td></tr>\n" +
        "  <tr><td><a href=\"#page-2\">Title of element on page</a></td><td class=\"page-number\" href=\"#page-2\"></td></tr>\n" +
        "  <tr><td><a href=\"#page-3\">Title of element on page</a></td><td class=\"page-number\" href=\"#page-3\"></td></tr>\n" +
        "  <tr><td><a href=\"#page-4\">Title of element on page</a></td><td class=\"page-number\" href=\"#page-4\"></td></tr>\n" +
        "</table>\n" +
        "\n" +
        "<div id=\"page-1\">Page 1</div>\n" +
        "<div id=\"page-2\">Page 2</div>\n" +
        "<div id=\"page-3\">Page 3</div>\n" +
        "<div id=\"page-4\">Page 4</div>\n" +
        "\n" +
        "</body>\n" +
        "</html>";

org.jsoup.nodes.Document doc = Jsoup.parse(html);

PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withW3cDocument(new W3CDom().fromJsoup(doc), null);
// builder.withHtmlContent(html, null);  <- this works with bookmarks

builder.toStream(output);
builder.run();

syjer commented Nov 15, 2019

Hi @Milchreis,

This is caused by the HTML5 parsing rules.

If you call doc.outerHtml(), you will get the following representation:

<html>
 <head> 
  <style>
div {
	page-break-after: always;
}
#toc {
    width: 100%;
    border-collapse: collapse;
}
#toc .page-number::after {
  /* SPECIAL STUFF HERE! */
  content: target-counter(attr(href), page);
  width: 30px;
}
</style> 
 </head>
 <body>
  <bookmarks> 
   <bookmark name="Title of element on page 1" href="#page-1" /> 
   <bookmark name="Title of element on page 2" href="#page-2" /> 
   <bookmark name="Title of element on page 3" href="#page-3" /> 
   <bookmark name="Title of element on page 4" href="#page-4" /> 
  </bookmarks>   
  <h1>Bookmarks and TOC example</h1> 
  <h2>TOC</h2> 
  <table id="toc"> 
   <tbody>
    <tr>
     <td><a href="#page-1">Title of element on page</a></td>
     <td class="page-number" href="#page-1"></td>
    </tr> 
    <tr>
     <td><a href="#page-2">Title of element on page</a></td>
     <td class="page-number" href="#page-2"></td>
    </tr> 
    <tr>
     <td><a href="#page-3">Title of element on page</a></td>
     <td class="page-number" href="#page-3"></td>
    </tr> 
    <tr>
     <td><a href="#page-4">Title of element on page</a></td>
     <td class="page-number" href="#page-4"></td>
    </tr> 
   </tbody>
  </table> 
  <div id="page-1">
   Page 1
  </div> 
  <div id="page-2">
   Page 2
  </div> 
  <div id="page-3">
   Page 3
  </div> 
  <div id="page-4">
   Page 4
  </div>  
 </body>
</html>

Notice that the bookmarks have been moved from the head to the body. The code fetches the bookmarks only from the head ( DOMUtil.getChild(head, "bookmarks") ), so they are not found.
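Until the lookup changes, one possible workaround is to move the bookmarks element back into the head on the W3C document before handing it to the builder. The following is only a sketch under that assumption: BookmarkRelocator, firstChildElement and moveBookmarksToHead are hypothetical names that are not part of openhtmltopdf or JSoup, and it assumes the converted document has html, head and body elements with those plain tag names.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of openhtmltopdf or JSoup.
final class BookmarkRelocator {

    /** Returns the first direct child element with the given tag name, or null. */
    static Element firstChildElement(Node parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && name.equalsIgnoreCase(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /**
     * Moves a <bookmarks> element found directly under <body> into <head>.
     * Returns true only if an element was actually moved.
     */
    static boolean moveBookmarksToHead(Document doc) {
        Element root = doc.getDocumentElement();
        if (root == null) {
            return false;
        }
        Element head = firstChildElement(root, "head");
        Element body = firstChildElement(root, "body");
        if (head == null || body == null) {
            return false;
        }
        // Nothing to do if the head already contains the bookmarks.
        if (firstChildElement(head, "bookmarks") != null) {
            return false;
        }
        Element bookmarks = firstChildElement(body, "bookmarks");
        if (bookmarks == null) {
            return false;
        }
        // appendChild detaches the node from its old parent before re-attaching it.
        head.appendChild(bookmarks);
        return true;
    }
}
```

With the example from this issue, the helper would be called on the result of new W3CDom().fromJsoup(doc) before passing it to builder.withW3cDocument(...).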

Note: I've tried with my HTML5 parser (https://github.com/digitalfondue/jfiveparse) and the output is a little different (note how the self-closing "bookmark" elements are interpreted):

<html><head>
<style>
div {
	page-break-after: always;
}
#toc {
    width: 100%;
    border-collapse: collapse;
}
#toc .page-number::after {
  /* SPECIAL STUFF HERE! */
  content: target-counter(attr(href), page);
  width: 30px;
}

</style>
</head><body><bookmarks>
  <bookmark name="Title of element on page 1" href="#page-1">
  <bookmark name="Title of element on page 2" href="#page-2">
  <bookmark name="Title of element on page 3" href="#page-3">
  <bookmark name="Title of element on page 4" href="#page-4">
</bookmark></bookmark></bookmark></bookmark></bookmarks>


<h1>Bookmarks and TOC example</h1>

<h2>TOC</h2>
<table id="toc">
  <tbody><tr><td><a href="#page-1">Title of element on page</a></td><td class="page-number" href="#page-1"></td></tr>
  <tr><td><a href="#page-2">Title of element on page</a></td><td class="page-number" href="#page-2"></td></tr>
  <tr><td><a href="#page-3">Title of element on page</a></td><td class="page-number" href="#page-3"></td></tr>
  <tr><td><a href="#page-4">Title of element on page</a></td><td class="page-number" href="#page-4"></td></tr>
</tbody></table>


<div id="page-1">Page 1</div>
<div id="page-2">Page 2</div>
<div id="page-3">Page 3</div>
<div id="page-4">Page 4</div>


</body></html>

This is arguably more correct, as Chrome interprets the HTML the same way:
Screenshot from 2019-11-15 14-55-55

So I guess that the DOMUtil.getChild(head, "bookmarks") lookup should also check the body as a fallback.
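The fallback idea can be sketched as follows. This is not the actual openhtmltopdf code: BookmarkLookup, childByName and findBookmarks are hypothetical names, and childByName is only a stand-in for whatever DOMUtil.getChild does internally.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sketch of a head-then-body lookup; not the library's implementation.
final class BookmarkLookup {

    /** Stand-in for a direct-child lookup by tag name; returns null if absent. */
    static Element childByName(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && name.equalsIgnoreCase(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** Looks for <bookmarks> in <head> first, then falls back to <body>. */
    static Element findBookmarks(Document doc) {
        Element root = doc.getDocumentElement();
        if (root == null) {
            return null;
        }
        Element head = childByName(root, "head");
        Element found = (head != null) ? childByName(head, "bookmarks") : null;
        if (found == null) {
            // HTML5 parsers move unknown elements out of the head, so check the body too.
            Element body = childByName(root, "body");
            found = (body != null) ? childByName(body, "bookmarks") : null;
        }
        return found;
    }
}
```

Checking the head first keeps the existing behaviour for documents where the bookmarks stay where the author put them.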

I think I can provide a PR for that. @danfickle, what do you think?

@Milchreis

Thank you guys. Waiting for the next release 😊
