Implement `target-counter` to create table of contents #23

sander76 · 2012-12-13T12:24:31Z

I'd like to automatically create a table of contents in my document.
I am thinking of using a small piece of Javascript....
But I have a feeling WeasyPrint doesn't process javascript.. ?
Or are there other ways of doing this ?

Sander.

SimonSapin · 2012-12-13T12:45:38Z

Indeed, WeasyPrint completely ignores JavaScript.

If you’re generating HTML from something else, maybe you can generate a table of contents at the same time. For example, docutils can do this with source files in reStructuredText format.

Otherwise, you could parse an HTML document with lxml, manipulate it in Python with the lxml API, and pass the lxml tree to WeasyPrint with HTML(tree=something). You could consider Python as WeasyPrint’s javascript ;)
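A minimal sketch of the "build the table of contents in Python" idea, without page numbers. It uses the standard library's `html.parser` rather than lxml so that it runs with no extra dependencies; `HeadingCollector` and `build_toc` are illustrative names, not part of WeasyPrint. The resulting fragment could be spliced into the source HTML before handing it to WeasyPrint.

```python
from html import escape
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, id, text) for every <h1>..<h6> that has an id."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # [level, id, text parts] while inside a heading

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            heading_id = dict(attrs).get("id")
            if heading_id:
                self._current = [int(tag[1]), heading_id, []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag == "h%d" % self._current[0]:
            level, heading_id, parts = self._current
            self.headings.append((level, heading_id, "".join(parts).strip()))
            self._current = None


def build_toc(html_source):
    """Return an HTML fragment linking to each heading found in html_source."""
    collector = HeadingCollector()
    collector.feed(html_source)
    items = [
        '<li class="level-%d"><a href="#%s">%s</a></li>'
        % (level, escape(heading_id), escape(text))
        for level, heading_id, text in collector.headings
    ]
    return '<ul class="toc">\n%s\n</ul>' % "\n".join(items)
```

Headings without an `id` are skipped here, since there would be nothing to link to; a real script would probably generate ids for them first.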

Unfortunately in any case, you won’t get page numbers in this table of contents. That would require something in CSS like target-counter() which really needs to be in WeasyPrint’s layout engine.

sander76 · 2012-12-13T19:32:20Z

Okay thanks.
About target-counter() : It is not implemented yet into the layout engine is it ? (trying it out, but throws errors)

SimonSapin · 2012-12-13T19:36:06Z

No, it’s not implemented at all. We’re thinking about how to do it, but it’s not obvious at all. It’s also somewhat low-priority.

When you say it throws errors, is it a message logged on stderr or do you get a Python exception with a full traceback? The latter would be a bug.

sander76 · 2012-12-14T10:30:47Z

I guess there are no other options to create a TOC with page reference then ?

SimonSapin · 2012-12-14T10:52:50Z

No, it’s not possible without a lot of work in WeasyPrint itself, or without very dirty hacks.

I really recommend not to, but if you want to go with the dirty hack look at the Document.make_bookmark_tree() method. The problem is that by the time you can use it, the whole layout is done so it’s very contrived to add content at this point.

API: http://weasyprint.org/docs/api/#weasyprint.document.Document.make_bookmark_tree
Usage example: http://weasyprint.org/docs/tutorial/#individual-pages-meta-data-other-output-formats

ksaylor · 2013-02-07T23:10:46Z

WeasyPrint is almost the perfect solution for me - except the fact that it doesn't support target-counter yet :-( I do use target-counter for getting the page numbers as you have suggested.

SimonSapin · 2013-02-08T08:16:24Z

About target-counter() in CSS:

Adding it for counters other than page or pages shouldn’t be too hard.
page is still straightforward if the target is earlier in the document: the page number of the target is known when we need it.
Other cases are much more tricky: the counter value is not known by the time we need it. This would require some heuristics or iterative algorithm. Unfortunately this is the common case, with eg. a table of contents at the beginning of a document. It is for similar reasons that you need to run LaTeX twice (or sometimes more) for it to get cross-references right.
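To illustrate why forward references need more than one pass, here is a toy fixed-point iteration. It is only a model, not WeasyPrint's layout code: `layout` and `resolve_page_numbers` are made-up names, and the rule that a two-digit page number makes a TOC entry wrap onto a second line is an arbitrary assumption chosen so that the page numbers feed back into the layout.

```python
import math


def layout(section_lengths, guessed_pages, toc_lines_per_page=4):
    """Toy layout: return the page on which each section starts.

    The TOC comes first. An entry takes one line if its guessed page number
    has a single digit, and two lines otherwise.
    """
    toc_lines = sum(1 if page < 10 else 2 for page in guessed_pages)
    toc_pages = max(1, math.ceil(toc_lines / toc_lines_per_page))
    starts = []
    page = toc_pages + 1
    for length in section_lengths:
        starts.append(page)
        page += length
    return starts


def resolve_page_numbers(section_lengths, max_passes=10):
    """Re-run the layout until the page numbers it prints match reality."""
    guess = [0] * len(section_lengths)
    for passes in range(1, max_passes + 1):
        actual = layout(section_lengths, guess)
        if actual == guess:
            return actual, passes
        guess = actual
    raise RuntimeError("page numbers did not stabilise")


print(resolve_page_numbers([1, 1]))        # stable after 2 passes
print(resolve_page_numbers([3, 4, 5, 2]))  # needs 3 passes
```

In the second call, the second pass moves a section from page 9 to page 10, which lengthens the TOC entry, so a third pass is needed to confirm that nothing moved again.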

gnapse · 2014-06-19T19:41:24Z

Hey there. I was wondering if this is still not supported? I'm considering migrating from wkhtmltopdf to WeasyPrint, because of their lousy support for page-break-after: avoid, but losing the ability to generate a table of contents with page numbers would be a big deal.

SimonSapin · 2014-06-19T20:43:32Z

@gnapse Indeed, not much progress on this front since the discussion above. In addition to the trickiness described, it’s also a matter of someone doing the work.

liZe · 2014-06-20T15:28:33Z

I've tried for fun to implement target-counter last week-end, and I can confirm that it really needs a lot of work, even with very dirty hacks ;).

bitdivine · 2014-08-31T17:01:18Z

Are there some simple cases that can be attacked first? For example, the table of contents typically appears at the beginning of a document, so it appears to be vulnerable to the LaTeX instability, however the pages at the beginning of a document are frequently numbered with Roman numerals, so the fact that the table of contents is at the beginning doesn't actually matter if it indexes only the main body of a document. I haven't looked at the code in any depth yet but I guess I will soon! Realistically I don't have much time so if it's that complicated I'll fail but I can offer a crate of beer to anyone who gets this in :-) http://comicsagogo.files.wordpress.com/2011/10/asterix-and-british-food2.jpg

SimonSapin · 2014-11-10T17:13:49Z

@bitdivine The problem is with finding out the page number for elements that are later in the document (which haven’t been processed yet.) This is unfortunately the common case for table of contents at the beginning of a document.

bitdivine · 2014-11-11T23:55:28Z

How about something like this: Push the header, including the table of contents, to the end of the file. Now page numbers aren't a problem. Finally in the function "layout_document" roll the end back to the beginning: pages = pages[index_of_head:]+pages[:index_of_head]? However what is a nice way of parameterising the shuffle? Can we use omega notation for the page numbers? I can try this when I get home.

bitdivine · 2014-11-12T00:07:32Z

Here is a proof of concept. It reverses the pages and the first page (in my toy document) is page 14 and labelled as such:

bitdivine@dc99c22

Can we replace that reverse with sort(...) of some kind? Is a good way of representing this in HTML to have divs that enclose booklets and order first by booklet, then by page number?

bitdivine · 2014-11-12T00:16:16Z

Oops - fix:

-    rendered_pages = rendered_pages.reverse()
+    rendered_pages.reverse()

SimonSapin · 2014-11-12T06:01:37Z

If you call .render() on an HTML object, you get a Document object whose .pages attribute is a list of Page objects. You can shuffle this list all you want before calling e.g. .write_pdf() on the document, without modifying the layout code. I suppose you could use the pages’ anchors attribute to check which page contains a given element, if you give it an ID: e.g. <h2 id=toc> and if 'toc' in page.anchors: ...
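A rough sketch of that page shuffle, assuming the table of contents sits at the end of the source and its heading carries `id="toc"`. `move_anchor_page_first` and `write_pdf_with_toc_first` are hypothetical helper names. The WeasyPrint import is kept inside the second function so that the pure list logic can be tried without WeasyPrint installed.

```python
def move_anchor_page_first(pages, anchor):
    """Return the pages rotated so the page containing `anchor` comes first."""
    for index, page in enumerate(pages):
        if anchor in page.anchors:
            return pages[index:] + pages[:index]
    raise ValueError("no page contains anchor %r" % anchor)


def write_pdf_with_toc_first(html_source, target, anchor="toc"):
    """Render, move the trailing TOC pages to the front, then write the PDF."""
    from weasyprint import HTML  # imported lazily; requires WeasyPrint

    document = HTML(string=html_source).render()
    reordered = move_anchor_page_first(document.pages, anchor)
    # Document.copy() builds a new document from a chosen list of pages
    document.copy(reordered).write_pdf(target)
```

This only reorders already laid-out pages, so any page numbers printed in the margins keep the values they had before the move.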

See details in the documentation: http://weasyprint.org/docs/api/

So yeah, I suppose we could add limited support for target-counter() that only works if the target is earlier in the document, then you could have a TOC at the end and shuffle pages to put it at the beginning again. I guess you’d want something like @page :first { counter-reset: page 2 }, not sure if that works right now. Maybe named pages (#57) would be useful there as well.

This sounds reasonable. It’s just a matter of someone doing the work now. Unfortunately nobody is actively working on WeasyPrint at the moment.

SimonSapin · 2014-11-12T06:03:04Z

Right, the counter-reset: page thing is bug #93.

mzu · 2015-03-16T13:21:16Z

Hi. Is there any news on implementing support for target-counter()? Are there any alternative ways to get a ToC with page numbers working with WeasyPrint?

SimonSapin · 2015-03-16T13:40:51Z

@mzu, please refer to #23 (comment).

eenblam · 2015-08-06T16:26:36Z

Did anything happen related to the proof of concept @bitdivine posted last year? The docs don't seem to suggest so, this ~~question~~ issue is still open, the relevant source doesn't look like it supports it, and @bitdivine doesn't seem to have gone any farther with it.

If not, I might hack on it a bit and report back with some toy examples.

SimonSapin · 2015-08-06T18:12:14Z

@ingcake The last paragraph of #23 (comment) is still relevant.

eenblam · 2015-08-06T18:31:08Z

Thanks. I'll have a look at it sometime in the next week.

bitdivine · 2015-08-06T18:31:18Z

I'm afraid I haven't followed up, and I'm unlikely to have time to do so in the near future. I can but wish you bonne chance!

eenblam · 2015-08-09T17:13:51Z

I think I see a strategy for tackling the ToC issue as well as front matter in general. My concern is that this might fall into the aforementioned "very dirty hack" territory, and it requires fixing named pages.

The user could label specific content with id=titlepage, class=frontmatter, class=content. The document itself would also follow this order - the user would not place the table of contents at the end of the document. Then, we'd do something along these lines:

Prior to rendering, append a copy of the front matter to the end of the document. Leave the original in place to serve as a dummy of sorts.
The front matter receives page numbers via counter(frontmatter, lower-roman). This counter needs to be reset so that the copy at the back of the document has the right page numbering.
Call render(). The original (dummy) front matter will provide page numbers for the table of contents that lives in the copied front matter at the end of the document.
Remove the dummy front matter using page.anchors, and replace it with the copied front matter at the end of the document.

If the above doesn't sound too hackish, I'd be interested in working on named pages (#57). With that worked out, the above should be far easier to implement. It seems like it would have a nontrivial impact on the WeasyPrint spec, though. EDIT: Namely, the user would need to know what id's WeasyPrint will look for.

mpicard · 2016-05-19T15:13:37Z

bump? I really need a TOC, any update on named pages?

liZe · 2016-05-23T12:19:47Z

@mpicard bump? I really need a patch ;)

mpicard · 2016-05-23T16:26:46Z

@liZe bring me up to speed? From what I saw named pages is required as well?

liZe · 2016-05-24T23:12:04Z

> @liZe I gave a similar disclaimer for my approach, so I'm not arguing that my proposed solution is at all ideal.

Yes, no offense!

> EDIT: The related CSS development doesn't seem like a priority for W3 at the moment. Perhaps a hack solution would be helpful in the meantime. It's possible that I can allocate time for this soon, since it's relevant to an upcoming project, but I can't make any promises today. I'm also not opposed to assisting with your (more sustainable) solution as time allows.

Cool! Adding a table of contents (tables and pages) at the end of the document is easy with only Python and lxml.

> I really recommend not to, but if you want to go with the dirty hack look at the Document.make_bookmark_tree() method. The problem is that by the time you can use it, the whole layout is done so it’s very contrived to add content at this point.

@SimonSapin is too shy to admit that his ideas are generally damn good. I really recommend not to follow his "I really recommend not to", just try to play with his awful dirty hack: it's a good way to understand how it currently works, what can be done (adding the ToC at the end) and what's impossible without big changes (getting page numbers before rendering the whole document, getting the titles without lxml, etc.)

mpicard · 2016-05-25T11:49:28Z

@eenblam send me a message if you decide to dig into this, I need a fix for this as well so I would be willing to help out and discuss with you what I find.

cliuser · 2016-12-04T22:26:59Z

Too bad you can't handle JS. Phil Schatz's css-polyfills.js shim handles a slew of CSS3 generated content. I have a demo HTML doc with autogenerated TOC, LOF, LOT, and Acronym sections (MIL-STD style). All it needs is leader() for the front matter.

doronhorwitz · 2017-07-01T18:23:31Z

I do realise that this issue had its name changed to handle the implementation of 'target-counter', however, whenever you search for information about generating a table of contents using WeasyPrint, you land up here, and the contents of this issue thread make it seem as if generating the table of contents isn't straightforward. But I'd just like to emphasize that it is straightforward, making WeasyPrint even more appealing. The 3rd comment by @SimonSapin at the beginning of the thread basically explains how to do it, but I'd just like to outline what I did so that anyone coming back here doesn't leave as disappointed as I originally left, because as I said, it is possible and straightforward (and does not use "very dirty hacks" nor is it "contrived", as @SimonSapin put it):

generate the document object of the content you're making into a PDF using <h1>...<h6> tags to generate bookmarks, something like:

from weasyprint import HTML

html = """
<h1>A Title</h1>
<p>Some content</p>
<h2>A subtitle</h2>
<p>Some more content</p>
"""
document = HTML(string=html).render()

To generate the table of contents string, use the code slightly modified from this link (from the WeasyPrint website!) - search for "Print the outline of the document.":

def generate_outline_str(bookmarks, indent=0):
    outline_str = ""
    for i, (label, (page, _, _), children) in enumerate(bookmarks, 1):
        outline_str += ('%s%d. %s (page %d)' % (
            ' ' * indent, i, label.lstrip('0123456789. '), page))
        outline_str += generate_outline_str(children, indent + 2)
    return outline_str
table_of_contents_string = generate_outline_str(document.make_bookmark_tree())

Note: you'll definitely have to modify this code to format the table of contents in HTML nicely
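One possible way to do that formatting, as a sketch: the function below turns the bookmark tree into nested `<ol>` lists. `outline_to_html` is an illustrative name, and it assumes the tree shape used above, `(label, (page, x, y), children)`, with a 0-based page index (hence the `+ 1`); newer WeasyPrint versions may return a different tuple shape, so check the API docs for the version in use.

```python
from html import escape


def outline_to_html(bookmarks):
    """Render a bookmark tree as nested <ol> lists with page numbers."""
    if not bookmarks:
        return ""
    items = []
    for label, (page, _x, _y), children in bookmarks:
        items.append(
            '<li>%s <span class="page">%d</span>%s</li>'
            # page is assumed to be a 0-based index, so add 1 for display
            % (escape(label), page + 1, outline_to_html(children))
        )
    return "<ol>%s</ol>" % "".join(items)


# Hand-written tree standing in for document.make_bookmark_tree()
sample = [("A Title", (0, 0, 0), [("A subtitle", (0, 0, 0), [])])]
print(outline_to_html(sample))
```

The `page` class gives a stylesheet something to hook into, for example to float the numbers to the right.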

generate a table of contents page in a new document

table_of_contents_document = HTML(string=table_of_contents_string).render()
table_of_contents_page = table_of_contents_document.pages[0]

insert the table of contents page into the original document

document.pages.insert(0, table_of_contents_page)

write your PDF

document.write_pdf(target='myfile.pdf')

QED

LegoStormtroopr · 2017-08-08T07:22:47Z

@doronhorwitz You are a genius - that little recipe needs to be included in the WeasyPrint documentation, being unable to have tables of contents was one of the reasons I wasn't using this library

benjaoming · 2017-08-08T09:26:49Z

@doronhorwitz how does this approach handle page numbers, say when the table of content is 1, 2, 3 pages long? Can it break the table into several pages?

LegoStormtroopr · 2017-08-08T09:56:42Z

From what I can tell from the code it's pretty boilerplate, leaving the actual construction of the "Table of Contents" up to whoever is using it. In this case, if it is HTML there are no breaks, so it's a big run-on blob of text.

Changing the outline_str like this:

    outline_str += ('<div>%s%d. %s <span style="float:right">(page %d)</span></div>' % (
        ' ' * indent, i, label.lstrip('0123456789. '), page))

The above gets page numbers floated on the right (like most tables of contents), the rest of the styling is left as an exercise for the reader, but it would likely be done in a template rather than in code and is indicative of how it would work.

But, the approach I am taking is this:

# TITLE_PAGES and MAIN_DOCUMENT are placeholder names for your own HTML strings
title_pages = HTML(string=TITLE_PAGES).render()
table_of_contents_document = HTML(string=table_of_contents_string).render()

# Number the pages of the main document (e.g. using CSS counters)
document = HTML(string=MAIN_DOCUMENT).render()

# Insert every title and table-of-contents page before the main document
document.pages[0:0] = title_pages.pages + table_of_contents_document.pages

document.write_pdf(target='myfile.pdf')

It's important to note that the "main part of the document" will have its pages numbered normally, with the title pages and table of contents outside that main flow (much like a regular document anyway), so regardless of how big the table of contents gets, its flow doesn't mess up the page count of the "actual document".

mb21 · 2017-09-18T14:44:04Z

From what I gather in this issue, the problem is that you'd have to make a second pass to insert the page numbers in the TOC after the rest of the document is laid out. Could we then make the simplified assumption that the page numbers are absolutely positioned (or similar) and thus don't affect the position of the following boxes so we wouldn't have to trigger a reflow? It's not perfect, but it would be very useful... so we wouldn't have to make another layout pass, but simply write some code to insert the (absolutely-positioned) page numbers after all the other layouting is done.

btw, here is some HTML/CSS to demonstrate how target-counter(attr(href), page) would work to create a TOC.

Tontyna · 2018-02-08T00:03:24Z

After diving into WeasyPrint's code for several days I finally brought target-counter() into life. It resolves targets within the limits of current WeasyPrint:

1. No counters in @page context except the hard-coded page and pages; see also "counter-increment in '@page'?" #289.
2. page and pages aren't available in document context:
```
p::before {
   content: "this paragraph is on page " counter(page);
}
```
always yields 0 like any other undefined counter.
3. Exchange of counter values only via string-set in document elements and string() in @page-context CSS.

Conclusion:
The target-counter() doesn't solve the TOC-with-page-numbers problem.
Unless somebody redesigns WeasyPrint to make at least N°2 of the above list possible.

At the moment my implementation of target-counter is still too debuggy to make a pull-request -- lots of comments and debug-statements that helped me understand how WeasyPrint works, oh my! Python is not my favourite language! But in the next days I'll clean up my code...

liZe · 2018-08-07T14:15:07Z

@Tontyna Fixing #652 also fixes this issue, doesn't it?

Of course, the original problem is not really solved, as we can't create a TOC in pure CSS. As there's nothing in the spec allowing such a feature, we may just close this issue and add @doronhorwitz's recipe to the documentation.

Tontyna · 2018-08-07T22:08:50Z

You're right. A TOC requires a script, and @doronhorwitz's is a good starting point, although the page numbers won't be the right ones when counter-increment/-reset functions are used in the document. And it requires neither target-* nor #652 at all.

With #652 available I'd automate my TOCs with a script that extracts the headings / bookmark-labels and injects a html-snippet like the <ol class="toc"> in the comment.

liZe · 2018-08-08T12:00:00Z

> With #652 available I'd automate my TOCs with a script that extracts the headings / bookmark-labels and injects a html-snippet like the <ol class="toc"> in the comment.

Indeed.

liZe · 2018-10-11T15:54:39Z

I'm closing this issue, as there's nothing more we can do here according to the current CSS specifications. The HTML template engine has to add empty links, see the report sample as an example.
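As a sketch of what "adding empty links" could look like when the template step is done in Python: the function below emits one empty `<a>` per heading id, and the stylesheet fills in the title and page number at layout time. `toc_with_empty_links` and `TOC_CSS` are illustrative names, and the selectors are assumptions modelled on the general `target-text()` / `target-counter()` syntax rather than copied from the report sample.

```python
from html import escape

# CSS that fills each empty link with the target's text and page number
TOC_CSS = """
.toc a::before {
  content: target-text(attr(href));
}
.toc a::after {
  content: target-counter(attr(href), page);
  float: right;
}
"""


def toc_with_empty_links(heading_ids):
    """Return a <ul> with one empty link per heading id."""
    items = [
        '<li><a href="#%s"></a></li>' % escape(heading_id)
        for heading_id in heading_ids
    ]
    return '<ul class="toc">\n%s\n</ul>' % "\n".join(items)


print(toc_with_empty_links(["intro", "usage"]))
```

The ids passed in must match `id` attributes on the headings in the same document, otherwise the targets cannot be resolved.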

sudarshang · 2018-10-17T11:15:34Z

@liZe could you please link to the code that generates the pdf in the report sample. From #652 it is not clear if @doronhorwitz 's script is no longer required.

liZe · 2018-10-17T11:58:43Z

> @liZe could you please link to the code that generates the pdf in the report sample.

I just called weasyprint report.html report.pdf from the report folder.

brokenhoax · 2021-10-13T14:11:59Z

This is possible and I got it working in my project. Just look at the "report" sample on WeasyPrint's site and you'll see how you can get Page Number and add it to your own, custom, table of contents:

https://weasyprint.org/

SimonSapin mentioned this issue Sep 11, 2014

Page references #218

Closed

liZe changed the title ~~Table of contents...~~ Table of contents Aug 14, 2015

liZe changed the title ~~Table of contents~~ Implement target-counter to create table of contents Aug 14, 2015

benjaoming mentioned this issue Oct 16, 2016

Maintenance Roadmap xhtml2pdf/xhtml2pdf#317

Closed

This was referenced Apr 7, 2017

Javascript generated content #454

Closed

How can I generate a table of contents? #457

Closed

Tontyna mentioned this issue Feb 11, 2018

introduce TARGET_COLLECTOR for target-counter, -counters and -text #572

Closed

Tontyna mentioned this issue May 1, 2018

Handle target-* #604

Merged

bitcoinhodler mentioned this issue May 15, 2018

Add page numbers to PDF TOC? joaofnfernandes/glacierprotocol.github.io#9

Open

Tontyna mentioned this issue Jul 8, 2018

Preparing TOC feature #652

Merged

liZe added this to the 43 milestone Oct 11, 2018

liZe closed this as completed Oct 11, 2018

liZe mentioned this issue Mar 6, 2019

Render Javascript Content #817

Closed

EugenMayer mentioned this issue May 22, 2020

ToC with links and numbered headings #1121

Closed

5 tasks

liZe mentioned this issue Jun 13, 2020

How to create index list? #1139

Closed

ritiksoni00 mentioned this issue Mar 10, 2022

print sum of a table column to next page #1592

Closed

anthonyvelazquez mentioned this issue Jan 17, 2023

MCP-2111 WeasyPrint PDF Package department-of-veterans-affairs/abd-vro#876

Closed
