Implement target-counter to create table of contents #23

Closed
sander76 opened this issue Dec 13, 2012 · 45 comments
Labels
feature New feature that should be supported
Milestone

Comments

@sander76

I'd like to automatically create a table of contents in my document.
I am thinking of using a small piece of Javascript....
But I have a feeling WeasyPrint doesn't process javascript.. ?
Or are there other ways of doing this ?

Sander.

@SimonSapin
Member

Indeed, WeasyPrint completely ignores JavaScript.

If you’re generating HTML from something else, maybe you can generate a table of contents at the same time. For example, docutils can do this with source files in reStructuredText format.

Otherwise, you could parse an HTML document with lxml, manipulate it in Python with the lxml API, and pass the lxml tree to WeasyPrint with HTML(tree=something). You could consider Python as WeasyPrint’s javascript ;)
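
A minimal sketch of the "build the table of contents in Python before rendering" idea. To stay dependency-free it swaps the standard library's `html.parser` in for lxml; the names `HeadingCollector` and `build_toc_html` are made up for illustration, and, as noted in the comment, the resulting list carries no page numbers.

```python
from html import escape
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect (level, id, text) for every heading that has an id attribute."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # [level, id, text parts] while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            anchor = dict(attrs).get("id")
            if anchor:
                self._current = [int(tag[1]), anchor, []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._current is not None:
            level, anchor, parts = self._current
            self.headings.append((level, anchor, "".join(parts).strip()))
            self._current = None


def build_toc_html(source):
    """Return a flat <ul> of links, one per heading found in `source`."""
    collector = HeadingCollector()
    collector.feed(source)
    collector.close()
    items = "".join(
        '<li class="toc-level-%d"><a href="#%s">%s</a></li>'
        % (level, escape(anchor, quote=True), escape(text))
        for level, anchor, text in collector.headings
    )
    return '<ul class="toc">%s</ul>' % items


source = '<h1 id="intro">Intro</h1><p>Text</p><h2 id="scope">Scope &amp; goals</h2>'
toc = build_toc_html(source)
```

The string `toc + source` could then be handed to WeasyPrint, for example with `HTML(string=toc + source)`.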

Unfortunately in any case, you won’t get page numbers in this table of contents. That would require something in CSS like target-counter() which really needs to be in WeasyPrint’s layout engine.

@sander76
Author

Okay thanks.
About target-counter(): it isn't implemented in the layout engine yet, is it? (I'm trying it out, but it throws errors.)

@SimonSapin
Member

No, it’s not implemented at all. We’re thinking about how to do it, but it’s not obvious at all. It’s also somewhat low-priority.

When you say it throws errors, is it a message logged on stderr or do you get a Python exception with a full traceback? The latter would be a bug.

@sander76
Author

I guess there are no other options to create a TOC with page reference then ?

@SimonSapin
Member

No, it’s not possible without a lot of work in WeasyPrint itself, or without very dirty hacks.

I really recommend not to, but if you want to go with the dirty hack look at the Document.make_bookmark_tree() method. The problem is that by the time you can use it, the whole layout is done so it’s very contrived to add content at this point.

API: http://weasyprint.org/docs/api/#weasyprint.document.Document.make_bookmark_tree
Usage example: http://weasyprint.org/docs/tutorial/#individual-pages-meta-data-other-output-formats

@ksaylor

ksaylor commented Feb 7, 2013

WeasyPrint is almost the perfect solution for me - except the fact that it doesn't support target-counter yet :-( I do use target-counter for getting the page numbers as you have suggested.

@SimonSapin
Member

About target-counter() in CSS:

  • Adding it for counters other than page or pages shouldn’t be too hard.
  • page is still straightforward if the target is earlier in the document: the page number of the target is known when we need it.
  • Other cases are much more tricky: the counter value is not known by the time we need it. This would require some heuristics or iterative algorithm. Unfortunately this is the common case, with eg. a table of contents at the beginning of a document. It is for similar reasons that you need to run LaTeX twice (or sometimes more) for it to get cross-references right.
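
The "iterative algorithm" mentioned in the last bullet can be pictured with a toy model. This is only a sketch under invented assumptions, not WeasyPrint's layout code: sections have fixed page lengths, and a table-of-contents entry takes two lines instead of one once its page number reaches two digits, so the size of the table of contents depends on the very numbers it prints. The function names and the wrapping rule are hypothetical.

```python
import math


def layout(section_lengths, toc_pages):
    """Return the 1-based start page of each section when the TOC uses toc_pages pages."""
    starts, page = [], toc_pages + 1
    for length in section_lengths:
        starts.append(page)
        page += length
    return starts


def resolve_toc(section_lengths, lines_per_page, max_passes=10):
    """Re-run the toy layout until the TOC length stops changing (as with repeated LaTeX runs)."""
    toc_pages = 1  # initial guess
    for passes in range(1, max_passes + 1):
        starts = layout(section_lengths, toc_pages)
        # Toy rule: an entry whose page number has two or more digits wraps onto a second line.
        lines = sum(1 if start < 10 else 2 for start in starts)
        needed = max(1, math.ceil(lines / lines_per_page))
        if needed == toc_pages:
            return starts, toc_pages, passes
        toc_pages = needed
    raise RuntimeError("layout did not stabilise")


starts, toc_pages, passes = resolve_toc([4, 4, 4], lines_per_page=3)
```

In this example the first pass guesses a one-page table of contents, discovers that it needs two pages, and the second pass confirms the result. The `max_passes` guard is there because nothing guarantees that such a loop settles in general.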

@gnapse

gnapse commented Jun 19, 2014

Hey there. I was wondering if this is still not supported? I'm considering migrating from wkhtmltopdf to WeasyPrint because of their lousy support for page-break-after: avoid, but losing the ability to generate a table of contents with page numbers would be a big deal.

@SimonSapin
Member

@gnapse Indeed, not much progress on this front since the discussion above. In addition to the trickiness described, it’s also a matter of someone doing the work.

@liZe
Member

liZe commented Jun 20, 2014

I've tried for fun to implement target-counter last week-end, and I can confirm that it really needs a lot of work, even with very dirty hacks ;).

@bitdivine

Are there some simple cases that can be attacked first? For example, the table of contents typically appears at the beginning of a document, so it appears to be vulnerable to the LaTeX instability, however the pages at the beginning of a document are frequently numbered with Roman numerals, so the fact that the table of contents is at the beginning doesn't actually matter if it indexes only the main body of a document. I haven't looked at the code in any depth yet but I guess I will soon! Realistically I don't have much time so if it's that complicated I'll fail but I can offer a crate of beer to anyone who gets this in :-) http://comicsagogo.files.wordpress.com/2011/10/asterix-and-british-food2.jpg

@SimonSapin
Member

@bitdivine The problem is with finding out the page number for elements that are later in the document (which haven’t been processed yet.) This is unfortunately the common case for table of contents at the beginning of a document.

@bitdivine

How about something like this: Push the header, including the table of contents, to the end of the file. Now page numbers aren't a problem. Finally in the function "layout_document" roll the end back to the beginning: pages = pages[index_of_head:]+pages[:index_of_head]? However what is a nice way of parameterising the shuffle? Can we use omega notation for the page numbers? I can try this when I get home.

@bitdivine

Here is a proof of concept. It reverses the pages, and the first page (in my toy document) is page 14 and labelled as such:

bitdivine@dc99c22

Can we replace that reverse with sort(...) of some kind? Is a good way of representing this in HTML to have divs that enclose booklets and order first by booklet, then by page number?

@bitdivine

Oops - fix:

-    rendered_pages = rendered_pages.reverse()
+    rendered_pages.reverse()
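
For readers following along, a small self-contained illustration of the two list operations discussed here: the rotation `pages[index_of_head:] + pages[:index_of_head]` and the reason for the fix above. Plain strings stand in for page objects, and `rotate` is a hypothetical helper name.

```python
def rotate(pages, index_of_head):
    """Move pages[index_of_head:] to the front, keeping the relative order."""
    return pages[index_of_head:] + pages[:index_of_head]


# Document laid out with the body first and the table of contents last.
laid_out = ["body-1", "body-2", "body-3", "toc-1", "toc-2"]
final = rotate(laid_out, 3)

# list.reverse() reverses in place and returns None, so assigning its
# result (as in the line removed by the fix) loses the list.
result = laid_out.copy().reverse()
```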

@SimonSapin
Member

If you call .render() on an HTML object, you get a Document object whose .pages attribute is a list of Page objects. You can shuffle this list all you want before calling e.g. .write_pdf() on the document, without modifying the layout code. I suppose you could use the pages’ anchors attribute to check which page contains a given element, if you give it an ID: e.g. <h2 id=toc> and if 'toc' in page.anchors: ...

See details in the documentation: http://weasyprint.org/docs/api/
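
A rough sketch of the page shuffle described above, assuming only what the comment states: a rendered document exposes a `pages` list and each page has an `anchors` mapping keyed by element ID. The helper names are hypothetical, and `SimpleNamespace` objects stand in for real pages so the snippet runs without WeasyPrint installed.

```python
from types import SimpleNamespace


def index_of_anchor(pages, anchor):
    """Return the index of the first page whose anchors contain `anchor`."""
    for index, page in enumerate(pages):
        if anchor in page.anchors:
            return index
    raise ValueError("no page contains anchor %r" % anchor)


def move_tail_to_front(document, anchor="toc"):
    """Move every page from the one holding `anchor` onwards to the front, in place."""
    head = index_of_anchor(document.pages, anchor)
    document.pages[:] = document.pages[head:] + document.pages[:head]
    return document


# Stand-in for a rendered document whose table of contents was laid out last.
document = SimpleNamespace(pages=[
    SimpleNamespace(name="body-1", anchors={"intro": (0, 0)}),
    SimpleNamespace(name="body-2", anchors={}),
    SimpleNamespace(name="toc-1", anchors={"toc": (0, 0)}),
    SimpleNamespace(name="toc-2", anchors={}),
])
move_tail_to_front(document)
order = [page.name for page in document.pages]
```

With a real document the same helper would be applied to the result of `.render()` before calling `.write_pdf()`.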

So yeah, I suppose we could add limited support for target-counter() that only works if the target is earlier in the document, then you could have a TOC at the end and shuffle pages to put it at the beginning again. I guess you’d want something like @page :first { counter-reset: page 2 }, not sure if that works right now. Maybe named pages (#57) would be useful there as well.

This sounds reasonable. It’s just a matter of someone doing the work now. Unfortunately nobody is actively working on WeasyPrint at the moment.

@SimonSapin
Member

Right, the counter-reset: page thing is bug #93.

@mzu

mzu commented Mar 16, 2015

Hi. Are there any news on implementing support for target-counter()? Are there any alternative ways to get a ToC with page numbers working with WeasyPrint?

@SimonSapin
Member

@mzu, please refer to #23 (comment).

@eenblam

eenblam commented Aug 6, 2015

Did anything happen related to the proof of concept @bitdivine posted last year? The docs don't seem to suggest so, this question issue is still open, the relevant source doesn't look like it supports it, and @bitdivine doesn't seem to have gone any farther with it.

If not, I might hack on it a bit and report back with some toy examples.

@SimonSapin
Member

@ingcake The last paragraph of #23 (comment) is still relevant.

@eenblam

eenblam commented Aug 6, 2015

Thanks. I'll have a look at it sometime in the next week.

@bitdivine

I'm afraid I haven't followed up, and I'm unlikely to have time to do so in the near future. I can but wish you bonne chance!

@eenblam

eenblam commented Aug 9, 2015

I think I see a strategy for tackling the ToC issue as well as front matter in general. My concern is that this might fall into the aforementioned "very dirty hack" territory, and it requires fixing named pages.

The user could label specific content with id=titlepage, class=frontmatter, class=content. The document itself would also follow this order - the user would not place the table of contents at the end of the document. Then, we'd do something along these lines:

  • Prior to rendering, append a copy of the front matter to the end of the document. Leave the original in place to serve as a dummy of sorts.
  • The front matter receives page numbers via counter(frontmatter, lower-roman). This counter needs to be reset so that the copy at the back of the document has the right page numbering.
  • Call render(). The original (dummy) front matter will provide page numbers for the table of contents that lives in the copied front matter at the end of the document.
  • Remove the dummy front matter using page.anchors, and replace it with the copied front matter at the end of the document.

If the above doesn't sound too hackish, I'd be interested in working on named pages (#57). With that worked out, the above should be far easier to implement. It seems like it would have a nontrivial impact on the WeasyPrint spec, though. EDIT: Namely, the user would need to know what id's WeasyPrint will look for.

@liZe liZe changed the title Table of contents... Table of contents Aug 14, 2015
@liZe liZe changed the title Table of contents Implement target-counter to create table of contents Aug 14, 2015
@mpicard

mpicard commented May 19, 2016

bump? I really need a TOC, any update on named pages?

@liZe
Member

liZe commented May 23, 2016

@mpicard bump? I really need a patch ;)

@mpicard

mpicard commented May 23, 2016

@liZe bring me up to speed? From what I saw named pages is required as well?

@liZe
Member

liZe commented May 24, 2016

> @liZe I gave a similar disclaimer for my approach, so I'm not arguing that my proposed solution is at all ideal.

Yes, no offense!

> EDIT: The related CSS development doesn't seem like a priority for W3 at the moment. Perhaps a hack solution would be helpful in the meantime. It's possible that I can allocate time for this soon, since it's relevant to an upcoming project, but I can't make any promises today. I'm also not opposed to assisting with your (more sustainable) solution as time allows.

Cool! Adding a table of contents (tables and pages) at the end of the document is easy with only Python and lxml.

> I really recommend not to, but if you want to go with the dirty hack look at the Document.make_bookmark_tree() method. The problem is that by the time you can use it, the whole layout is done so it’s very contrived to add content at this point.

@SimonSapin is too shy to admit that his ideas are generally damn good. I really recommend not to follow his "I really recommend not to", just try to play with his awful dirty hack: it's a good way to understand how it currently works, what can be done (adding the ToC at the end) and what's impossible without big changes (getting page numbers before rendering the whole document, getting the titles without lxml, etc.)

@mpicard

mpicard commented May 25, 2016

@eenblam send me a message if you decide to dig into this, I need a fix for this as well so I would be willing to help out and discuss with you what I find.

@cliuser

cliuser commented Dec 4, 2016

Too bad you can't handle JS. Phil Schatz's css-polyfills.js shim handles a slew of CSS3 generated content. I have a demo HTML doc with autogenerated TOC, LOF, LOT, and Acronym sections (MIL-STD style). All it needs is leader() for the front matter.

@doronhorwitz

doronhorwitz commented Jul 1, 2017

I do realise that this issue had its name changed to handle the implementation of 'target-counter', however, whenever you search for information about generating a table of contents using WeasyPrint, you land up here, and the contents of this issue thread make it seem as if generating the table of contents isn't straightforward. But I'd just like to emphasize that it is straightforward, making WeasyPrint even more appealing. The 3rd comment by @SimonSapin at the beginning of the thread basically explains how to do it, but I'd just like to outline what I did so that anyone coming back here doesn't leave as disappointed as I originally left, because as I said, it is possible and straightforward (and does not use "very dirty hacks" nor is it "contrived", as @SimonSapin put it):

  1. Generate the document object of the content you're making into a PDF, using <h1>...<h6> tags to generate bookmarks, something like:
from weasyprint import HTML

html = """
<h1>A Title</h1>
<p>Some content</p>
<h2>A subtitle</h2>
<p>Some more content</p>
"""
document = HTML(string=html).render()
  2. To generate the table of contents string, use this code, slightly modified from the example on the WeasyPrint website (search for "Print the outline of the document."):
def generate_outline_str(bookmarks, indent=0):
    outline_str = ""
    for i, (label, (page, _, _), children) in enumerate(bookmarks, 1):
        outline_str += ('%s%d. %s (page %d)' % (
            ' ' * indent, i, label.lstrip('0123456789. '), page))
        outline_str += generate_outline_str(children, indent + 2)
    return outline_str
table_of_contents_string = generate_outline_str(document.make_bookmark_tree())

Note: you'll definitely have to modify this code to format the table of contents in HTML nicely.

  3. Generate a table of contents page in a new document:
table_of_contents_document = HTML(string=table_of_contents_string).render()
table_of_contents_page = table_of_contents_document.pages[0]
  4. Insert the table of contents page into the original document:
document.pages.insert(0, table_of_contents_page)
  5. Write your PDF:
document.write_pdf(target='myfile.pdf')

QED
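
Following up on the note in step 2, here is one possible way to turn the bookmark tree into nested HTML lists rather than a flat string. It is a sketch that assumes the `(label, (page, x, y), children)` tuple shape used in the recipe; other WeasyPrint releases may return a different shape. `outline_to_html` and `page_offset` are hypothetical names, and the default offset of 1 assumes the page value is a 0-based index, which is worth checking against the installed version.

```python
from html import escape


def outline_to_html(bookmarks, page_offset=1):
    """Render a bookmark tree as nested <ul> lists with a page number per entry.

    Assumes each bookmark is (label, (page, x, y), children), as in the
    recipe above. page_offset is added to every page value; raise it if
    pages inserted in front of the document should be counted as well.
    """
    if not bookmarks:
        return ""
    items = []
    for label, (page, _x, _y), children in bookmarks:
        items.append(
            '<li>%s <span class="page">%d</span>%s</li>'
            % (escape(label), page + page_offset,
               outline_to_html(children, page_offset))
        )
    return "<ul>%s</ul>" % "".join(items)


# Hand-written sample in the assumed shape; a real tree would come from
# document.make_bookmark_tree().
sample = [("A Title", (0, 0, 0), [("A subtitle", (0, 0, 50), [])])]
toc_html = outline_to_html(sample)
```

The resulting markup can be styled with ordinary CSS (for example floating `.page` to the right) before it is rendered as the table of contents document in step 3.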

@LegoStormtroopr

@doronhorwitz You are a genius - that little recipe needs to be included in the WeasyPrint documentation, being unable to have tables of contents was one of the reasons I wasn't using this library

@benjaoming

@doronhorwitz how does this approach handle page numbers, say when the table of content is 1, 2, 3 pages long? Can it break the table into several pages?

@LegoStormtroopr

From what I can tell from the code, it's pretty boilerplate, leaving the actual construction of the "Table of Contents" up to whoever is using it. In this case, if it is HTML there are no breaks, so it's one big run-on blob of text.

Changing the outline_str like this:

    outline_str += ('<div>%s%d. %s <span style="float:right">(page %d)</span></div>' % (
        ' ' * indent, i, label.lstrip('0123456789. '), page))

The above floats the page numbers to the right (as in most tables of contents). The rest of the styling is left as an exercise for the reader; it would likely be done in a template rather than in code, but this shows how it would work.

But, the approach I am taking is this:

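# Render the main document first (pages numbered e.g. using CSS counters),
# then build the table of contents from its bookmarks.
document = HTML(string=MAIN_DOCUMENT).render()
table_of_contents_string = generate_outline_str(document.make_bookmark_tree())

# TITLE_PAGE is a placeholder name for the title page's HTML string.
title_document = HTML(string=TITLE_PAGE).render()
table_of_contents_document = HTML(string=table_of_contents_string).render()

# Put the title page(s) first, then every page of the table of contents.
document.pages[0:0] = title_document.pages + table_of_contents_document.pages

document.write_pdf(target='myfile.pdf')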

It's important to note that the main part of the document will have its pages numbered normally, with the title pages and table of contents outside that main flow (much like a regular document anyway). So regardless of how big the table of contents gets, its flow doesn't disturb the page count of the actual document.

@mb21

mb21 commented Sep 18, 2017

From what I gather in this issue, the problem is that you'd have to make a second pass to insert the page numbers in the TOC after the rest of the document is laid out. Could we then make the simplified assumption that the page numbers are absolutely positioned (or similar) and thus don't affect the position of the following boxes so we wouldn't have to trigger a reflow? It's not perfect, but it would be very useful... so we wouldn't have to make another layout pass, but simply write some code to insert the (absolutely-positioned) page numbers after all the other layouting is done.

btw, here is some HTML/CSS to demonstrate how target-counter(attr(href), page) would work to create a TOC.

@Tontyna
Contributor

Tontyna commented Feb 8, 2018

After diving into WeasyPrint's code for several days I finally brought target-counter() to life. It resolves targets within the limits of current WeasyPrint:

  1. No counters in @page context except the hard-coded page and pages; see also "counter-increment in '@page'?" (#289).

  2. page and pages aren't available in document context:

    p::before {
       content: "this paragraph is on page " counter(page);
    }

    always yields 0 like any other undefined counter.

  3. Exchange of counter values only via string-set in document elements and string() in @page-content css.

Conclusion:
The target-counter() doesn't solve the TOC-with-page-numbers problem.
Unless somebody redesigns WeasyPrint to make at least item 2 of the above list possible.

At the moment my implementation of target-counter is still too debuggy to make a pull request -- lots of comments and debug statements that helped me understand how WeasyPrint works, oh my! Python is not my favourite language! But in the next few days I'll clean up my code...

@liZe
Member

liZe commented Aug 7, 2018

@Tontyna Fixing #652 also fixes this issue, doesn't it?

Of course, the original problem is not really solved, as we can't create a TOC in pure CSS. As there's nothing in the spec allowing such a feature, we may just close this issue and add @doronhorwitz's recipe to the documentation.

@Tontyna
Contributor

Tontyna commented Aug 7, 2018

You're right. A TOC requires a script, and @doronhorwitz's is a good starting point, although the page numbers won't be the right ones when counter-increment/-reset are used in the document. And it requires neither target-* nor #652 at all.

With #652 available I'd automate my TOCs with a script that extracts the headings / bookmark labels and injects an HTML snippet like the <ol class="toc"> in the comment.

@liZe
Member

liZe commented Aug 8, 2018

> With #652 available I'd automate my TOCs with a script that extracts the headings / bookmark-labels and injects a html-snippet like the <ol class="toc"> in the comment.

Indeed.

@liZe liZe added this to the 43 milestone Oct 11, 2018
@liZe
Member

liZe commented Oct 11, 2018

I'm closing this issue, as there's nothing more we can do here according to the current CSS specifications. The HTML template engine has to add empty links, see the report sample as an example.
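
As an illustration of "the template engine has to add empty links", here is a hedged sketch of generating such links in Python together with CSS that fills them in. The `toc` class name, the selectors and the `build_toc` helper are assumptions made for this example and are not taken from the report sample; it relies on `target-counter()` and `target-text()` being supported by the installed WeasyPrint version.

```python
from html import escape

# CSS that fills each empty link with the target's text and page number.
# Selectors and class name are illustrative assumptions.
TOC_CSS = """
.toc a::before {
  content: target-text(attr(href));
}
.toc a::after {
  content: target-counter(attr(href), page);
  float: right;
}
"""


def build_toc(anchor_ids):
    """Return an <ol class="toc"> with one empty link per heading ID."""
    items = "".join(
        '<li><a href="#%s"></a></li>' % escape(anchor, quote=True)
        for anchor in anchor_ids
    )
    return '<ol class="toc">%s</ol>' % items


toc = build_toc(["introduction", "results"])
```

The snippet would be placed in the HTML wherever the table of contents should appear, with `TOC_CSS` included in the document's stylesheet; the headings need matching `id` attributes.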

@liZe liZe closed this as completed Oct 11, 2018
@sudarshang

@liZe could you please link to the code that generates the pdf in the report sample. From #652 it is not clear if @doronhorwitz 's script is no longer required.

@liZe
Member

liZe commented Oct 17, 2018

> @liZe could you please link to the code that generates the pdf in the report sample.

I just called weasyprint report.html report.pdf from the report folder.

@brokenhoax

This is possible and I got it working in my project. Just look at the "report" sample on WeasyPrint's site and you'll see how you can get Page Number and add it to your own, custom, table of contents:

https://weasyprint.org/
