Adding annotations to the PDF to link back its content to its source. #2192

yanntrividic · 2024-06-24T15:14:20Z

Hello!

Before anything else, a bit of context: this PR is a work in progress and is not ready to be merged as is. It will need more work before it can be added to the main branch, as discussed beforehand with @liZe and @grewn0uille. The idea behind this first draft is to let WeasyPrint embed metadata in the PDF for each HTMLElement with an id attribute that it converts, by adding new /Annot PDF objects that PDF readers can then access.
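To make the idea concrete, here is a minimal sketch of what one such annotation could look like when serialized. It is not the code of this PR: the `Box` class, the helper names and the pixel-based layout coordinates are all assumptions made for illustration. It only shows the two steps involved, converting a box from top-left CSS pixels to a bottom-left PDF rectangle in points, and writing a Link annotation dictionary that carries the id in its T entry.

```python
from dataclasses import dataclass

PX_TO_PT = 0.75  # 96 CSS px per inch, 72 PDF points per inch


@dataclass
class Box:
    """Hypothetical laid-out box, in CSS px from the page's top-left corner."""
    element_id: str
    x: float
    y: float
    width: float
    height: float


def pdf_rect(box, page_height_px):
    """Return [llx, lly, urx, ury] in PDF points (origin at bottom-left)."""
    left = box.x * PX_TO_PT
    right = (box.x + box.width) * PX_TO_PT
    top = (page_height_px - box.y) * PX_TO_PT
    bottom = (page_height_px - box.y - box.height) * PX_TO_PT
    return [left, bottom, right, top]


def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace('\\', '\\\\').replace('(', '\\(').replace(')', '\\)')


def annotation_dict(box, page_height_px):
    """Serialize a Link annotation whose /T entry holds the element id."""
    rect = ' '.join(f'{value:g}' for value in pdf_rect(box, page_height_px))
    return (
        '<< /Type /Annot /Subtype /Link '
        f'/Rect [{rect}] '
        f'/T ({escape_pdf_string(box.element_id)}) >>'
    )


# Usage: a 400x40 px box placed 100 px from the left, 200 px from the top.
print(annotation_dict(Box('intro', 100, 200, 400, 40), page_height_px=1000))
```

A real implementation would go through WeasyPrint's own PDF object layer rather than string formatting; the sketch only makes the shape of the data visible.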

Here is what it lets me do so far:

On the left, you have a webpage; and on the right, you have the PDF produced by this fork of WeasyPrint, previewed with PDF.js. A few event listeners were added to bidirectionally "synchronize" the two visualisations. This is just a proof-of-concept, but from there we basically have what we need to build powerful interfaces that take into account the content of the PDF as semantic data that can be linked back to its source.

We talked about adding a PDF variant for debugging that would be enabled through an option like --pdf-variant debug. Nothing has been done in this direction yet: the code I propose here is simply "hardcoded" into the default behaviour of WeasyPrint. I expect it will also need some cleanup, as I'm not sure I understood the spec correctly.

Anyway, I'd be really interested in working with you on this and going in a direction that suits the philosophy of the project. If you feel like I could be of help, please share your thoughts here so that we can discuss what would be the best way to proceed, and how I could contribute further!

I can also share the code of the interface I'm building on request, even though it is not ready to be made fully public yet, so don't hesitate to ask :)

Thanks for the great job!

liZe · 2024-08-03T12:01:03Z

Hi @yanntrividic!

I’ve just pushed a debug branch that provides a --pdf-variant=debug option. The result is a bit different because I’ve changed the way it works, but I think that the result includes enough data to work. Tell me if anything’s wrong or missing!

yanntrividic · 2024-08-30T13:59:34Z

Hello @liZe!

Summer has done its work, and I am finally in a position to have a look at this again. Thanks for taking the time to turn this seed of an idea into actual code :) It's really nice to study how you integrated it properly with the rest of the code base.

I was able to integrate your branch into my app. I had to change a few lines of my app's code to match your new logic, but that went smoothly. I only had to add one line to your proposal. You're right that the result includes enough data to work with, but that would only be sufficient if we were building the PDF renderer ourselves. In my case, I use PDF.js, and PDF.js needs a Dest key to actually render the data in the HTML content; otherwise it is simply not there, and the metadata associated with the T key does not appear in the HTML code.

To my understanding, the easiest way to pass the id attribute to PDF.js is to turn the annotation into an anchor. Otherwise it is just ignored. Maybe there is another way that I don't know about? We face similar problems with other renderers such as PDFium.
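As a rough illustration of the one-line change described above, the sketch below builds the same kind of annotation dictionary with and without a Dest entry. The function names and the choice of reusing the id as the destination string are assumptions for the example, not the code of the debug branch, and whether a given viewer also needs a matching named destination is not covered here.

```python
def pdf_string(text):
    """Return text as a PDF literal string, escaping special characters."""
    escaped = text.replace('\\', '\\\\').replace('(', '\\(').replace(')', '\\)')
    return f'({escaped})'


def debug_annotation(element_id, rect, with_dest=True):
    """Serialize a Link annotation; optionally add /Dest carrying the id."""
    entries = {
        '/Type': '/Annot',
        '/Subtype': '/Link',
        '/Rect': '[' + ' '.join(f'{value:g}' for value in rect) + ']',
        '/T': pdf_string(element_id),
    }
    if with_dest:
        # Assumption: the id is reused as the destination string, so that a
        # viewer treats the annotation as a link pointing at that name.
        entries['/Dest'] = pdf_string(element_id)
    body = ' '.join(f'{key} {value}' for key, value in entries.items())
    return f'<< {body} >>'


# Usage: compare the two variants for the same element.
print(debug_annotation('intro', [75, 570, 375, 600], with_dest=False))
print(debug_annotation('intro', [75, 570, 375, 600]))
```

The trade-off is the one raised in this comment: the T entry alone keeps the annotation purely descriptive, while adding Dest makes it behave like a link in viewers that only surface link targets.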

When I read through the standard, I don't see many solutions. It's possible to embed an action in an annotation, and that might lead to a solution, but in the end we would face the same interoperability issue.

Any thoughts on this? :)

yanntrividic · 2024-08-30T14:05:06Z

On another note, I have a problem with the annotations your code generates when it encounters a col element that spans several pages. The rectangles of all the col elements take the shape and position of the col element on the last page, even though their shapes and positions differ from page to page.

Here is an example on one page (on the last page, the rectangle frames the element perfectly):

Yann Trividic added 2 commits June 24, 2024 16:01

7ecba8c WeasyPrint now produces a LinkAnnotation for each HTMLElement with an id attribute.

8b259fa ruff checks passed!

yanntrividic added 2 commits August 30, 2024 15:03

96df4fe @liZe's code has been integrated into the PR

5c7a6dd Dest key added to the debug annotations
