Adding annotations to the PDF to link back its content to its source. #2192

Open
wants to merge 4 commits into
base: main

yanntrividic

Hello!

Before anything, a bit of context: this PR is a work in progress and is not ready to be merged as is. It will need more work before it can be added to the main branch, as discussed beforehand with @liZe and @grewn0uille. The idea behind this first draft is to let WeasyPrint embed metadata in the PDF for each HTML element with an id attribute that it converts, by adding new /Annot PDF objects that PDF readers can then access.
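To make the idea concrete, here is a minimal sketch of what one such annotation could look like when written out in PDF syntax. It is not the code from this PR: the helper name `annotation_for_element`, the `/Square` subtype and the use of the `/T` entry to carry the id are assumptions made for illustration only.

```python
def annotation_for_element(element_id, rect):
    """Return PDF dictionary syntax for an annotation carrying an element id.

    Hypothetical helper, not WeasyPrint's API. `rect` is (x1, y1, x2, y2)
    in PDF user-space units.
    """
    x1, y1, x2, y2 = rect
    # Escape the characters that are special inside PDF literal strings.
    escaped = (
        element_id
        .replace('\\', '\\\\')
        .replace('(', '\\(')
        .replace(')', '\\)'))
    # Assumption: a /Square annotation with the id stored in /T.
    # ':g' keeps the numbers short (at most 6 significant digits).
    return (
        '<< /Type /Annot /Subtype /Square '
        f'/Rect [{x1:g} {y1:g} {x2:g} {y2:g}] '
        f'/T ({escaped}) >>')


print(annotation_for_element('intro', (10, 20.5, 110, 40)))
```

A real implementation would also have to register the object in the page's `/Annots` array, which this sketch leaves out.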

What it allowed me to do for now is this:

[Image: ezgif-6-1a414276b9]

On the left, you have a webpage; and on the right, you have the PDF produced by this fork of WeasyPrint, previewed with PDF.js. A few event listeners were added to bidirectionally "synchronize" the two visualisations. This is just a proof-of-concept, but from there we basically have what we need to build powerful interfaces that take into account the content of the PDF as semantic data that can be linked back to its source.

We talked about adding a PDF variant for debugging, accessible through an option like --pdf-variant debug. Nothing has been done in this direction yet: the code I propose here is simply hardcoded into WeasyPrint's default behaviour. It will probably need some cleanup too, as I'm not sure I understood the spec correctly.

Anyway, I'd be really interested in working with you on this and going in a direction that suits the philosophy of the project. If you feel like I could be of help, please share your thoughts here so that we can discuss what would be the best way to proceed, and how I could contribute further!

I can also share the code of the interface I'm building on request, even though it is not ready to be made fully public yet, so don't hesitate to ask :)

Thanks for the great job!

@liZe
Member

liZe commented Aug 3, 2024

Hi @yanntrividic!

I’ve just pushed a debug branch that provides a --pdf-variant=debug option. The result is a bit different because I’ve changed the way it works, but I think that the result includes enough data to work. Tell me if anything’s wrong or missing!

@yanntrividic
Author

Hello @liZe!

Summer has done its work, and I am finally in a position to look at this again. Thanks for taking the time to turn this seed of an idea into actual code :) It's really nice to study how you integrated it properly with the rest of the code base.

I was able to integrate your branch into my app. I had to change a few lines of my app's code to make it compatible with your new logic, but that was easy. I also had to add one line to your proposal. You're right that the result includes enough data to work with, but that would only be sufficient if we were building the PDF renderer ourselves. In my case, I use PDF.js, and PDF.js needs a Dest key to actually render the data in the HTML content; without it, the metadata associated with the T key is not present in the HTML code.

To my understanding, the easiest way to pass the id attribute to PDF.js is to turn the annotation into an anchor; otherwise it is just ignored. Maybe there is another way that I don't know about? We face similar problems with other renderers such as PDFium.
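As a rough illustration of the "one line" change described above, the sketch below models an annotation as a plain Python dict and adds a Dest entry that repeats the id already stored under T. The dict layout, the `Link` subtype and both helper names are assumptions; the actual branch builds real PDF objects and may differ.

```python
def add_dest(annotation, element_id):
    """Return a copy of `annotation` with a Dest entry set to `element_id`."""
    anchored = dict(annotation)
    anchored['Dest'] = element_id
    return anchored


def labels_without_dest(annotations):
    """List the T labels of annotations that carry no Dest entry."""
    return [a['T'] for a in annotations if 'T' in a and 'Dest' not in a]


# Hypothetical annotation, modelled as a plain dict for illustration.
debug_annotation = {
    'Type': 'Annot',
    'Subtype': 'Link',  # assumption: the thread does not name the subtype
    'Rect': [72, 700, 300, 720],
    'T': 'section-2',
}
anchored = add_dest(debug_annotation, debug_annotation['T'])
```

`labels_without_dest` is only a convenience for spotting annotations that, going by the behaviour reported here, a viewer such as PDF.js would not expose in its HTML output.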

When I read through the standard, I don't see many solutions. It's possible to embed an action in an annotation, which might lead to a solution, but in the end we would face the same interoperability issue.

Any thoughts on this? :)

@yanntrividic
Author

On another note, I have a problem with the annotations your code generates when it encounters a col element that spans several pages. The rectangles of all the col elements take the shape and position of the col element on the last page, even though they should have different shapes and positions.

Here is an example on one page (on the last page, the shape frames the element perfectly):

[Image: Screenshot of the bug]
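One way to look for this symptom in a generated file is to list the ids whose rectangle is exactly the same on every page they appear on. The sketch below works on plain tuples and assumes the annotations have already been extracted from the PDF by some other means; the function name and the data layout are hypothetical, and it says nothing about the cause of the bug.

```python
from collections import defaultdict


def ids_with_repeated_rect(annotations):
    """Return ids whose rectangle is identical on every page they appear on.

    `annotations` is an iterable of (page_index, element_id, rect) tuples,
    where rect is (x1, y1, x2, y2).
    """
    rects_by_id = defaultdict(dict)
    for page_index, element_id, rect in annotations:
        rects_by_id[element_id][page_index] = tuple(rect)

    suspicious = []
    for element_id, rects_by_page in rects_by_id.items():
        # Only elements present on several pages can show the symptom.
        if len(rects_by_page) > 1 and len(set(rects_by_page.values())) == 1:
            suspicious.append(element_id)
    return sorted(suspicious)


# Made-up data: 'col-a' has the same rectangle on both pages, 'col-b' does not.
sample = [
    (0, 'col-a', (50, 100, 150, 700)),
    (1, 'col-a', (50, 100, 150, 700)),
    (0, 'col-b', (160, 300, 260, 700)),
    (1, 'col-b', (160, 100, 260, 450)),
    (0, 'title', (50, 720, 500, 760)),
]
print(ids_with_repeated_rect(sample))
```

A match is only a hint: an element can legitimately occupy the same area on consecutive pages, so flagged ids still need a visual check.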
