[Feature request] Run OCR on images in PDFs to extract text #20

Open
gwillcox-r7 opened this issue Mar 15, 2023 · 4 comments
Comments

@gwillcox-r7

Is your feature request related to a problem? Please describe.
Would be nice to have the ability to extract text from images embedded in PDFs.

Describe the solution you'd like
The ability to extract text from images in PDFs, for example when the PDF is a slide deck made of images. This could be configured with a toggle or a list so that it isn't run by default, since doing both text extraction and OCR will likely be computationally expensive.

Describe alternatives you've considered
https://evermap.com/Tutorial_ABM_OCR.asp describes a way to OCR documents with Adobe Acrobat. I believe you can also do this with tools like Readiris, which supports OCR in multiple languages.

Additional context
Some PDFs may contain diagrams or other images with text in them that can be useful to extract. We already have OCR support for images so it may be an idea to extract the images from the PDF and run OCR on them, then combine this with the existing text extraction results.
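
A minimal sketch of the "combine" step described above, assuming the text layer and the OCR output are already available per page. All names here (`PdfOcrOptions`, `ocrEmbeddedImages`, `mergePdfText`) are hypothetical and not taken from the plugin's code; the option defaults to off, as the request suggests.

```typescript
// Hypothetical option shape; not part of the existing codebase.
interface PdfOcrOptions {
  /** Off by default, because OCR on top of text extraction is expensive. */
  ocrEmbeddedImages: boolean;
}

const DEFAULT_OPTIONS: PdfOcrOptions = { ocrEmbeddedImages: false };

/**
 * Merge the text layer of each page with OCR text recovered from the
 * images on that page.
 *
 * pageTexts[i]        - text extracted from page i by the existing extractor
 * imageTextsByPage[i] - OCR output for each image found on page i
 */
function mergePdfText(
  pageTexts: string[],
  imageTextsByPage: string[][],
  options: PdfOcrOptions = DEFAULT_OPTIONS
): string {
  const pages = pageTexts.map((text, i) => {
    const parts: string[] = [text.trim()];
    if (options.ocrEmbeddedImages) {
      for (const ocr of imageTextsByPage[i] ?? []) {
        const cleaned = ocr.trim();
        // Skip empty OCR results and text the text layer already contains.
        if (cleaned.length > 0 && text.indexOf(cleaned) === -1) {
          parts.push(cleaned);
        }
      }
    }
    return parts.filter((p) => p.length > 0).join("\n");
  });
  // Drop pages that produced no text at all.
  return pages.filter((p) => p.length > 0).join("\n\n");
}
```

With the option enabled, a page that has no text layer (such as an image-only slide) still contributes its OCR text. With the default options the result is the same as plain text extraction.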

@gwillcox-r7 changed the title from "[Feature request] Extract data from Images in PDFs" to "[Feature request] Run OCR on images in PDFs to extract text" on Mar 15, 2023

@khesed

khesed commented Aug 2, 2023

Another alternative could be converting the PDF into an image and running OCR on that.

@scambier
Owner

scambier commented Aug 2, 2023

This is the ideal solution, as it would greatly improve PDF text extraction. The problem is that it looks really hard to do in a pure JS/WASM context without external dependencies. The only robust solution I've found is pdf.js, but it scales poorly and eats all the RAM after a few files. It's probably worth trying again, though.

@khesed

khesed commented Aug 5, 2023

> This is the ideal solution, as it would greatly improve PDF text extraction. The problem is that it looks really hard to do in a pure JS/WASM context without external dependencies. The only robust solution I've found is pdf.js, but it scales poorly and eats all the RAM after a few files. It's probably worth trying again, though.

I played around with PDF.js and, in my opinion, it works really well. It might not scale like a dream, but operations like this may not be able to scale well anyway. To limit memory use, processing the smallest files in the vault first and handling files one at a time above a certain size should be satisfactory (I imagine).
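
To make the scheduling idea concrete, here is a rough sketch of how the work could be planned: sort by size, group small files into batches, and give every file above a threshold a batch of its own. `VaultFile`, `planPdfBatches`, and both parameters are hypothetical names chosen for illustration, and the threshold values would need tuning against real memory usage.

```typescript
// Hypothetical file descriptor; only the size matters for planning.
interface VaultFile {
  path: string;
  sizeBytes: number;
}

/**
 * Plan the processing order: smallest files first, grouped into batches of
 * up to `smallBatchSize`; any file at or above `largeFileBytes` is placed
 * in a batch of its own so it is processed one at a time.
 */
function planPdfBatches(
  files: VaultFile[],
  largeFileBytes: number,
  smallBatchSize: number
): VaultFile[][] {
  if (smallBatchSize < 1) {
    throw new RangeError("smallBatchSize must be at least 1");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = files.slice().sort((a, b) => a.sizeBytes - b.sizeBytes);
  const batches: VaultFile[][] = [];
  let current: VaultFile[] = [];

  for (const file of sorted) {
    if (file.sizeBytes >= largeFileBytes) {
      // Flush any pending small files, then isolate the large file.
      if (current.length > 0) {
        batches.push(current);
        current = [];
      }
      batches.push([file]);
    } else {
      current.push(file);
      if (current.length === smallBatchSize) {
        batches.push(current);
        current = [];
      }
    }
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}
```

A caller would then process the batches sequentially and release any per-document resources before starting the next one, which keeps peak memory bounded by the largest single batch.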

@khesed

khesed commented Aug 6, 2023

Integrating ImageMagick could solve this:
#21 (comment)


3 participants