How to remove the image-with-text from the PDF #1418
Replies: 3 comments
-
This is something I'm looking for as well. Essentially, I need a way to remove the portion of the background image where text was recognized.
-
You can use Ghostscript to regenerate the PDF, suppressing images:
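A minimal sketch of such an invocation, assuming Ghostscript's `pdfwrite` device and its `-dFILTERIMAGE` switch (present in recent Ghostscript releases). It is wrapped in Python here for convenience. The file names and the helpers `build_gs_command` and `strip_images` are placeholders, not part of OCRmyPDF or Ghostscript:

```python
import shutil
import subprocess


def build_gs_command(src, dst, gs="gs"):
    """Build a Ghostscript command that rewrites src to dst without images.

    Assumption: the installed Ghostscript supports -dFILTERIMAGE.
    """
    return [
        gs,
        "-o", dst,            # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",  # regenerate the document as a PDF
        "-dFILTERIMAGE",      # drop every raster image from the output
        src,
    ]


def strip_images(src, dst):
    """Run Ghostscript; raises if the gs executable is missing or fails."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript ('gs') was not found on PATH")
    subprocess.run(build_gs_command(src, dst), check=True)


if __name__ == "__main__":
    # Print the equivalent shell command instead of running it.
    print(" ".join(build_gs_command("input.pdf", "output.pdf")))
```

Calling `strip_images("input.pdf", "output.pdf")` would run the command. Because the filter applies to the whole document, pictures that should be kept are dropped along with the scanned page images.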
As you say, this removes all images. Correlating the image that produced OCR text with an image in the document is difficult: we render all content on each page as a single image and send it for OCR. Intelligently removing the text from that image is even more difficult. I expect it would be easier to use a commercial OCR product that can reconstruct the document as, say, a Word document, and then perform the translation there.
-
Hi,
I'm trying to OCR an old PDF and OCRmyPDF is actually doing a great job.
But the next step in my workflow would be to use Google Translate to translate it from English to Dutch. The result looks like this:
The processed image text from the original PDF is not removed, which makes sense (how would Google know?).
Is there an option in OCRmyPDF to remove from the PDF the image-with-text that produced the OCR content? I do not want to remove all images; the PDF also contains pictures that should be kept.
Regards!