How to remove the image-with-text from the PDF #1418

SurinameClubcard · 2024-09-08T12:48:40Z


Hi,

I'm trying to OCR an old PDF and OCRmyPDF is actually doing a great job.

But the next step in my workflow is to use Google Translate to translate it from English to Dutch. The result looks like this:

The processed image text from the original PDF is not removed, which makes sense (how would Google know?).

Is there an option in OCRmyPDF to remove the image-with-text that produced the OCR content from the PDF? I do not want to remove all images; the PDF also contains pictures that should be kept.

Regards!

0xE1 · 2024-11-02T13:48:25Z


This is something I'm looking for as well. Essentially, I need a way to remove the portion of the background image where text was recognized.


jbarlow83 (Maintainer) · 2024-11-03T22:49:32Z

You can use Ghostscript to regenerate the PDF, suppressing images:

gs -q -sDEVICE=pdfwrite -dFILTERIMAGE -o out.pdf in.pdf

As you say, this removes all images.
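
For anyone who wants to script this, here is a minimal sketch of a Python wrapper around that Ghostscript call. The function names `build_gs_command` and `strip_images` are hypothetical, and `-sDEVICE=pdfwrite` is included on the assumption that the output should be written as a PDF. Like the command itself, it drops every image, so it does not solve the "keep the pictures" part of the question.

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_command(src: Path, dst: Path, gs: str = "gs") -> list[str]:
    """Return the Ghostscript argument list that rewrites src without images."""
    return [
        gs,
        "-q",                 # suppress startup messages
        "-sDEVICE=pdfwrite",  # assumption: write the result as a PDF
        "-dFILTERIMAGE",      # drop every raster image, text-bearing or not
        "-o", str(dst),       # output file
        str(src),
    ]


def strip_images(src: Path, dst: Path) -> None:
    """Run Ghostscript; raises CalledProcessError if it exits non-zero."""
    if shutil.which("gs") is None:
        raise FileNotFoundError("Ghostscript ('gs') not found on PATH")
    subprocess.run(build_gs_command(src, dst), check=True)
```

Calling `strip_images(Path("in.pdf"), Path("out.pdf"))` would then be equivalent to running the command by hand; keeping the argument list in its own function makes it easy to inspect before anything is executed.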

Correlating the image that produced OCR text to an image in a document is difficult -- we render all content on each page as a whole image and send it for OCR. Intelligently removing the text from that image is even more difficult. I expect it would be easier to use some sort of commercial OCR that can reconstruct the document as, say, a Word document, and then perform the translation there.
