Describe the bug
On a file that has been OCR'ed with ocrmypdf, an annotation (text highlight) is repeated on every page of the document after saving.
Steps to reproduce
1. Run ocrmypdf on a PDF which is just a bag of images.
2. Add some annotations (text highlights).
3. Use the script [here](https://thepythoncode.com/article/redact-and-highlight-text-in-pdf-with-python) to remove the annotations via PyMuPDF.
4. Add another annotation to the resulting file.
5. Close and re-open it to find that last annotation on every page.
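The symptom in step 5 can also be checked without a viewer. The sketch below is not part of the original report: it lists the annotation xrefs on each page with PyMuPDF and flags any xref that shows up on more than one page. The helper names (`annotation_xrefs`, `shared_xrefs`, `report`) are made up for illustration. Whether the repeated highlight is one object referenced from several pages or several separate copies has not been established here, so treat the output as a diagnostic hint only.

```python
from collections import defaultdict


def annotation_xrefs(doc):
    """Return one list per page holding the xref of each annotation on it.

    `doc` is expected to behave like an open PyMuPDF document: iterating it
    yields pages, and page.annots() yields annotations with an .xref attribute.
    """
    return [[annot.xref for annot in page.annots()] for page in doc]


def shared_xrefs(per_page):
    """Map each annotation xref seen on more than one page to its 1-based pages."""
    pages_by_xref = defaultdict(list)
    for page_number, xrefs in enumerate(per_page, start=1):
        for xref in xrefs:
            if page_number not in pages_by_xref[xref]:
                pages_by_xref[xref].append(page_number)
    return {x: pages for x, pages in pages_by_xref.items() if len(pages) > 1}


def report(path):
    """Print per-page annotation counts and any xrefs shared between pages."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(path)
    try:
        per_page = annotation_xrefs(doc)
    finally:
        doc.close()
    for page_number, xrefs in enumerate(per_page, start=1):
        print(f"page {page_number}: {len(xrefs)} annotation(s), xrefs {xrefs}")
    for xref, pages in shared_xrefs(per_page).items():
        print(f"xref {xref} appears on pages {pages}")
```

Calling `report("magick_ocr_highlights.pdf")` on the file saved in step 5, and again on the file from before step 4, should show whether the extra highlights are actually stored in the PDF or only appear in the viewer.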
I realize there's more than ocrmypdf involved here, but if I take a file from, say, JSTOR that already has OCR'ed text and run steps 2-4 on it, I don't get the problem. So it seems that something ocrmypdf does to the file is causing the issue.
One more thing: if I use gs to remove the highlights on the same file via something like `gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -o output.pdf -c "/PreserveAnnotTypes [] def" -c "/ShowAnnotTypes [] def" -f input.pdf`, I don't get any weirdness either. I suspect gs is doing a bit more than the PyMuPDF script, too.
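For repeated testing, the Ghostscript call above can be wrapped in a small script. This is a minimal sketch, assuming `gs` is on the PATH. The flags are copied from the command quoted above, and the function names are made up for illustration.

```python
import subprocess


def gs_strip_annotations_command(src, dst):
    """Build the Ghostscript argument list quoted in the report."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-o", dst,
        # Empty arrays: keep no annotation types and render none.
        "-c", "/PreserveAnnotTypes [] def",
        "-c", "/ShowAnnotTypes [] def",
        "-f", src,
    ]


def strip_annotations_with_gs(src, dst):
    """Run Ghostscript; raises CalledProcessError if gs exits non-zero."""
    subprocess.run(gs_strip_annotations_command(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems with the PostScript snippets after `-c`.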
Here's a PDF as an example. It's generated from a JSTOR file using `magick -units pixelsperinch -density 280 input.pdf -format pdf output.pdf` to remove all metadata, etc., and convert the contents to images. I've zipped it in the original version with no OCR applied and in the final version where the highlighting is duplicated. Note that this happens pretty reliably for me with files from other sources.
How did you download and install the software?
Homebrew
OCRmyPDF version
16.5.0
Relevant log output
ocrmypdf 16.5.0 __main__.py:59
Running: ['tesseract', '--version'] __init__.py:133
Found tesseract 5.4.1 __init__.py:343
Running: ['tesseract', '--version'] __init__.py:133
Running: ['tesseract', '--version'] __init__.py:133
Running: ['gs', '--version'] __init__.py:133
Found gs 10.4.0 __init__.py:343
Running: ['gs', '--version'] __init__.py:133
Running: ['tesseract', '--list-langs'] __init__.py:133
stdout/stderr = List of available languages in "/Users/username/Documents/tessdata/" (16): __init__.py:73
deu
deu_frak
ell
eng
enm
fra
grc
ita
ita_old
lat
osd
script/Fraktur
script/Greek
script/Latin
spa
spa_old
pikepdf mmap enabled helpers.py:328
os.symlink(/Users/username/Desktop/AU6P9FDZ/2. magick_2pages.pdf, helpers.py:179
/var/folders/yl/xd3tsv2x1959s23ts4k1qt9m0000gr/T/ocrmypdf.io.s_f6cvgn/origin)
os.symlink(/var/folders/yl/xd3tsv2x1959s23ts4k1qt9m0000gr/T/ocrmypdf.io.s_f6cvgn/origin, helpers.py:179
/var/folders/yl/xd3tsv2x1959s23ts4k1qt9m0000gr/T/ocrmypdf.io.s_f6cvgn/origin.pdf)
Gathering info with 1 thread workers info.py:804
pikepdf mmap enabled helpers.py:328
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Using Tesseract OpenMP thread limit 3 tesseract_ocr.py:199
Start processing 2 pages concurrently ocr.py:96
pikepdf mmap enabled helpers.py:328
pikepdf mmap enabled helpers.py:328
1 Rasterize with png16m, rotation 0 _pipeline.py:539
2 Rasterize with png16m, rotation 0 _pipeline.py:539
1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', __init__.py:133
'-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r280.000105x280.000105', '-dPDFSTOPONERROR', '-o',
'-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f',
'/var/folders/yl/xd3tsv2x1959s23ts4k1qt9m0000gr/T/ocrmypdf.io.s_f6cvgn/origin.pdf']
2 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', __init__.py:133
'-sDEVICE=png16m', '-dFirstPage=2', '-dLastPage=2', '-r280.000105x280.000105', '-dPDFSTOPONERROR', '-o',
'-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f',
'/var/folders/yl/xd3tsv2x1959s23ts4k1qt9m0000gr/T/ocrmypdf.io.s_f6cvgn/origin.pdf']
1 Rotating output by 0 ghostscript.py:149
1 resolution (280.0096, 280.0096) _pipeline.py:618
1 Running: ['tesseract', '-l', 'eng', __init__.py:133
'/var/folders/yl/xd3tsv2x1959s23ts4k1qt9m0000gr/T/ocrmypdf.io.s_f6cvgn/000001_ocr.png',
'/var/folders/yl/xd3tsv2x1959s23ts4k1qt9m0000gr/T/ocrmypdf.io.s_f6cvgn/000001_ocr_hocr', 'hocr', 'txt']
2 resolution (280.0096, 280.0096) _pipeline.py:618
2 Running: ['tesseract', '-l', 'eng', __init__.py:133
'/var/folders/yl/xd3tsv2x1959s23ts4k1qt9m0000gr/T/ocrmypdf.io.s_f6cvgn/000002_ocr.png',
'/var/folders/yl/xd3tsv2x1959s23ts4k1qt9m0000gr/T/ocrmypdf.io.s_f6cvgn/000002_ocr_hocr', 'hocr', 'txt']
2 pikepdf.Matrix(0.257143, 0, 0, -0.257143, 0, 760.114) _hocr.py:203
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 159, 213) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 233, 1234) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 164, 1284) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 164, 1335) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 164, 1386) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 165, 1436) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 165, 1486) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 164, 1537) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 163, 1587) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 231, 1637) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 164, 1687) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 164, 1738) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 165, 1789) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 164, 1839) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 164, 1890) _hocr.py:323
2 eng _hocr.py:267
1 pikepdf.Matrix(0.257143, 0, 0, -0.257143, 0, 760.114) _hocr.py:203
2 pikepdf.Matrix(1, 0, 0, 1, 166, 1940) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 320, 200) _hocr.py:323
2 eng _hocr.py:267
1 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 165, 2029) _hocr.py:323
1 pikepdf.Matrix(1, 0, 0, 1, 366, 282) _hocr.py:323
2 eng _hocr.py:267
1 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 165, 2067) _hocr.py:323
1 pikepdf.Matrix(1, 0, 0, 1, 564, 421) _hocr.py:323
2 eng _hocr.py:267
1 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 166, 2106) _hocr.py:323
1 pikepdf.Matrix(1, 0, 0, 1, 543, 701) _hocr.py:323
2 eng _hocr.py:267
1 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 168, 2145) _hocr.py:323
1 pikepdf.Matrix(1, 0, 0, 1, 213, 800) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 168, 2184) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 215, 850) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 168, 2223) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 901) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 168, 2261) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 953) _hocr.py:323
2 eng _hocr.py:267
1 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 167, 2339) _hocr.py:323
1 pikepdf.Matrix(1, 0, 0, 1, 170, 1005) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 168, 2378) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1057) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 168, 2456) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1109) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 169, 2494) _hocr.py:323
2 eng _hocr.py:267
1 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 172, 2533) _hocr.py:323
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1161) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 170, 2572) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 170, 1212) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 170, 2611) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 170, 1265) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 971, 2028) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 240, 1315) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 968, 2066) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1365) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 968, 2143) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1416) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 968, 2182) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1468) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 969, 2221) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1519) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 969, 2260) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1572) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 969, 2299) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 170, 1624) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 969, 2337) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1675) _hocr.py:323
2 pikepdf.Matrix(0.99996, -0.00899964, 0.00899964, 0.99996, 969, 2377) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 969, 2454) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 170, 1727) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 970, 2492) _hocr.py:323
2 eng _hocr.py:267
1 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 969, 2531) _hocr.py:323
1 pikepdf.Matrix(1, 0, 0, 1, 170, 1779) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 970, 2570) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1831) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 972, 2609) _hocr.py:323
2 eng _hocr.py:267
1 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 754, 2762) _hocr.py:323
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1883) _hocr.py:323
2 eng _hocr.py:267
2 pikepdf.Matrix(1, 0, 0, 1, 839, 2794) _hocr.py:323
1 eng _hocr.py:267
2 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 1933) _hocr.py:323
2 pikepdf.Matrix(1, 0, 0, 1, 661, 2825) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 170, 2027) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 2067) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 2107) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 2146) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 169, 2186) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 168, 2226) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 168, 2266) _hocr.py:323
2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0 _graft.py:140
2 Grafting _graft.py:251
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 168, 2306) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 168, 2387) _hocr.py:323
2 Grafting with ctm pikepdf.Matrix(1, 0, 0, 1, 0, 0) _graft.py:294
1 eng _hocr.py:267
2 Page rotation: (content, auto) -> page = (0, 0) -> 0 _graft.py:165
1 pikepdf.Matrix(1, 0, 0, 1, 168, 2425) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 167, 2463) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 168, 2501) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(0.999988, 0.00499994, -0.00499994, 0.999988, 168, 2644) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 971, 2028) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 973, 2068) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 968, 2107) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 971, 2148) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 971, 2187) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 971, 2227) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 971, 2268) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 973, 2308) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 970, 2348) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 971, 2388) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 971, 2426) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 971, 2503) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 755, 2762) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 839, 2794) _hocr.py:323
1 eng _hocr.py:267
1 pikepdf.Matrix(1, 0, 0, 1, 661, 2825) _hocr.py:323
1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0 _graft.py:140
1 Grafting _graft.py:251
1 Grafting with ctm pikepdf.Matrix(1, 0, 0, 1, 0, 0) _graft.py:294
1 Page rotation: (content, auto) -> page = (0, 0) -> 0 _graft.py:165
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Postprocessing... ocr.py:144
Running: ['tesseract', '--version'] __init__.py:133
Linearizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recursing into Form XObject /OCR-bWPu0wv2KG52KckIrd7dgw in page 0 optimize.py:265
xref 26: skipping image because it is an SMask optimize.py:280
xref 16: treating as an optimization candidate optimize.py:282
Recursing into Form XObject /OCR-BLMEU-cDuckumi1ZbRkhyg in page 1 optimize.py:265
xref 27: skipping image because it is an SMask optimize.py:280
xref 20: treating as an optimization candidate optimize.py:282
XrefExt(xref=16, ext='.png') optimize.py:347
XrefExt(xref=20, ext='.png') optimize.py:347
Optimizable images: JPEGs: 0 PNGs: 2 optimize.py:352
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Recursing into Form XObject /OCR-bWPu0wv2KG52KckIrd7dgw in page 0 optimize.py:265
xref 26: skipping image because it is an SMask optimize.py:280
xref 16: treating as an optimization candidate optimize.py:282
Recursing into Form XObject /OCR-BLMEU-cDuckumi1ZbRkhyg in page 1 optimize.py:265
xref 27: skipping image because it is an SMask optimize.py:280
xref 20: treating as an optimization candidate optimize.py:282
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Recursing into Form XObject /OCR-bWPu0wv2KG52KckIrd7dgw in page 0 optimize.py:265
xref 26: skipping image because it is an SMask optimize.py:280
xref 16: treating as an optimization candidate optimize.py:282
Recursing into Form XObject /OCR-BLMEU-cDuckumi1ZbRkhyg in page 1 optimize.py:265
xref 27: skipping image because it is an SMask optimize.py:280
xref 20: treating as an optimization candidate optimize.py:282
Optimizable images: JBIG2 groups: 0 optimize.py:363
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Image optimization did not improve the file - optimizations will not be used optimize.py:720
Running: ['jbig2', '--version'] __init__.py:133
Running: ['pngquant', '--version'] __init__.py:133
Image optimization ratio: 1.00 savings: -0.0% _pipeline.py:989
Total file size ratio: 0.99 savings: -0.7% _pipeline.py:992
/var/folders/yl/xd3tsv2x1959s23ts4k1qt9m0000gr/T/ocrmypdf.io.s_f6cvgn/optimize.pdf -> _pipeline.py:1064
Desktop/AU6P9FDZ/magick_ocr.pdf
Files
2. magick_2pages.pdf.zip
5. magick_ocr_highlights.pdf.zip