Linebreak inserted between each letter #3650

rezemika · 2024-07-02T14:23:21Z

Description of the bug

Hey, thank you so much for this amazing tool!

I am using PyMuPDF to parse many official French documents. Each contains a cover, a table of contents, and pages of scanned content. The vast majority are read with no problem, but in a small number of them a line break is inserted between each letter of the content, making it almost unreadable.

Here are links to a few documents where this happens:

How to reproduce the bug

For instance, here is an example with the second mentioned document:

>>> import pymupdf
>>> f = "2023-04-28-ee04e9ccb016e7806a7cf92a48155834.pdf"
>>> doc = pymupdf.Document(f)
>>> doc[0].get_text("blocks")
[
    (164.6999969482422, 377.63739013671875, 436.3139953613281, 394.6753845214844, 'R\nE\nC\nU\nE\nI\nL\n \nD\nE\nS\n \nA\nC\nT\nE\nS\n \nA\nD\nMI\nN\nI\nS\nT\nR\nA\nT\nI\nF\nS\n', 0, 0),
    (225.0, 531.0374145507812, 376.00396728515625, 548.0614013671875, 'n\n°\n \n7\n7\n \nd\nu\n \n2\n8\n \na\nv\nr\ni\nl\n \n2\n0\n2\n3\n', 1, 0)
]

>>> pymupdf.version
('1.24.7', '1.24.4', '20240626000001')

And here is its first page as I see it:

Please let me know if I can provide any further information!

PS: Is there any "debugging tool" that would allow you to view text and content blocks as they're seen by PyMuPDF for easier analysis?

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.11

JorjMcKie · 2024-07-02T15:28:10Z

This is a MuPDF problem which I will transfer to their issue system.
test.pdf

MuPDF issue link: https://bugs.ghostscript.com/show_bug.cgi?id=707859

julian-smith-artifex-com · 2024-09-02T16:42:24Z

Fixed in 1.24.10.
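
For anyone who cannot upgrade to 1.24.10 right away, the symptom shown in the report can be detected and undone after extraction. The following is only a minimal sketch, not part of PyMuPDF: the helper names `looks_letter_split` and `rejoin_letters` and both thresholds are hypothetical, and it assumes that a block whose lines are all one or two characters long is affected by this bug (the report shows a two-character piece, 'MI', among the single letters).

```python
def looks_letter_split(text, max_piece_len=2, min_pieces=4):
    """Heuristic: True if every line of the block is at most max_piece_len chars.

    Both thresholds are assumptions chosen to match the output in the report.
    """
    pieces = text.split("\n")
    # Block text ends with a newline, which leaves an empty trailing piece.
    if pieces and pieces[-1] == "":
        pieces.pop()
    if len(pieces) < min_pieces:
        return False
    return all(len(piece) <= max_piece_len for piece in pieces)


def rejoin_letters(text):
    """Drop the spurious newlines from an affected block; leave others untouched."""
    if not looks_letter_split(text):
        return text
    return text.replace("\n", "")


# Block text copied from the report (fifth item of a "blocks" tuple).
broken = "n\n°\n \n7\n7\n \nd\nu\n \n2\n8\n \na\nv\nr\ni\nl\n \n2\n0\n2\n3\n"
print(rejoin_letters(broken))  # n° 77 du 28 avril 2023
```

With real data this would be applied to the fifth element of each tuple returned by `get_text("blocks")`. The heuristic can misfire on blocks that legitimately consist of very short lines, and any genuine line breaks inside an affected block are lost, so upgrading remains the proper fix.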

JorjMcKie added the "upstream bug" label (bug outside this package) on Jul 2, 2024

rezemika mentioned this issue on Jul 3, 2024 in "Lines of text are sometimes split into two" #3653 (closed)

julian-smith-artifex-com closed this as completed on Sep 2, 2024
