Linebreak inserted between each letter #3650

Closed
rezemika opened this issue Jul 2, 2024 · 2 comments
Labels
upstream bug bug outside this package

rezemika commented Jul 2, 2024

Description of the bug

Hey, thank you so much for this amazing tool!

I am using PyMuPDF to parse many official French documents. Each contains a cover, a table of contents, and pages of scanned content. The vast majority are read without any problem, but for a small number of them a line break is inserted between each letter of the content, making it almost unreadable.

Here are links to a few documents where this happens:

How to reproduce the bug

For instance, here is the output for the second document mentioned:

>>> import pymupdf
>>> f = "2023-04-28-ee04e9ccb016e7806a7cf92a48155834.pdf"
>>> doc = pymupdf.Document(f)
>>> doc[0].get_text("blocks")
[
    (164.6999969482422, 377.63739013671875, 436.3139953613281, 394.6753845214844, 'R\nE\nC\nU\nE\nI\nL\n \nD\nE\nS\n \nA\nC\nT\nE\nS\n \nA\nD\nMI\nN\nI\nS\nT\nR\nA\nT\nI\nF\nS\n', 0, 0),
    (225.0, 531.0374145507812, 376.00396728515625, 548.0614013671875, 'n\n°\n \n7\n7\n \nd\nu\n \n2\n8\n \na\nv\nr\ni\nl\n \n2\n0\n2\n3\n', 1, 0)
]

>>> pymupdf.version
('1.24.7', '1.24.4', '20240626000001')

And here is its first page as I see it:

[Image: cover of the second document mentioned]

Please let me know if I can provide any further information!

PS: Is there any "debugging tool" that would let me view text and content blocks as PyMuPDF sees them, for easier analysis?

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.11

@JorjMcKie
Collaborator

This is a MuPDF problem, which I will transfer to their issue tracker.
test.pdf

MuPDF issue link: https://bugs.ghostscript.com/show_bug.cgi?id=707859

@JorjMcKie JorjMcKie added the upstream bug bug outside this package label Jul 2, 2024
@julian-smith-artifex-com
Collaborator

Fixed in 1.24.10.
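
For anyone who cannot upgrade yet, the affected block text could probably be repaired after extraction. The following is only a minimal sketch, not part of PyMuPDF: `collapse_letter_breaks` and `clean_blocks` are hypothetical helper names, and the thresholds `max_len` and `ratio` are assumptions chosen to fit the output shown in the report. It treats a block as broken when nearly all of its lines are one or two characters long and then joins those lines back together.

```python
def collapse_letter_breaks(text, max_len=2, ratio=0.8):
    """Join the lines of a block when nearly all of them are very short.

    Hypothetical helper: max_len and ratio are assumed thresholds.
    """
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece left by the trailing newline
    if len(lines) < 3:
        return text  # too few lines to judge; leave untouched
    short = sum(1 for line in lines if len(line) <= max_len)
    if short / len(lines) >= ratio:
        return "".join(lines)  # spaces arrive as their own lines, so they survive
    return text


def clean_blocks(blocks):
    """Apply the repair to tuples shaped like the get_text("blocks") output
    in the report: (x0, y0, x1, y1, text, block_no, block_type)."""
    return [b[:4] + (collapse_letter_breaks(b[4]),) + b[5:] for b in blocks]


# Second block from the report above.
blocks = [
    (225.0, 531.0374145507812, 376.00396728515625, 548.0614013671875,
     "n\n°\n \n7\n7\n \nd\nu\n \n2\n8\n \na\nv\nr\ni\nl\n \n2\n0\n2\n3\n", 1, 0),
]
print(clean_blocks(blocks)[0][4])  # n° 77 du 28 avril 2023
```

Note that this heuristic would also merge a block that legitimately consists of very short lines (for example a column of single digits), so it is best applied only to documents known to show the problem.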
