Document.select() behaves weirdly in some particular kind of pdf files #3705

urvisism · 2024-07-19T12:52:08Z

Description of the bug

Document.select() does not work correctly on certain PDF files.
I want to extract text from PDF files. If a PDF has more than 30 pages, I extract only the first 30 pages.
The attached PDF file has 33 pages, so the code should select the first 30 pages and extract text from them.
Instead, it extracts only some bullets and dashes, and I can't figure out why.
The code works perfectly on other PDF files.
946f8445-6373-4f32-994c-04c495e2e7e9.pdf

Here is my code.

import os
import pathlib

import fitz


def get_all_page_from_pdf(document, last_page=None):
    if last_page:
        document.select(list(range(0, last_page)))
    if document.page_count > 30:
        document.select(list(range(0, 30)))
    return iter(page for page in document)


path = "path to the pdf file"
filename = os.path.basename(path)
file_type = pathlib.Path(filename).suffix

# Read the whole file into memory; the context manager closes the handle.
with open(path, "rb") as read_file:
    file_data = read_file.read()

doc = fitz.open(filename=filename, stream=file_data, filetype=file_type)

for i, page in enumerate(get_all_page_from_pdf(doc)):
    text = page.get_text()
    print(i, text)

How to reproduce the bug

You can reproduce the bug by running the script above on the attached PDF file.

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.10


JorjMcKie · 2024-07-19T14:23:56Z

The motivation behind your approach is unclear to me.
The .select() method modifies the document ... in quite a complex way.
If you indeed just want to restrict the number of pages from which to extract things, this is like using a sledgehammer to crack a nut.

If the goal is just to limit the number of pages, there are simpler ways of doing this, for example:

text = chr(12).join([page.get_text() for page in doc if page.number < 30])
pathlib.Path("out.txt").write_bytes(text.encode())
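
The snippet above joins the page texts with `chr(12)`, the form feed character, so the page boundaries survive in the output file. As a minimal sketch of how such a file could be read back, assuming the page texts themselves contain no form feeds (the sample strings and the temporary output path are stand-ins, not part of the thread):

```python
import pathlib
import tempfile

FORM_FEED = chr(12)  # the same character as "\f"

# Stand-in page texts; in the thread these come from page.get_text().
page_texts = ["text of page 1\n", "text of page 2\n", "text of page 3\n"]

# Write the pages the same way as the snippet above: form-feed separated.
out_path = pathlib.Path(tempfile.mkdtemp()) / "out.txt"
out_path.write_bytes(FORM_FEED.join(page_texts).encode())

# Read the file back and split on the form feed to get one string per page.
recovered = out_path.read_bytes().decode().split(FORM_FEED)
print(len(recovered))  # 3
```

If a page's text can itself contain a form feed, this split would produce extra entries, so a different separator or a structured format would be safer in that case.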

I do however notice a bug in the base library which in fact yields a PDF from which text can no longer be extracted - as you describe.
I will submit a bug and report the corresponding tracking number here.

JorjMcKie · 2024-07-19T14:32:03Z

Text from sub-selected out.pdf:
mutool-30.txt

MuPDF issue number: https://bugs.ghostscript.com/show_bug.cgi?id=707890

urvisism · 2024-07-19T14:38:14Z

The motivation behind the approach is to limit text extraction to a fixed number of pages for larger PDF files, since extraction can take a long time.
However, thanks for suggesting a different way of doing it.
Cheers.

JorjMcKie · 2024-07-19T14:43:39Z

Ok, I see.
But especially if your motivation is saving time, using .select() is a really bad idea - because it does so many things:

- creates a new table of contents that takes the deleted pages into account
- inspects all remaining pages for links to now-deleted pages
- builds a new object table (xref table)
- ...
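
Given the work listed above, the reporter's helper could keep its interface while leaving the document untouched. The following is only a sketch: `iter_first_pages`, `MAX_PAGES`, and `FakePage` are hypothetical names, and the fake document is a plain list standing in for a PyMuPDF `Document`, which the thread shows can be iterated page by page.

```python
from itertools import islice

MAX_PAGES = 30  # page cap taken from the original report


def iter_first_pages(document, last_page=None, max_pages=MAX_PAGES):
    """Lazily yield at most max_pages pages without modifying the document."""
    # Mirror the original helper: a falsy last_page means "no extra limit".
    limit = min(last_page, max_pages) if last_page else max_pages
    return islice(document, limit)


class FakePage:
    """Stand-in for a page object; only get_text() is needed here."""

    def __init__(self, number):
        self.number = number

    def get_text(self):
        return f"page {self.number}\n"


# A 33-page stand-in document, matching the page count in the report.
fake_doc = [FakePage(n) for n in range(33)]
texts = [page.get_text() for page in iter_first_pages(fake_doc)]
print(len(texts))  # 30
```

Because `islice` stops pulling pages once the limit is reached, pages beyond the cap are never loaded, and the document keeps all of its pages.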

JorjMcKie · 2024-07-19T18:46:01Z

Just an interim update:
The MuPDF team has already developed a solution. The fix should be part of one of the next releases.

JorjMcKie · 2024-07-19T18:48:56Z

> The motivation behind the approach is to limit text extraction based on pages for larger pdf files as the extraction can take more time. However, Thanks for suggesting a different way of doing it. Cheers.

Probably the approach with the best performance is this:

text = ""
for page in doc:
    if page.number >= 30:  # leave the iterator immediately
        break
    text += page.get_text()

# etc.

urvisism · 2024-07-20T05:48:24Z

Thank you, Jorj.

julian-smith-artifex-com · 2024-09-02T16:42:51Z

Fixed in 1.24.10.

urvisism changed the title ~~doc.select() behaves weirdly in some particular type of pdf files~~ doc.select() behaves weirdly in some particular kind of pdf files Jul 19, 2024

urvisism changed the title ~~doc.select() behaves weirdly in some particular kind of pdf files~~ Document.select() behaves weirdly in some particular kind of pdf files Jul 19, 2024

JorjMcKie added the "upstream bug" label (bug outside this package) Jul 19, 2024

JorjMcKie added the "fix developed" label (release schedule to be determined) Jul 19, 2024

julian-smith-artifex-com closed this as completed Sep 2, 2024
