Document.select() behaves weirdly in some particular kind of pdf files #3705

Closed
urvisism opened this issue Jul 19, 2024 · 8 comments
Labels: fix developed (release schedule to be determined), upstream bug (bug outside this package)

Comments


urvisism commented Jul 19, 2024

Description of the bug

Document.select() does not work correctly on certain PDF files.
I want to extract text from PDF files. If a PDF has more than 30 pages, I select only the first 30 pages.
The attached PDF file has 33 pages, so the code should select the first 30 pages and extract their text.
Instead, it extracts only some bullets and dashes from the file, and I can't figure out why.
The code works perfectly on other PDF files.
946f8445-6373-4f32-994c-04c495e2e7e9.pdf

Here is my code.

import os
import pathlib

import fitz


def get_all_page_from_pdf(document, last_page=None):
    # Optionally keep only the first `last_page` pages.
    if last_page:
        document.select(list(range(0, last_page)))
    # Never process more than the first 30 pages.
    if document.page_count > 30:
        document.select(list(range(0, 30)))
    return iter(page for page in document)


path = "path to the pdf file"
filename = os.path.basename(path)
file_type = pathlib.Path(filename).suffix

# Read the whole file into memory and close the handle afterwards.
with open(path, "rb") as read_file:
    file_data = read_file.read()

doc = fitz.open(filename=filename, stream=file_data, filetype=file_type)

for i, page in enumerate(get_all_page_from_pdf(doc)):
    text = page.get_text()
    print(i, text)

How to reproduce the bug

You can reproduce the bug by running the script above on the attached PDF file.

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.10

urvisism changed the title from "doc.select() behaves weirdly in some particular type of pdf files" to "doc.select() behaves weirdly in some particular kind of pdf files" on Jul 19, 2024
urvisism changed the title from "doc.select() behaves weirdly in some particular kind of pdf files" to "Document.select() behaves weirdly in some particular kind of pdf files" on Jul 19, 2024
@JorjMcKie
Collaborator

The motivation behind your approach is unclear to me.
The .select() method modifies the document ... in quite a complex way.
If you just want to restrict the number of pages from which to extract text, this is like using a sledgehammer to crack a nut.

If the goal is only to limit the number of pages, do it a different way:

text = chr(12).join([page.get_text() for page in doc if page.number < 30])
pathlib.Path("out.txt").write_bytes(text.encode())
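
For readers unfamiliar with `chr(12)`: it is the form feed character (`"\f"`), used above as a page separator. A minimal sketch of why that is convenient, assuming no page text contains a form feed itself; `join_pages` and `split_pages` are hypothetical helper names, not part of PyMuPDF:

```python
FORM_FEED = chr(12)  # the same character as "\f"


def join_pages(page_texts):
    """Join per-page strings into one string, separated by form feeds."""
    return FORM_FEED.join(page_texts)


def split_pages(text):
    """Recover the per-page strings from a form-feed-joined string."""
    return text.split(FORM_FEED)
```

Writing the joined string to a file therefore keeps the page boundaries recoverable later, as long as the assumption above holds.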

I do, however, notice a bug in the base library (MuPDF) that indeed yields a PDF from which text can no longer be extracted, as you describe.
I will submit a bug report and post the tracking number here.

@JorjMcKie
Collaborator

Text from sub-selected out.pdf:
mutool-30.txt

MuPDF issue number: https://bugs.ghostscript.com/show_bug.cgi?id=707890

JorjMcKie added the "upstream bug" (bug outside this package) label on Jul 19, 2024
@urvisism
Author

The motivation is to limit text extraction to a fixed number of pages for larger PDF files, since extraction can take a long time.
However, thanks for suggesting a different way of doing it.
Cheers.

@JorjMcKie
Collaborator

Ok, I see.
But especially if your motivation is saving time, using .select() is a really bad idea, because it does so many things:

  • Creates a new table of contents that takes the deleted pages into account.
  • Inspects all remaining pages for links to now-deleted pages.
  • Builds a new object table (xref table).
  • ...

JorjMcKie added the "fix developed" (release schedule to be determined) label on Jul 19, 2024
@JorjMcKie
Collaborator

Just an interim update:
The MuPDF team has already developed a solution. The fix should be part of one of the next releases.

@JorjMcKie
Collaborator

> The motivation behind the approach is to limit text extraction based on pages for larger pdf files as the extraction can take more time. However, Thanks for suggesting a different way of doing it. Cheers.

Probably the approach with the best performance is this:

text = ""
for page in doc:
    if page.number >= 30:  # leave the iterator immediately
        break
    text += page.get_text()

# etc.
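
The same early-exit idea can also be written with `itertools.islice`, which stops pulling pages once the limit is reached. This is only a sketch: `extract_text` and `max_pages` are hypothetical names, and the function assumes nothing more than an iterable of objects that expose `get_text()`, as the pages in the loop above do.

```python
from itertools import islice


def extract_text(pages, max_pages=30):
    """Concatenate text from at most `max_pages` page-like objects.

    `pages` is any iterable whose items have a `get_text()` method.
    The iterable is only read from, never modified.
    """
    # islice requests exactly max_pages items (or fewer if the
    # iterable is shorter), so later pages are never touched.
    parts = [page.get_text() for page in islice(pages, max_pages)]
    # A single join avoids repeated string concatenation.
    return "".join(parts)
```

With PyMuPDF this might be called as `extract_text(doc)`, relying on the document being iterable page by page as in the loop above. Unlike `.select()`, it leaves the document untouched.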

@urvisism
Author

Thank you, Jorj.

@julian-smith-artifex-com
Collaborator

Fixed in 1.24.10.
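
For code that has to run on several PyMuPDF releases, a small guard can tell whether the installed release predates this fix. The sketch below is an illustration only: `parse_version` and `predates_fix` are hypothetical helpers, they assume plain numeric version strings (no suffixes such as "rc1"), and the thread only confirms the bug on 1.24.7 and the fix in 1.24.10.

```python
def parse_version(version):
    """Turn a dotted release string such as "1.24.10" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


FIXED_IN = parse_version("1.24.10")  # release named in this thread


def predates_fix(installed_version):
    """Return True when the given release is older than the fixed one."""
    # Compare as integer tuples: as plain strings, "1.24.7" would sort
    # after "1.24.10" and give the wrong answer.
    return parse_version(installed_version) < FIXED_IN
```

The installed version string could be obtained with `importlib.metadata.version("PyMuPDF")` from the standard library.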
