PdfReader - Extract images from specific pages #2536

FrsECM · 2024-03-23T18:40:31Z

FrsECM
Mar 23, 2024

Environment

Python 3.8
WSL Ubuntu 22.04
Windows11
pypdf 4.1.0

Issue

I generated a very simple PDF with LibreOffice Writer:
test_image.pdf

This PDF has two pages: one containing a short text, the other containing an image.

I want to process the PDF page by page and get the image only from the second page.

The code to reproduce the issue is here :

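from pypdf import PdfReader

pdf_file = "test_image.pdf"  # the two-page file attached above
pdf = PdfReader(pdf_file)
for i, page in enumerate(pdf.pages):
    print(f'--- Extracting page {i}')
    print(page.extract_text())
    # Number of images exposed by page.images for this page
    print(len(page.images))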

The result is below:

--- Extracting page 0
Test page 1
1
--- Extracting page 1
1

I expected page 0 to report 0 images, so that the image would be extracted only from the second page.
I don't know whether this is normal behaviour.

How can I achieve this?
Thanks,
Regards

Answered by pubpub-zz

pubpub-zz · 2024-03-23T19:29:07Z

pubpub-zz
Mar 23, 2024
Maintainer

Your PDF has the image attached to both pages.

pypdf does not check whether the images are actually "called" in the page content.

FrsECM · 2024-03-23T19:45:57Z

FrsECM
Mar 23, 2024
Author

I don't know much about how PDF works under the hood.
What is the name of the GUI tool you are using?

Can we get this ownership information from somewhere when the "PageObject" is created?

2 replies

stefan6419846 Mar 23, 2024
Maintainer

The tool shown is the standalone Apache PDFBox Debugger: https://pdfbox.apache.org/download.html

Can we get this ownership information from somewhere when the "PageObject" is created?

What do you mean by this? You already have the ownership information if you use page.images.

pubpub-zz Mar 23, 2024
Maintainer

The tool I used is PDFBox in debug mode: https://pdfbox.apache.org/download.cgi
If you want to know whether the page content actually uses the image, you have to look in the contents for a Do operation whose parameter is the key of the resource (here /Im7).
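
A minimal sketch of that check, assuming pypdf's page.get_contents() returns a ContentStream (or None) whose operations attribute is a list of (operands, operator) pairs with the operator given as bytes. The helper names xobjects_drawn and images_drawn_on_page are hypothetical and not part of pypdf. The sketch only inspects the page's own content stream, not nested form XObjects, and a Do operation can also target a form rather than an image.

```python
from typing import Any, Iterable, List, Set, Tuple

# One parsed content-stream operation: (operands, operator)
Operation = Tuple[List[Any], bytes]


def xobjects_drawn(operations: Iterable[Operation]) -> Set[str]:
    """Return the resource keys (e.g. "/Im7") passed to a Do operator."""
    names: Set[str] = set()
    for operands, operator in operations:
        # "Do" paints the XObject named by its single operand
        if operator == b"Do" and operands:
            names.add(str(operands[0]))
    return names


def images_drawn_on_page(page: Any) -> Set[str]:
    """Hypothetical wrapper for a pypdf PageObject (assumption, see above)."""
    contents = page.get_contents()  # assumed to be None for an empty page
    if contents is None:
        return set()
    return xobjects_drawn(contents.operations)
```

With the file from this discussion, one would expect the first page to yield an empty set (only BT ... ET text operations) and the second page to yield a set containing /Im7.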

FrsECM · 2024-03-23T20:34:14Z

FrsECM
Mar 23, 2024
Author

I mean that the resources seem to be shared between all pages, as if they were a reference. "Ownership" may not be the right term.

I saw with a GUI tool that on the first page there is nothing in the stream concerning the image; there are only BT and ET, corresponding to the text.

On the second page, however, the stream does reference the image.

Maybe we could use the content to confirm whether the image is owned by the PageObject in that case?

That would make it possible to filter the images property before returning it.

2 replies

FrsECM Mar 23, 2024
Author

The tool I used is PDFBox in debug mode: https://pdfbox.apache.org/download.cgi If you want to know whether the page content actually uses the image, you have to look in the contents for a Do operation whose parameter is the key of the resource (here /Im7).

I saw your answer just after posting mine.

Do you think it is relevant to modify the default behaviour of the library on that point?

I can propose a change for it in a PR.

pubpub-zz Mar 24, 2024
Maintainer

I saw your answer just after posting mine.

Do you think it is relevant to modify the default behaviour of the library on that point?

I would not modify the current extraction, as there could be other reasons for an image not to be displayed (I have already encountered this): if an image lies outside the page, it will still be present in get_contents().

You can propose a page function is_image_displayed(image_id: str). It should at least consider:

whether the image is called in the contents or in XObjects/Forms within the page
whether the image intersects the cropped area
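
The second criterion can be sketched with plain geometry. In PDF, an image is painted into the unit square of user space, and the current transformation matrix (CTM) in effect at the Do operation maps that square onto the page. The sketch below assumes the CTM has already been obtained by tracking the cm operations in the content stream, which is not shown here. The names image_bbox and boxes_intersect are hypothetical, and the axis-aligned bounding box is only an approximation for rotated or skewed images.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]  # (a, b, c, d, e, f)
Box = Tuple[float, float, float, float]  # (left, bottom, right, top)


def image_bbox(ctm: Matrix) -> Box:
    """Bounding box of the unit square after applying the CTM."""
    a, b, c, d, e, f = ctm
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    # PDF maps (x, y) to (a*x + c*y + e, b*x + d*y + f)
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(first: Box, second: Box) -> bool:
    """True if the two boxes overlap with a non-empty area."""
    return (
        first[0] < second[2]
        and second[0] < first[2]
        and first[1] < second[3]
        and second[1] < first[3]
    )
```

For example, with a crop box of (0, 0, 595, 842), an image drawn with the matrix (100, 0, 0, 50, 20, 30) overlaps the visible area, while the same image translated to x = 700 does not.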
