Rotated a pdf and Trying to extract images from the pdf it extracted unrotated pdfs #2700

Closed
Tejareddy94 opened this issue Jun 3, 2024 · 4 comments

Comments


Tejareddy94 commented Jun 3, 2024

We have a use case where pages in a PDF are rotated. We rotate them with the qpdf tool using its flatten-rotation option. After that, we try to extract images from the PDF, but the extracted images are unrotated, even after calling page.transfer_rotation_to_content().

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
 Linux-6.5.0-35-generic-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
 pypdf==4.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue:
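from pypdf import PdfReader

reader = PdfReader(self.pdf_path)
file_paths = []

for page_index, page in enumerate(reader.pages):
    print(page.mediabox.height, page.mediabox.width, page.rotation)
    # Moves the page rotation into the content stream; the embedded image data is not changed.
    page.transfer_rotation_to_content()
    for image in page.images:
        # The path depends only on the page index, so several images on one page overwrite each other.
        file_path = self.output_path.format(page_no=str(page_index))
        file_paths.append(file_path)
        with open(file_path, "wb") as fp:
            fp.write(image.data)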

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

sv600_c_normal.pdf
The file above is the original PDF.
The file below is the PDF rotated with the qpdf tool:

qpdf original_pdf rotated_tmp_file_path --rotate=90 --flatten-rotation

Rotated pdf
2na5UUZDvC7M6ft1YDpsyPvz (copy).pdf

Traceback

When I extract the image from the rotated PDF, I get the image without rotation, although I expected the rotated image.
testinnew-page-0

Can you point out where the mistake is, or whether I am doing something wrong?
Thank you.

@stefan6419846
Collaborator

The main difference between the different PDF files is that the rotated page uses the 0 -1 1 0 0 597.12 cm definition before inserting the main image, which basically defines the transformation matrix. The image (most likely) is the same in both cases for this reason, thus the output is correct in my opinion.

Slightly related to #2592.
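To make the matrix in the comment above concrete, here is a minimal sketch that decodes a `cm` operand list into a rotation angle and applies it to a point. The helper names `rotation_from_cm` and `apply_cm` are hypothetical and not part of pypdf; the sketch only uses the standard PDF convention that `a b c d e f cm` maps `(x, y)` to `(a*x + c*y + e, b*x + d*y + f)`, and it assumes the matrix is a pure rotation plus translation (no scaling or skew).

```python
import math


def rotation_from_cm(a: float, b: float, c: float, d: float) -> float:
    """Return the counterclockwise rotation in degrees encoded by a cm matrix.

    Hypothetical helper; assumes a pure rotation (no scaling or skew).
    """
    return math.degrees(math.atan2(b, a))


def apply_cm(matrix, point):
    """Map a point through the matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


# Operands quoted in the comment above: 0 -1 1 0 0 597.12 cm
cm = (0, -1, 1, 0, 0, 597.12)
angle = rotation_from_cm(*cm[:4])   # negative means clockwise
origin = apply_cm(cm, (0, 0))       # where the old origin ends up
print(angle, origin)
```

A negative angle means a clockwise rotation, which is consistent with the `--rotate=90` passed to qpdf. The pixel data of the embedded image is not affected by this matrix; only its placement on the page is.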

@Tejareddy94
Author

Is there a workaround or solution for extracting the rotated image, or is it not possible to get the rotated image at all?

If it is not possible directly, what is the best way to obtain the rotated image?

@stefan6419846 stefan6419846 changed the title Rotated a pdf and Trying to extract images form the pdf it extracted unrotated pdfs Rotated a pdf and Trying to extract images from the pdf it extracted unrotated pdfs Jun 3, 2024
@stefan6419846
Collaborator

The embedded images keep their original orientation, so pypdf extracts them as stored. For your specific example, you might want to retrieve the page rotation and apply it to your extracted image accordingly.
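The suggestion above could be sketched roughly as follows. This is an illustration under stated assumptions, not the maintainers' implementation: `pil_angle` and `extract_rotated_images` are hypothetical names, the output file naming is made up, and the `rotation` override exists because a file processed with `--flatten-rotation` will likely report a page rotation of 0, so the known angle may have to be passed in explicitly. It relies on `page.rotation`, `page.images`, and the `.image` attribute (a PIL image) of the extracted image objects.

```python
from pathlib import Path


def pil_angle(page_rotation: int) -> int:
    """Convert a clockwise PDF page rotation to a counterclockwise PIL angle."""
    return (-page_rotation) % 360


def extract_rotated_images(pdf_path, out_dir, rotation=None):
    """Save every embedded image, rotated by the page rotation (hypothetical helper)."""
    # Imported here so pil_angle() stays usable without pypdf installed.
    from pypdf import PdfReader

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    reader = PdfReader(pdf_path)
    for page_index, page in enumerate(reader.pages):
        # Use the explicit override if given, else the page's own rotation.
        angle = page.rotation if rotation is None else rotation
        for image_index, image in enumerate(page.images):
            # expand=True grows the canvas so a 90 degree turn is not cropped.
            rotated = image.image.rotate(pil_angle(angle), expand=True)
            if rotated.mode == "CMYK":
                rotated = rotated.convert("RGB")  # PNG cannot store CMYK
            target = out_dir / f"page-{page_index}-image-{image_index}.png"
            rotated.save(target)
            written.append(target)
    return written
```

For the files in this issue, one option would be to call `extract_rotated_images("sv600_c_normal.pdf", "out", rotation=90)` on the original file, since 90 is the angle that was passed to qpdf.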

@Tejareddy94
Author

Okay, thank you @stefan6419846
