Impossible to decode XFormObject ... #1269

DL6ER · 2022-08-25T15:27:32Z

I'm trying to convert a large number of arbitrary PDFs found on the web to plain text for further analysis of potential statistical abnormalities.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

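import PyPDF2

# Extraction has to happen while the file is still open.
with open("p.a. trento.pdf", "rb") as f:
    # Non-strict mode: correctable PDF problems are not treated as fatal.
    pdfreader = PyPDF2.PdfFileReader(f, strict=False)
    # Join the extracted text of all pages into one string.
    full_content = " ".join(page.extractText() for page in pdfreader.pages)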

PDF used above: p.a. trento.pdf
Another example: qqplots.pdf

I can search for further files with errors, if needed (the two examples above are both plot files). I will obviously participate in testing and verifying any proposed bugfixes.

Traceback

There is no crash; however, there are 4164 warnings like:

 impossible to decode XFormObject /M0
[...]
 impossible to decode XFormObject /M3
[...]
 impossible to decode XFormObject /M5
[...]
 impossible to decode XFormObject /F1-DejaVuSans-minus

What do I expect?

I'd like to just get the text without flooding my log file with warnings (during a simple test on a few hundred files, the log file grew into the gigabytes).
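Until a fix lands, a stopgap on the caller's side might look like the sketch below. It assumes the messages reach the log through Python's standard `logging` module; if they are emitted through the `warnings` module instead, `warnings.filterwarnings` would be the analogous tool. The names `DropXFormWarnings`, `make_quiet_handler` and the `demo.pdf` logger are hypothetical and only serve the illustration; they are not part of PyPDF2.

```python
import io
import logging


class DropXFormWarnings(logging.Filter):
    """Drop log records that contain the XFormObject warning text."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False discards the record; everything else passes through.
        return "impossible to decode XFormObject" not in record.getMessage()


def make_quiet_handler(stream) -> logging.Handler:
    """Build a stream handler that discards the XFormObject warnings."""
    handler = logging.StreamHandler(stream)
    # The filter sits on the handler so it also sees records that
    # propagate up from child loggers.
    handler.addFilter(DropXFormWarnings())
    return handler


# Demonstration with a stand-in logger (hypothetical name, not PyPDF2's own).
buffer = io.StringIO()
demo_logger = logging.getLogger("demo.pdf")
demo_logger.propagate = False
demo_logger.addHandler(make_quiet_handler(buffer))

demo_logger.warning(" impossible to decode XFormObject /M0")
demo_logger.warning("some other warning")

captured = buffer.getvalue()
```

In a real run the handler would be attached to whichever logger actually receives the library's messages, with the log file as its target. This only hides the symptom; the text inside the affected XForm objects is still not extracted.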

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 25, 2022

ROB : fix errors/warnings on no /resources with extract_text

755023d

fix py-pdf#1272 (in text) and py-pdf#1269 (in Xform)

pubpub-zz mentioned this issue Aug 25, 2022

ROB: Fix errors/warnings on no /Resources within extract_text #1276

Merged

This was referenced Aug 26, 2022

IndexError: list index out of range #1278

Closed

TypeError: 'NoneType' object is not iterable #1279

Closed

MartinThoma closed this as completed in af9c01b Aug 28, 2022
