ValueError: invalid literal for int() with base 10: b'' #1270

Closed
DL6ER opened this issue Aug 25, 2022 · 10 comments
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-robustness-issue: From a users perspective, this is about robustness
nf-performance: Non-functional change: Performance

Comments

@DL6ER

DL6ER commented Aug 25, 2022

See #1269 for further details; this reports another issue I've come across.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

import PyPDF2
with open("Introduction to Programming Using Python ( PDFDrive ).pdf", "rb") as f:
  pdfreader = PyPDF2.PdfFileReader(f, strict=True)
  metadata = pdfreader.metadata

PDF file used above: Introduction to Programming Using Python ( PDFDrive ).pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    metadata = pdfreader.metadata
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 327, in metadata
    obj = self.trailer[TK.INFO]
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 150, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 163, in get_object
    obj = self.pdf.get_object(self)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1151, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 822, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 269, in read_from_stream
    value = read_object(stream, pdf, forced_encoding)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 851, in read_object
    return NumberObject.read_from_stream(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 299, in read_from_stream
    return NumberObject(num)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 274, in __new__
    val = int(value)
ValueError: invalid literal for int() with base 10: b''

@DL6ER
Author

DL6ER commented Aug 28, 2022

More files triggering this (sometimes through different paths):

  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1538, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1510, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1154, in _extract_text
    obj[content_key].get_object() if isinstance(content_key, str) else obj
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 150, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 163, in get_object
    obj = self.pdf.get_object(self)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1151, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 851, in read_object
    return NumberObject.read_from_stream(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 299, in read_from_stream
    return NumberObject(num)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 274, in __new__
    val = int(value)
ValueError: invalid literal for int() with base 10: b''

  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 327, in metadata
    obj = self.trailer[TK.INFO]
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 150, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 163, in get_object
    obj = self.pdf.get_object(self)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1151, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 822, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 269, in read_from_stream
    value = read_object(stream, pdf, forced_encoding)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 851, in read_object
    return NumberObject.read_from_stream(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 299, in read_from_stream
    return NumberObject(num)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 274, in __new__
    val = int(value)
ValueError: invalid literal for int() with base 10: b''

  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1538, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1510, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1146, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_cmap.py", line 53, in build_char_map
    sp_width = compute_space_width(ft, sp, space_width)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_cmap.py", line 312, in compute_space_width
    ft1 = ft["/DescendantFonts"][0].get_object()  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 163, in get_object
    obj = self.pdf.get_object(self)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1151, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 822, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 266, in read_from_stream
    key = read_object(stream, pdf)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 851, in read_object
    return NumberObject.read_from_stream(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 299, in read_from_stream
    return NumberObject(num)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 274, in __new__
    val = int(value)
ValueError: invalid literal for int() with base 10: b''

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1538, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1510, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1157, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 689, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 719, in __parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 851, in read_object
    return NumberObject.read_from_stream(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 299, in read_from_stream
    return NumberObject(num)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 274, in __new__
    val = int(value)
ValueError: invalid literal for int() with base 10: b''

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1538, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1510, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1157, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 689, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 719, in __parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 851, in read_object
    return NumberObject.read_from_stream(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 299, in read_from_stream
    return NumberObject(num)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 274, in __new__
    val = int(value)
ValueError: invalid literal for int() with base 10: b'0,0'

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1538, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1510, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1146, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_cmap.py", line 17, in build_char_map
    ft: DictionaryObject = obj["/Resources"]["/Font"][font_name]  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 150, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 163, in get_object
    obj = self.pdf.get_object(self)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1151, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 822, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 269, in read_from_stream
    value = read_object(stream, pdf, forced_encoding)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 851, in read_object
    return NumberObject.read_from_stream(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 299, in read_from_stream
    return NumberObject(num)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 274, in __new__
    val = int(value)
ValueError: invalid literal for int() with base 10: b''
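
All of the tracebacks above end in the same call, `int(value)` inside `NumberObject.__new__`, with either an empty byte string or `b'0,0'` as the argument. A minimal sketch that reproduces only that final step outside PyPDF2 (the helper name `try_int` is made up for illustration and is not part of the library):

```python
def try_int(raw: bytes) -> str:
    """Return the int() result as text, or the ValueError message."""
    try:
        return str(int(raw))
    except ValueError as exc:
        return f"ValueError: {exc}"


# A well-formed number, then the two byte strings seen in the tracebacks.
for raw in (b"42", b"", b"0,0"):
    print(repr(raw), "->", try_int(raw))
```

This only shows that the bytes handed to `int()` are not a valid integer literal; it does not show why the parser ended up with those bytes.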

@MartinThoma added the "Has MCVE" label Aug 28, 2022
@MartinThoma
Member

Validating the PDF with demo.verapdf.org:

Exception: Couldn't parse stream caused by exception: PDFParser::GetXRefSection(...)can not locate xref table

So this is not a PyPDF2 bug, but a robustness issue. An exception is expected (at least in strict mode), but we could give a better one.
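
As a rough illustration of what "a better one" could look like, here is a sketch of a number parser that reports what it found and where. Everything in it is an assumption for illustration: `MalformedNumberError`, `parse_integer`, and the `offset` parameter are hypothetical names, not PyPDF2 API, and falling back to 0 in non-strict mode is only one possible policy.

```python
class MalformedNumberError(ValueError):
    """Hypothetical exception type; not part of PyPDF2."""


def parse_integer(raw: bytes, offset: int, strict: bool = True) -> int:
    """Parse raw bytes as an integer, with a more descriptive error.

    offset is the (assumed) byte position in the file, used only for
    the error message.
    """
    try:
        return int(raw)
    except ValueError:
        if strict:
            # Replace the bare int() error with one that names the problem.
            raise MalformedNumberError(
                f"expected an integer at byte offset {offset}, got {raw!r}; "
                "the PDF may be malformed"
            ) from None
        # Assumed non-strict policy: substitute 0 and carry on.
        return 0
```

Subclassing `ValueError` keeps existing `except ValueError` handlers working while giving callers a more specific type to catch.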

@MartinThoma added the "is-robustness-issue" label Aug 28, 2022
@DL6ER
Author

DL6ER commented Aug 28, 2022

Thanks for sharing this URL. I tested a few PDFs I posted in my updated comment above and have seen some errors even when the PDFs can be read normally with a PDF viewer.

E.g.

Exception: Caught unexpected runtime exception during validation caused by exception: Wrapped org.verapdf.exceptions.VeraPDFParserException: Error while parsing object : 10 0 caused by exception: Error while parsing object : 10 0 caused by exception: PDFParser::GetDictionary()invalid pdf dictonary

for Effective Java 3rd Edition by Joshua Bloch.pdf

@pubpub-zz
Collaborator

This is fixed by PR #1315.
This PR can not be closed

@MartinThoma
Member

@pubpub-zz I guess you mean that this issue (#1270) can be closed?

@MartinThoma
Member

I've just noticed that we now call logger_warning(f"FloatObject ({value}) invalid; use 0.0 instead", __name__) a lot of times. Actually so much that I think it affects performance.

@MartinThoma
Member

Oh wow, even when commenting the log lines out executing the sample takes REALLY long! Maybe I should add this to one of the performance tickets

@pubpub-zz
Collaborator

I would suggest creating a new ticket that focuses only on the performance issue.
Output to stderr may be the cause.
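
If the volume of repeated warnings written to stderr does turn out to matter, one user-side option is to suppress duplicates with a standard-library `logging.Filter`. The sketch below is an assumption-laden illustration: `DedupFilter`, `ListHandler`, and the logger name `dedup_demo` are invented here, and nothing in it is PyPDF2 code. It also would not help with the slowness observed when the log lines are commented out.

```python
import logging


class DedupFilter(logging.Filter):
    """Let each distinct message through at most `limit` times."""

    def __init__(self, limit: int = 1) -> None:
        super().__init__()
        self.limit = limit
        self.counts = {}

    def filter(self, record: logging.LogRecord) -> bool:
        key = (record.name, record.levelno, record.getMessage())
        seen = self.counts.get(key, 0)
        self.counts[key] = seen + 1
        return seen < self.limit


class ListHandler(logging.Handler):
    """Collect messages in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


logger = logging.getLogger("dedup_demo")
logger.propagate = False  # keep the demo output off the root logger
handler = ListHandler()
handler.addFilter(DedupFilter(limit=1))
logger.addHandler(handler)

# Emit the same warning many times; only the first reaches the handler.
for _ in range(1000):
    logger.warning("FloatObject (%s) invalid; use 0.0 instead", "b''")

print(len(handler.messages))
```

Attaching the filter to the handler rather than to the logger means it also applies to records that propagate up from child loggers.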

@pubpub-zz
Collaborator

+1?

@MartinThoma added the "nf-performance" label Sep 7, 2022
@MartinThoma
Member

I've moved that to #1329 :-)


3 participants