UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x45 in position 0: truncated data #1293

DL6ER · 2022-08-28T13:37:21Z

See #1269 for further details. This reports another issue I've come across.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

import PyPDF2
with open("2007,ASurveyofImageClassificationBasedTechniques.pdf", "rb") as f:
  pdfreader = PyPDF2.PdfFileReader(f, strict=False)
  full_content = " ".join([page.extractText() for page in pdfreader.pages])

PDF used above: 2007,ASurveyofImageClassificationBasedTechniques.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1538, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1510, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1146, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_cmap.py", line 22, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_cmap.py", line 185, in parse_to_unicode
    process_rg, process_char = process_cm_line(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_cmap.py", line 247, in process_cm_line
    parse_bfchar(l, map_dict, int_entry)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_cmap.py", line 292, in parse_bfchar
    map_to = unhexlify(lst[1]).decode(
  File "/usr/lib/python3.8/encodings/utf_16_be.py", line 16, in decode
    return codecs.utf_16_be_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x45 in position 0: truncated data

The PDF can be read using a normal PDF viewer.

This may be related to #969 (comment)

pubpub-zz · 2022-09-01T11:25:59Z

The bfchar section of this PDF uses 2-digit hex codes instead of 4-digit ones.
The PR (#1310) fixes the issue.
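
To illustrate the cause, here is a minimal sketch of why a 2-digit bfchar code triggers the error in the traceback, and one way a parser could tolerate it. The helper names `strict_decode_fails` and `decode_bfchar_target` are hypothetical, and treating short codes as Latin-1 is an assumption made for illustration; this is not necessarily what the PR does.

```python
from binascii import unhexlify


def strict_decode_fails(hex_code: bytes) -> bool:
    """Return True if decoding the hex code as UTF-16-BE raises an error."""
    try:
        # A 2-digit code yields a single byte, which is half a UTF-16 code unit
        unhexlify(hex_code).decode("utf-16-be")
    except UnicodeDecodeError:
        return True
    return False


def decode_bfchar_target(hex_code: bytes) -> str:
    """Decode a bfchar destination code (hypothetical helper).

    Assumption: codes shorter than 4 hex digits are single-byte values
    and are mapped one-to-one to code points (Latin-1).
    """
    raw = unhexlify(hex_code)
    if len(hex_code) < 4:
        return raw.decode("latin-1")
    return raw.decode("utf-16-be")


print(strict_decode_fails(b"45"))    # single byte 0x45: truncated data
print(decode_bfchar_target(b"45"))   # tolerant path for a 2-digit code
print(decode_bfchar_target(b"0045")) # regular 4-digit code
```

Codes with an odd number of bytes greater than one would still fail in this sketch; a real parser would need to decide how to handle them.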

MartinThoma · 2022-09-02T05:57:00Z

The fix is in main and will be released on Sunday to PyPI 🎉 Very nice work everybody 🙌

MartinThoma added the workflow-text-extraction label Aug 28, 2022

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 1, 2022

ROB : cope with 2 digit codes in bfchar

40df2fd

fixes py-pdf#1293

pubpub-zz mentioned this issue Sep 1, 2022

ROB: Cope with 2 digit codes in bfchar #1310

Merged

MartinThoma closed this as completed in #1310 Sep 2, 2022

MartinThoma closed this as completed in 1e089c0 Sep 2, 2022
