[Bug]: NotImplementedError: not sure how to get colorspace #1315

Open
macdeport opened this issue May 19, 2024 · 2 comments

macdeport commented May 19, 2024

Describe the bug

Rare error on an Adobe InDesign 18.0 file (Macintosh)

Steps to reproduce

$ ocrmypdf -v1 --pdf-renderer hocr --output-type pdf -O2 --jbig2-lossy --skip-text --sidecar bid.txt bid.pdf bid_.pdf

Files

bid.pdf

How did you download and install the software?

MacPorts

OCRmyPDF version

ocrmypdf 16.2.0

Relevant log output

ocrmypdf 16.2.0
Running: ['tesseract', '--version']
Found tesseract 5.3.3
Running: ['tesseract', '--version']
Running: ['pngquant', '--version']
Found pngquant 3.0.3
Running: ['jbig2', '--version']
Found jbig2 0.28
Running: ['gs', '--version']
Found gs 10.3.0
Running: ['gs', '--version']
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages in "/opt/local/share/tessdata/" (4):
deu
eng
fra
osd

pikepdf mmap enabled
os.symlink(bid.pdf, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.w6jubuga/origin)
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.w6jubuga/origin, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.w6jubuga/origin.pdf)
Gathering info with 1 thread workers
pikepdf mmap enabled

Using Tesseract OpenMP thread limit 1
Start processing 12 pages concurrently
pikepdf mmap enabled
pikepdf mmap enabled
pikepdf mmap enabled
    1 skipping all processing on this page
pikepdf mmap enabled
pikepdf mmap enabled
    2 skipping all processing on this page
pikepdf mmap enabled
pikepdf mmap enabled
    3 skipping all processing on this page
pikepdf mmap enabled
pikepdf mmap enabled
    4 skipping all processing on this page
pikepdf mmap enabled
pikepdf mmap enabled
    5 skipping all processing on this page
pikepdf mmap enabled
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    6 skipping all processing on this page
    7 skipping all processing on this page
    8 skipping all processing on this page
    9 skipping all processing on this page
   10 skipping all processing on this page
   11 skipping all processing on this page
   12 skipping all processing on this page
   13 skipping all processing on this page
   14 skipping all processing on this page
   15 skipping all processing on this page
   16 skipping all processing on this page
   17 skipping all processing on this page
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
   18 skipping all processing on this page
    2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    2 Page rotation: (content, auto) -> page = (0, 0) -> 0
    3 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    3 Page rotation: (content, auto) -> page = (0, 0) -> 0
    4 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    4 Page rotation: (content, auto) -> page = (0, 0) -> 0
    5 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    5 Page rotation: (content, auto) -> page = (0, 0) -> 0
    6 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    6 Page rotation: (content, auto) -> page = (0, 0) -> 0
    7 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    7 Page rotation: (content, auto) -> page = (0, 0) -> 0
    8 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    8 Page rotation: (content, auto) -> page = (0, 0) -> 0
    9 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    9 Page rotation: (content, auto) -> page = (0, 0) -> 0
   10 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   10 Page rotation: (content, auto) -> page = (0, 0) -> 0
   11 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   11 Page rotation: (content, auto) -> page = (0, 0) -> 0
   12 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   12 Page rotation: (content, auto) -> page = (0, 0) -> 0
   13 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   13 Page rotation: (content, auto) -> page = (0, 0) -> 0
   14 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   14 Page rotation: (content, auto) -> page = (0, 0) -> 0
   15 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   15 Page rotation: (content, auto) -> page = (0, 0) -> 0
   16 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   16 Page rotation: (content, auto) -> page = (0, 0) -> 0
   17 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   17 Page rotation: (content, auto) -> page = (0, 0) -> 0
   18 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   18 Page rotation: (content, auto) -> page = (0, 0) -> 0

/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.w6jubuga/sidecar.txt -> bid.txt
Postprocessing...
Running: ['tesseract', '--version']
xref 200: treating as an optimization candidate
xref 199: treating as an optimization candidate
xref 197: treating as an optimization candidate
xref 198: treating as an optimization candidate
xref 204: treating as an optimization candidate
xref 214: treating as an optimization candidate
xref 218: treating as an optimization candidate
xref 211: treating as an optimization candidate
xref 213: treating as an optimization candidate
xref 215: treating as an optimization candidate
xref 221: treating as an optimization candidate
xref 207: treating as an optimization candidate
xref 206: treating as an optimization candidate
xref 209: treating as an optimization candidate
xref 210: treating as an optimization candidate
xref 217: treating as an optimization candidate
xref 208: treating as an optimization candidate
xref 219: treating as an optimization candidate
xref 220: treating as an optimization candidate
xref 223: treating as an optimization candidate
xref 222: treating as an optimization candidate
xref 212: treating as an optimization candidate
xref 216: treating as an optimization candidate
xref 197: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
XrefExt(xref=197, ext='.jpg')
xref 199: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
XrefExt(xref=199, ext='.jpg')
xref 200: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
XrefExt(xref=200, ext='.jpg')
xref 204: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 204: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 201, in extract_image_generic
    ext = pim.extract_to(stream=f)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 709, in extract_to
    return self._extract_to_stream(stream=stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 655, in _extract_to_stream
    im = self._extract_transcoded()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 610, in _extract_transcoded
    raise HifiPrintImageNotTranscodableError()
pikepdf.models.image.HifiPrintImageNotTranscodableError
xref 213: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 215, in extract_image_generic
    elif not pim.indexed and pim.colorspace in pim.SIMPLE_COLORSPACES:
                             ^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 210, in colorspace
    raise NotImplementedError(
NotImplementedError: not sure how to get colorspace: ['/Separation', '/Black', '/DeviceCMYK', <pikepdf.Stream(owner=<...>, data=b'\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04'..., {
  "/BitsPerSample": 8,
  "/Decode": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Domain": [ 0, 1 ],
  "/Encode": [ 0, 254 ],
  "/Filter": "/FlateDecode",
  "/FunctionType": 0,
  "/Length": 395,
  "/Order": 1,
  "/Range": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Size": [ 255 ]
})>]
xref 216: skipping image with small stream size
xref 217: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 215, in extract_image_generic
    elif not pim.indexed and pim.colorspace in pim.SIMPLE_COLORSPACES:
                             ^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 210, in colorspace
    raise NotImplementedError(
NotImplementedError: not sure how to get colorspace: ['/Separation', '/Black', '/DeviceCMYK', <pikepdf.Stream(owner=<...>, data=b'\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04'..., {
  "/BitsPerSample": 8,
  "/Decode": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Domain": [ 0, 1 ],
  "/Encode": [ 0, 254 ],
  "/Filter": "/FlateDecode",
  "/FunctionType": 0,
  "/Length": 395,
  "/Order": 1,
  "/Range": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Size": [ 255 ]
})>]
xref 219: skipping image with small stream size
xref 220: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 220: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 201, in extract_image_generic
    ext = pim.extract_to(stream=f)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 709, in extract_to
    return self._extract_to_stream(stream=stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 655, in _extract_to_stream
    im = self._extract_transcoded()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 610, in _extract_transcoded
    raise HifiPrintImageNotTranscodableError()
pikepdf.models.image.HifiPrintImageNotTranscodableError
xref 221: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 221: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 201, in extract_image_generic
    ext = pim.extract_to(stream=f)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 709, in extract_to
    return self._extract_to_stream(stream=stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 655, in _extract_to_stream
    im = self._extract_transcoded()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 610, in _extract_transcoded
    raise HifiPrintImageNotTranscodableError()
pikepdf.models.image.HifiPrintImageNotTranscodableError
xref 222: skipping image with small stream size
xref 223: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 215, in extract_image_generic
    elif not pim.indexed and pim.colorspace in pim.SIMPLE_COLORSPACES:
                             ^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 210, in colorspace
    raise NotImplementedError(
NotImplementedError: not sure how to get colorspace: ['/Separation', '/Black', '/DeviceCMYK', <pikepdf.Stream(owner=<...>, data=b'\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04'..., {
  "/BitsPerSample": 8,
  "/Decode": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Domain": [ 0, 1 ],
  "/Encode": [ 0, 254 ],
  "/Filter": "/FlateDecode",
  "/FunctionType": 0,
  "/Length": 395,
  "/Order": 1,
  "/Range": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Size": [ 255 ]
})>]
Optimizable images: JPEGs: 3 PNGs: 0

xref 200: treating as an optimization candidate
xref 199: treating as an optimization candidate
xref 197: treating as an optimization candidate
xref 198: treating as an optimization candidate
xref 204: treating as an optimization candidate
xref 214: treating as an optimization candidate
xref 218: treating as an optimization candidate
xref 211: treating as an optimization candidate
xref 213: treating as an optimization candidate
xref 215: treating as an optimization candidate
xref 221: treating as an optimization candidate
xref 207: treating as an optimization candidate
xref 206: treating as an optimization candidate
xref 209: treating as an optimization candidate
xref 210: treating as an optimization candidate
xref 217: treating as an optimization candidate
xref 208: treating as an optimization candidate
xref 219: treating as an optimization candidate
xref 220: treating as an optimization candidate
xref 223: treating as an optimization candidate
xref 222: treating as an optimization candidate
xref 212: treating as an optimization candidate
xref 216: treating as an optimization candidate
xref 197: marking this JPEG as deflatable
xref 199: marking this JPEG as deflatable
xref 200: marking this JPEG as deflatable
xref 204: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 204: marking this JPEG as deflatable
xref 216: skipping image with small stream size
xref 219: skipping image with small stream size
xref 220: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 220: marking this JPEG as deflatable
xref 221: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 221: marking this JPEG as deflatable
xref 222: skipping image with small stream size


xref 200: treating as an optimization candidate
xref 199: treating as an optimization candidate
xref 197: treating as an optimization candidate
xref 198: treating as an optimization candidate
xref 204: treating as an optimization candidate
xref 214: treating as an optimization candidate
xref 218: treating as an optimization candidate
xref 211: treating as an optimization candidate
xref 213: treating as an optimization candidate
xref 215: treating as an optimization candidate
xref 221: treating as an optimization candidate
xref 207: treating as an optimization candidate
xref 206: treating as an optimization candidate
xref 209: treating as an optimization candidate
xref 210: treating as an optimization candidate
xref 217: treating as an optimization candidate
xref 208: treating as an optimization candidate
xref 219: treating as an optimization candidate
xref 220: treating as an optimization candidate
xref 223: treating as an optimization candidate
xref 222: treating as an optimization candidate
xref 212: treating as an optimization candidate
xref 216: treating as an optimization candidate
xref 197: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 199: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 200: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 204: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 216: skipping image with small stream size
xref 219: skipping image with small stream size
xref 220: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 221: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 222: skipping image with small stream size
Optimizable images: JBIG2 groups: 0

Image optimization did not improve the file - optimizations will not be used
Running: ['jbig2', '--version']
Running: ['pngquant', '--version']
Image optimization ratio: 1.00 savings: 0.0%
Total file size ratio: 1.05 savings: 4.9%
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.w6jubuga/optimize.pdf -> bid_.pdf
Corrupt JPEG data: 1 extraneous bytes before marker 0xd9
@jbarlow83
Collaborator

Most of these errors are harmless. They mainly say that a particular image cannot be optimized because it is defined in terms of production printing (e.g. CMYK+) rather than RGB. Of course, it would be cleaner to log this fact instead of logging an exception. I will have to make that change.

The error message at the end
Corrupt JPEG data: 1 extraneous bytes before marker 0xd9
suggests that there is some corruption in the PDF - I'd check it with a viewer to ensure all images look fine visually.
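
For illustration, the idea of logging the fact instead of the exception could be sketched roughly as below. This is not OCRmyPDF's actual code, nor the change that was later committed. `FakeImage` and `simple_colorspace_or_none` are hypothetical names, the stand-in class only mimics the `NotImplementedError` seen in the log above, and the contents of `SIMPLE_COLORSPACES` here are an assumption for the example.

```python
import logging

log = logging.getLogger("optimize_sketch")

# Assumed set of "simple" colorspaces, for illustration only.
SIMPLE_COLORSPACES = {"/DeviceRGB", "/DeviceGray", "/CalRGB", "/CalGray"}


class FakeImage:
    """Hypothetical stand-in for an image wrapper whose .colorspace may raise."""

    def __init__(self, raw_colorspace):
        self.raw_colorspace = raw_colorspace

    @property
    def colorspace(self):
        # A plain name is returned as-is; an array form (e.g. /Separation)
        # raises, mimicking the error in the log above.
        if isinstance(self.raw_colorspace, str):
            return self.raw_colorspace
        raise NotImplementedError(
            f"not sure how to get colorspace: {self.raw_colorspace!r}"
        )


def simple_colorspace_or_none(xref, image):
    """Return the colorspace if it is simple; otherwise log one line and return None."""
    try:
        cs = image.colorspace
    except NotImplementedError:
        # One short log line rather than a full traceback.
        log.info("xref %s: skipping image with unsupported colorspace", xref)
        return None
    return cs if cs in SIMPLE_COLORSPACES else None
```

With something like this, a `/Separation` image would produce a single "skipping" line per xref, in the same style as the existing "skipping image with small stream size" messages.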

@user1823

I also got a similar error (actually, the same error thousands of times in the same PDF):

xref 12157: While extracting this image, an error occurred                                               optimize.py:327
Traceback (most recent call last):
  File "C:\Program Files\Python312\Lib\site-packages\ocrmypdf\optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\ocrmypdf\optimize.py", line 215, in extract_image_generic
    elif not pim.indexed and pim.colorspace in pim.SIMPLE_COLORSPACES:
                             ^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pikepdf\models\image.py", line 210, in colorspace
    raise NotImplementedError(
NotImplementedError: not sure how to get colorspace: ['/Separation', '/Black', '/DeviceCMYK', pikepdf.Dictionary({
  "/C0": [ 0, 0, 0, 0 ],
  "/C1": [ 0, 0, 0, 1 ],
  "/Domain": [ 0, 1 ],
  "/FunctionType": 2,
  "/N": 1,
  "/Range": [ 0, 1, 0, 1, 0, 1, 0, 1 ]
})]

Glad to hear that it is harmless. Hoping for a change to make this less scary.
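
For anyone curious what the colorspace array in these tracebacks describes: `/Separation` `/Black` with alternate `/DeviceCMYK` is a single ink channel plus a tint function that maps a tint value to CMYK. The `/FunctionType 2` dictionary above is an exponential interpolation, which the PDF specification defines per output component as C0 + x^N × (C1 − C0). A minimal sketch evaluating it with the values from the dictionary above (`eval_type2` is a hypothetical helper name, not a pikepdf function):

```python
def eval_type2(x, c0, c1, n, domain=(0.0, 1.0)):
    """Evaluate a PDF Type 2 (exponential interpolation) function at x."""
    lo, hi = domain
    x = min(max(x, lo), hi)  # clip the input to /Domain
    # Per component: C0 + x^N * (C1 - C0)
    return [a + (x ** n) * (b - a) for a, b in zip(c0, c1)]


# Values copied from the dictionary in the traceback above.
C0 = [0, 0, 0, 0]
C1 = [0, 0, 0, 1]
N = 1

# A 50% tint of the /Black separation, expressed as C, M, Y, K.
print(eval_type2(0.5, C0, C1, N))
```

So the tint maps linearly onto the K channel only, which is consistent with the explanation that these images are defined for print production rather than RGB.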

jbarlow83 added a commit that referenced this issue Jun 30, 2024
Fixes [Bug]: NotImplementedError: not sure how to get colorspace #1315