
How to Extract Images from a PDF

Jorj X. McKie edited this page Jun 1, 2018 · 17 revisions

You can extract all images from a PDF and save them as PNG files, page by page, with this short script. Images with a CMYK colorspace are converted to RGB first.

import fitz  # PyMuPDF

doc = fitz.open("file.pdf")
for i in range(len(doc)):                   # loop over page numbers
    for img in doc.getPageImageList(i):     # images referenced by page i
        xref = img[0]                       # xref number of the image object
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:           # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:                               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None                          # release the pixmap's memory

This runs very fast: extracting the 180 images of Adobe's manual takes less than 2 seconds on a 4.0 GHz desktop PC. That PDF has 1,310 pages, is over 30 MB in size and contains more than 330,000 PDF objects.

A more advanced version of the script is also contained in the demo directory. The major difference is its complete support for images containing masks.

Notes

  1. The script relies on the PDF's structural health. For example, it will not work if the document's page tree is damaged.
  2. If an image is referenced by multiple pages, it will be extracted more than once. Use hashlib to check whether a pixmap has already been written (e.g. via the MD5 digest of pix.samples).
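
The duplicate check from note 2 could be sketched as follows. The helper name `is_new_image` and the `seen_digests` set are hypothetical, introduced only for illustration. The function works on any bytes object, so it can be tried without a PDF at hand.

```python
import hashlib


def is_new_image(samples, seen_digests):
    """Return True the first time a given pixel buffer is seen.

    samples      -- bytes of the image, e.g. pix.samples
    seen_digests -- a set that collects the MD5 digests seen so far
    (Both names are illustrative and not part of PyMuPDF.)
    """
    digest = hashlib.md5(samples).hexdigest()
    if digest in seen_digests:
        return False            # same pixel data was already written
    seen_digests.add(digest)    # remember it for later pages
    return True
```

In the script above, one would create `seen = set()` before the page loop and call `pix.writePNG(...)` only when `is_new_image(pix.samples, seen)` returns True. Note that this compares pixel data only, so two images with identical samples but different dimensions would be treated as the same image.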

There is another image extractor in the demo directory, which scans all PDF objects (ignoring pages). It extracts each image only once and recovers from many PDF structure problems. It also contains logic to skip "insignificant" images (e.g. those that are too small or consist of a single color).
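
A filter for "insignificant" images might look roughly like the sketch below. It is not the logic of the demo script: the function name, the `min_side` threshold and its default value are assumptions chosen for illustration. It also assumes that `samples` holds `n` bytes per pixel with no row padding.

```python
def is_significant(width, height, samples, n, min_side=100):
    """Heuristic check whether an image is worth extracting.

    width, height -- image dimensions in pixels
    samples       -- raw pixel bytes, e.g. pix.samples
    n             -- bytes per pixel, e.g. pix.n
    min_side      -- hypothetical threshold: shortest side in pixels
    """
    if min(width, height) < min_side:
        return False                      # too small
    first_pixel = samples[:n]
    if samples == first_pixel * (len(samples) // n):
        return False                      # every pixel equals the first one
    return True
```

With a pixmap `pix`, a call could read `is_significant(pix.width, pix.height, pix.samples, pix.n)`. The threshold should be tuned to the documents being processed.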

Other ways to extract images include:

  • Extract images from the pages of any supported document type using Page.getText("dict"): extract-img4.py
  • Extract images from PDF documents in their original image format (e.g. TIFF, JPEG, BMP): extract-img3.py
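
As a rough illustration of the Page.getText("dict") approach, the sketch below picks the image blocks out of such a dictionary. It is not taken from extract-img4.py. It assumes the layout described in the PyMuPDF documentation: a "blocks" list in which image blocks have "type" equal to 1 and carry the file extension under "ext" and the binary content under "image". The helper name `iter_image_blocks` is hypothetical.

```python
def iter_image_blocks(page_dict):
    """Yield (extension, image_bytes) for every image block.

    page_dict -- the dictionary returned by Page.getText("dict");
                 the key names used here are assumed from the
                 PyMuPDF documentation.
    """
    for block in page_dict.get("blocks", []):
        if block.get("type") == 1:        # 1 = image block, 0 = text block
            yield block["ext"], block["image"]
```

For a page object `page`, one could loop over `iter_image_blocks(page.getText("dict"))` and write each `image_bytes` value to a file opened in binary mode, using the yielded extension as the file suffix.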