[Python] Converting a PDF to Text with Python OCR #15

Open
yangruihan opened this issue Oct 10, 2020 · 0 comments
Converting a PDF to Text with Python OCR

Test environment

OS: macOS 10.14.6
Python: 3.8.5

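Prerequisites

  • Install tesseract: brew install tesseract

  • Install the extra Tesseract language data (needed for Chinese): brew install tesseract-lang

  • Install poppler: brew install poppler

  • Install pytesseract: pip3 install pytesseract

  • Install pdf2image: pip3 install pdf2image

  • Install numpy: pip3 install numpy

  • Install pillow: pip3 install pillow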
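Before running the OCR script, it can help to confirm that everything in the list above is actually installed. The following is a minimal sketch, not part of the original issue: `find_missing` is a hypothetical helper, and it assumes that poppler provides the `pdftoppm` binary and that pillow is imported as `PIL`.

```python
import importlib.util
import shutil

# Command-line tools: tesseract itself, and pdftoppm (installed by poppler).
REQUIRED_BINARIES = ("tesseract", "pdftoppm")
# Python packages; pillow is imported under the name PIL.
REQUIRED_MODULES = ("pytesseract", "pdf2image", "numpy", "PIL")


def find_missing(binaries=REQUIRED_BINARIES, modules=REQUIRED_MODULES):
    """Return the names of binaries and modules that cannot be found."""
    missing = [name for name in binaries if shutil.which(name) is None]
    missing += [name for name in modules
                if importlib.util.find_spec(name) is None]
    return missing


if __name__ == "__main__":
    missing = find_missing()
    print("missing:", ", ".join(missing) if missing else "nothing")
```

An empty result means the binaries are on PATH and the packages are importable; it does not check which Tesseract language packs are present.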

Code

import numpy as np
import pytesseract
from pdf2image import convert_from_path
import time

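def pdf_ocr(fname, **kwargs):
    """
    Convert a PDF to text via OCR.
    fname: path to the PDF (string)
    kwargs: extra arguments passed through to convert_from_path
    """

    # Render each PDF page as an image
    images = convert_from_path(fname, **kwargs)

    # Accumulates the recognized text
    text = ''

    images_cnt = len(images)
    sum_time = 0

    for i, img in enumerate(images):
        # Time the recognition of each page
        print(f'start {i + 1} / {images_cnt}...')
        start_time = time.time()

        img = np.array(img)

        # Recognize the text in the image. Tesseract has no 'chi' language;
        # 'chi_sim' is Simplified Chinese (use 'chi_tra' for Traditional).
        text += pytesseract.image_to_string(img, lang='eng+chi_sim')

        # Print the time taken for this page
        end_time = time.time()
        print(f'done {i + 1} / {images_cnt} use time: {end_time - start_time}\n')
        sum_time += end_time - start_time

    print(f'sum use time: {sum_time}')
    return text

fname = 'test.pdf'
text = pdf_ocr(fname)

# Write the result to a file (UTF-8 so Chinese text is saved correctly)
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(text)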
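The script above renders every page into memory before OCR starts, which can get heavy for long PDFs. One possible variation, sketched here under assumptions rather than taken from the original issue, is to convert the PDF a few pages at a time using pdf2image's `first_page` and `last_page` arguments. The names `page_batches` and `pdf_ocr_batched` are hypothetical, the batch size of 10 is arbitrary, and `chi_sim` is assumed to be the language wanted for Chinese.

```python
def page_batches(total_pages, batch_size):
    """Yield 1-indexed, inclusive (first_page, last_page) ranges."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def pdf_ocr_batched(fname, batch_size=10, lang='eng+chi_sim', **kwargs):
    """OCR a PDF batch by batch so only a few page images exist at once."""
    # Imported here so page_batches stays usable without the OCR stack.
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    # pdfinfo_from_path returns a dict; "Pages" holds the page count.
    total_pages = pdfinfo_from_path(fname)["Pages"]

    parts = []
    for first, last in page_batches(total_pages, batch_size):
        images = convert_from_path(
            fname, first_page=first, last_page=last, **kwargs)
        for img in images:
            parts.append(pytesseract.image_to_string(img, lang=lang))
    return ''.join(parts)
```

A call such as `pdf_ocr_batched('test.pdf', batch_size=5, dpi=300)` would pass `dpi` through to `convert_from_path`; a higher DPI tends to help recognition at the cost of speed and memory.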
yangruihan changed the title from "Converting a PDF to Text with Python OCR" to "[Python] Converting a PDF to Text with Python OCR" on Dec 31, 2020