Converting a PDF to text with Python OCR
Test platform
System: macOS 10.14.6
Python: Python 3.8.5
Preparation
Install tesseract:
brew install tesseract
Install poppler:
brew install poppler
Install pytesseract:
pip3 install pytesseract
Install pdf2image:
pip3 install pdf2image
Install numpy:
pip3 install numpy
Install pillow:
pip3 install pillow
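Before running any OCR it can help to confirm that everything installed above is actually visible to Python. The following is a minimal sketch, not part of the original post: the function name `check_ocr_environment` is hypothetical, and it assumes that `pdftoppm` (a command-line tool shipped with poppler) is a reasonable stand-in for "poppler is installed". It uses only the standard library.

```python
import importlib.util
import shutil


def check_ocr_environment():
    """Report which of the tools and packages listed above can be found."""
    # Command-line tools: tesseract itself, and pdftoppm, which ships with poppler
    binaries = ["tesseract", "pdftoppm"]
    # Import names of the Python packages (pillow is imported as PIL)
    modules = ["pytesseract", "pdf2image", "numpy", "PIL"]

    status = {}
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        status[name] = shutil.which(name) is not None
    for name in modules:
        # find_spec returns None when the top-level package cannot be imported
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_ocr_environment().items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

Any entry reported as missing points back to the corresponding `brew install` or `pip3 install` step.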
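Code
```python
import time

import numpy as np
import pytesseract
from pdf2image import convert_from_path


def pdf_ocr(fname, **kwargs):
    """
    Convert a PDF to text via OCR.

    fname: path to the PDF (string)
    kwargs: extra arguments passed through to convert_from_path
    """
    # Render each PDF page as an image
    images = convert_from_path(fname, **kwargs)
    # The recognized text accumulates in this variable
    text = ''
    images_cnt = len(images)
    sum_time = 0
    for i, img in enumerate(images):
        # Time the recognition of each page
        print(f'start {i + 1} / {images_cnt}...')
        start_time = time.time()
        img = np.array(img)
        # Recognize the text in the image.
        # 'chi' is not a Tesseract language code; 'chi_sim' (Simplified Chinese)
        # is assumed here, use 'chi_tra' for Traditional Chinese. The language
        # data must be installed separately (with Homebrew, typically the
        # tesseract-lang formula).
        text += pytesseract.image_to_string(img, lang='eng+chi_sim')
        # Print the time taken for this page
        end_time = time.time()
        print(f'done {i + 1} / {images_cnt} use time: {end_time - start_time}\n')
        sum_time += end_time - start_time
    print(f'sum use time: {sum_time}')
    return text


fname = 'test.pdf'
text = pdf_ocr(fname)
# Write the result to a file; UTF-8 so Chinese text is saved correctly
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```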
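The script above renders every page into memory before recognition starts, which can be heavy for long PDFs. One possible variation, sketched here under assumptions, is to render a few pages at a time using the `first_page` and `last_page` arguments of `convert_from_path`. The names `page_batches` and `pdf_ocr_batched`, the default `batch_size`, and the default `dpi` are illustrative choices, not part of the original post.

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def pdf_ocr_batched(fname, batch_size=10, lang="eng", dpi=200):
    """Hypothetical variant of pdf_ocr that renders a few pages at a time."""
    # Imported lazily so page_batches can be used without the OCR stack
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    # pdfinfo_from_path returns a dict of PDF metadata, including the page count
    total_pages = pdfinfo_from_path(fname)["Pages"]
    parts = []
    for first, last in page_batches(total_pages, batch_size):
        # Only this slice of pages is rendered and held in memory
        images = convert_from_path(fname, dpi=dpi, first_page=first, last_page=last)
        for img in images:
            parts.append(pytesseract.image_to_string(img, lang=lang))
    return "".join(parts)


print(list(page_batches(25, 10)))  # [(1, 10), (11, 20), (21, 25)]
```

Collecting the per-page strings in a list and joining once at the end also avoids repeated string concatenation, though for typical documents the OCR time dominates either way.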