Converting a PDF to txt with Baidu OCR (Python)

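One day my girlfriend asked me whether a PDF could be converted to Word. I looked around and found plenty of paid online converters, but I was lucky enough to find a way to do it for free with Baidu OCR, so I am writing it down here.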

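First, apply for the free trial of Baidu OCR and create an application. The application provides the APP_ID, API_KEY, and SECRET_KEY values used in the code.
[Image 1: creating the application in the Baidu console]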

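Install the Baidu package for Python, baidu-aip (for example with pip install baidu-aip). The script below also imports pdf2image, which needs a local Poppler installation.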

The code is as follows:

from pdf2image import convert_from_path
from aip import AipOcr
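import os
import sys

# Work from the directory that contains this script, so relative paths resolve there.
os.chdir(sys.path[0])

# Credentials of the application created in the Baidu console.
APP_ID = '****'
API_KEY = '****'
SECRET_KEY = '****'
client = AipOcr(APP_ID, API_KEY, SECRET_KEY)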

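def baidu_ocr(fname):
    # Render each PDF page to a PNG inside a folder named after the PDF.
    dirname = fname.rsplit('.', 1)[0]
    if not os.path.exists(dirname):
        os.mkdir(dirname)
    # poppler_path must point at the bin directory of the local Poppler installation.
    convert_from_path(fname, fmt='png', output_folder=dirname, poppler_path=r'D:\poppler-0.68.0\bin')

    # os.listdir gives no guaranteed order, so sort the file names before processing.
    images = sorted(os.listdir(dirname))

    with open('result.txt', 'w', encoding='utf-8') as f:
        for name in images:
            print(name)
            # Read the page image as raw bytes, which is what the OCR client expects.
            with open(os.path.join(dirname, name), 'rb') as fimg:
                data = fimg.read()
            msg = client.basicAccurate(data)
            # A failed request has no 'words_result' key, so fall back to an empty list.
            for item in msg.get('words_result', []):
                f.write('{}\n'.format(item.get('words')))
            # A form feed marks the end of each page.
            f.write('\f\n')
    print("write done")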

if __name__ == '__main__':
    baidu_ocr('2.pdf')
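The script above assumes that every OCR request succeeds. When a request fails, the response dictionary carries `error_code` and `error_msg` instead of `words_result`. The following is only a minimal sketch of how the two cases could be separated; the helper name `extract_lines` is hypothetical and not part of baidu-aip, and the sample dictionary is made up to mimic the response shape the script reads.

```python
def extract_lines(response):
    """Return the recognized text lines from one OCR response dictionary.

    Hypothetical helper, not part of baidu-aip. Raises RuntimeError when the
    response carries an error instead of results.
    """
    if 'error_code' in response:
        raise RuntimeError('OCR failed: {} {}'.format(
            response.get('error_code'), response.get('error_msg')))
    # Each entry of 'words_result' is a dictionary with a 'words' field.
    return [item.get('words', '') for item in response.get('words_result', [])]


# Made-up sample that mimics the shape of a successful response.
ok = {'words_result': [{'words': 'first line'}, {'words': 'second line'}]}
print(extract_lines(ok))
```

Inside baidu_ocr, the loop over msg.get('words_result') could call such a helper, so that a quota or authentication problem stops the run with a readable message instead of silently producing an empty page.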

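Put the PDF in the same directory as the script, change the file name in the script to match, and wait for the conversion to finish. The result is written to result.txt.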
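Because the script writes a form feed line after each page, result.txt can be split back into pages afterwards. A small sketch, assuming the file was produced by the script as shown; `split_pages` is a hypothetical helper name and the sample string is made up:

```python
def split_pages(text):
    """Split OCR output into pages on the form feed line written after each page."""
    pages = text.split('\f\n')
    # A form feed also follows the last page, so drop the empty tail.
    if pages and pages[-1] == '':
        pages.pop()
    return pages


# Made-up sample in the same layout as result.txt: two pages.
sample = 'page one line\n\f\nline a\nline b\n\f\n'
pages = split_pages(sample)
print(len(pages))
```

With the real output, the text could be read with open('result.txt', encoding='utf-8') and passed to the function, for example to check a single page against the original PDF.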
