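One day my girlfriend asked me whether a PDF could be converted to Word. I looked around and found that most online converters charge for it, but I was lucky enough to find a way to use Baidu OCR for free, so I am writing it down here.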
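Install Baidu's Python package, baidu-aip (pip install baidu-aip). The script also imports pdf2image and needs a local poppler installation.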
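Before running the script it can help to confirm that the packages are actually importable. The following is only a minimal sketch: the helper name missing_modules is made up for illustration, and it assumes the import names are aip (for baidu-aip) and pdf2image, as used in the script's import lines.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == '__main__':
    # 'aip' is the import name used for the baidu-aip package.
    missing = missing_modules(['aip', 'pdf2image'])
    if missing:
        print('Missing modules:', ', '.join(missing))
    else:
        print('All required modules are importable.')
```

This only checks the Python side; poppler is a separate program and still has to be installed by hand.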
The code is as follows:
from pdf2image import convert_from_path
from aip import AipOcr
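import os
import sys

# Work from the script's own directory so relative paths resolve next to it.
os.chdir(sys.path[0])
# Credentials of the application created in the Baidu AI console.
APP_ID = '****'
API_KEY = '****'
SECRET_KEY = '****'
client = AipOcr(APP_ID, API_KEY, SECRET_KEY)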
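def baidu_ocr(fname):
    # Render each PDF page to a PNG inside a folder named after the PDF.
    dirname = fname.rsplit('.', 1)[0]
    if not os.path.exists(dirname):
        os.mkdir(dirname)
    # poppler_path must point at the local poppler "bin" directory.
    convert_from_path(fname, fmt='png', output_folder=dirname, poppler_path=r'D:\poppler-0.68.0\bin')
    # Sort by file name so the pages are processed in a stable order.
    images = sorted(n for n in os.listdir(dirname) if n.lower().endswith('.png'))
    with open('result.txt', 'w', encoding='utf-8') as f:
        for img in images:
            print(img)
            # Read the page image from disk as raw bytes.
            with open(os.path.join(dirname, img), 'rb') as fimg:
                data = fimg.read()
            msg = client.basicAccurate(data)
            # 'words_result' is missing when the call fails, so default to an empty list.
            for i in msg.get('words_result', []):
                f.write('{}\n'.format(i.get('words')))
            # A form feed marks the end of each page.
            f.write('\f\n')
    print("write done")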
if __name__ == '__main__':
    baidu_ocr('2.pdf')
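One thing the script does not do is report a failed OCR call. As far as I can tell, a failed request returns a dict carrying error_code and error_msg instead of words_result, but treat that as an assumption and check it against a real response. The helper name extract_lines is hypothetical; this is just a sketch of how the response could be checked before writing it out.

```python
def extract_lines(msg):
    """Return the recognized text lines from one OCR response dict.

    Assumption: a failed call carries 'error_code' and 'error_msg'
    keys instead of 'words_result'.
    """
    if 'error_code' in msg:
        raise RuntimeError('OCR failed: {} {}'.format(
            msg.get('error_code'), msg.get('error_msg')))
    return [item.get('words', '') for item in msg.get('words_result', [])]


if __name__ == '__main__':
    # Hand-written sample dicts standing in for real API responses.
    ok = {'words_result': [{'words': 'first line'}, {'words': 'second line'}]}
    print(extract_lines(ok))
    try:
        extract_lines({'error_code': 1, 'error_msg': 'sample error'})
    except RuntimeError as exc:
        print(exc)
```

Inside baidu_ocr, the loop over msg.get('words_result') could call this helper instead, so a failed page stops the run rather than silently producing an empty page.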
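Put the PDF in the same directory as the script, change the file name in the code to match, and wait for the conversion to finish. The final result is written to result.txt.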
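Since the script writes a form feed ('\f') after every page, result.txt can be cut back into pages afterwards. Here is a small sketch of that; split_pages is a name I made up, and the sample string only imitates what the script writes.

```python
def split_pages(text):
    """Split OCR output into pages on the form-feed separator."""
    pages = [p.strip('\n') for p in text.split('\f')]
    # Writing '\f\n' after every page leaves an empty trailing chunk; drop it.
    if pages and pages[-1] == '':
        pages.pop()
    return pages


if __name__ == '__main__':
    # Imitates the layout of result.txt: lines, then '\f\n' after each page.
    sample = 'page one line 1\npage one line 2\n\f\npage two\n\f\n'
    for number, page in enumerate(split_pages(sample), start=1):
        print('--- page {} ---'.format(number))
        print(page)
```

To use it on the real output, read result.txt with encoding='utf-8' and pass the contents to split_pages; each page can then be pasted into Word separately.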