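Building on the previous article, "Taking Photos with a Laptop Camera and Testing PaddleOCR with PyQt5", I made a simple extension:
converting PDF documents to DOC (.docx) documents.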
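II. Source Code Changes
1. Copy paddleocr.py directly from the PaddleOCR-release-2.6 source downloaded from GitHub, but comment out the def main(): block.
2. After modifying and saving the UI, go back to the PyCharm project, right-click ocr_camera.ui, find "External Tools", and choose PyUIC to regenerate ocr_camera.py.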
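3. Changes to the main function:
3.1. Connect the clicked signal of the new PDF-to-DOC button to a slot.
3.2. The slot function is defined as follows. It opens a file dialog to choose a PDF file and then runs the conversion.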
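    def pdfRecognition(self, image_dir):
        # image_dir is not used here; the PDF path comes from the file dialog.
        fname, _ = QFileDialog.getOpenFileName(self, 'Select PDF file', './', 'PDF files (*.pdf *.PDF)')
        if not fname:
            # The dialog was cancelled, so there is nothing to convert.
            return
        self.showtext.append("loadPDF {}".format(fname))
        self.PDFmain(fname)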
    def PDFmain(self, image_dir):
        # Adapted from main() in paddleocr.py. Assumes os, cv2, paddleocr,
        # get_image_file_list and logger are already imported in this module.
        args = paddleocr.parse_args(mMain=True)
        args.use_pdf2docx_api = True
        args.recovery = True
        args.output = 'pdf'  # output directory, relative to the working directory
        print("{}".format(image_dir))
        if paddleocr.is_link(image_dir):
            paddleocr.download_with_progressbar(image_dir, 'tmp.jpg')
            image_file_list = ['tmp.jpg']
        else:
            image_file_list = get_image_file_list(image_dir)
        if len(image_file_list) == 0:
            logger.error('no images find in {}'.format(image_dir))
            self.showtext.append('no images find in {}'.format(image_dir))
            return
        engine = paddleocr.PPStructure()
        for img_path in image_file_list:
            img_name = os.path.basename(img_path).split('.')[0]
            logger.info('{}{}{}'.format('*' * 10, img_path, '*' * 10))
            img, flag_gif, flag_pdf = paddleocr.check_and_read(img_path)
            if not flag_gif and not flag_pdf:
                img = cv2.imread(img_path)

            # PDF input: let pdf2docx convert the file directly and skip OCR.
            if args.recovery and args.use_pdf2docx_api and flag_pdf:
                from pdf2docx.converter import Converter
                # Make sure the output directory exists before writing into it.
                os.makedirs(args.output, exist_ok=True)
                docx_file = os.path.join(args.output,
                                         '{}.docx'.format(img_name))
                cv = Converter(img_path)
                cv.convert(docx_file)
                cv.close()
                logger.info('docx save to {}'.format(docx_file))
                self.showtext.append('docx save to {}'.format(docx_file))
                continue

            if not flag_pdf:
                if img is None:
                    logger.error("error in loading image:{}".format(img_path))
                    continue
                img_paths = [[img_path, img]]
            else:
                # Save every PDF page as a JPEG so it can be processed page by page.
                img_paths = []
                for index, pdf_img in enumerate(img):
                    os.makedirs(
                        os.path.join(args.output, img_name), exist_ok=True)
                    pdf_img_path = os.path.join(
                        args.output, img_name,
                        img_name + '_' + str(index) + '.jpg')
                    cv2.imwrite(pdf_img_path, pdf_img)
                    img_paths.append([pdf_img_path, pdf_img])

            all_res = []
            for index, (new_img_path, img) in enumerate(img_paths):
                logger.info('processing {}/{} page:'.format(index + 1,
                                                            len(img_paths)))
                self.showtext.append('processing {}/{} page:'.format(index + 1,
                                                                     len(img_paths)))
                new_img_name = os.path.basename(new_img_path).split('.')[0]
                result = engine(new_img_path, img_idx=index)
                paddleocr.save_structure_res(result, args.output, img_name, index)

                if args.recovery and result != []:
                    from copy import deepcopy
                    from ppstructure.recovery.recovery_to_doc import sorted_layout_boxes
                    h, w, _ = img.shape
                    result_cp = deepcopy(result)
                    result_sorted = sorted_layout_boxes(result_cp, w)
                    all_res += result_sorted

            if args.recovery and all_res != []:
                try:
                    from ppstructure.recovery.recovery_to_doc import convert_info_docx
                    convert_info_docx(img, all_res, args.output, img_name)
                except Exception as ex:
                    logger.error(
                        "error in layout recovery image:{}, err msg: {}".format(
                            img_name, ex))
                    continue

            for item in all_res:
                item.pop('img')
                item.pop('res')
                logger.info(item)
            logger.info('result save to {}'.format(args.output))
            # Report the output directory in the UI (the original repeated the
            # "processing" message here, which looks like a copy-paste slip).
            self.showtext.append('result save to {}'.format(args.output))
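The pdf2docx branch is the part of PDFmain that actually produces the .docx file when use_pdf2docx_api is set. As a minimal sketch, assuming pdf2docx is installed, that step can be isolated from the PaddleOCR plumbing roughly as follows. The helper names docx_path_for and convert_pdf_to_docx are hypothetical and not part of PaddleOCR or pdf2docx. The sketch also uses os.path.splitext rather than split('.')[0], so a file name with extra dots keeps its full stem.

```python
import os


def docx_path_for(pdf_path, output_dir):
    """Return output_dir/<stem>.docx for the given PDF path (hypothetical helper)."""
    # splitext strips only the final extension, so "report.v2.pdf" keeps the
    # stem "report.v2"; split('.')[0] would shorten it to "report".
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(output_dir, stem + ".docx")


def convert_pdf_to_docx(pdf_path, output_dir="pdf"):
    """Convert one PDF with pdf2docx and return the path of the .docx file."""
    # Imported lazily so the path helper above works without pdf2docx installed.
    from pdf2docx import Converter

    os.makedirs(output_dir, exist_ok=True)
    docx_file = docx_path_for(pdf_path, output_dir)
    cv = Converter(pdf_path)
    try:
        cv.convert(docx_file)
    finally:
        cv.close()  # release the PDF even if the conversion fails
    return docx_file
```

Calling convert_pdf_to_docx(fname) from the slot would be one way to use it, at the cost of bypassing the PPStructure path for non-PDF input.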
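III. Running the Program
Running the program raised an error: AttributeError: 'Document' object has no attribute 'pageCount'
An online search suggested that the PyMuPDF version was wrong and recommended installing version 1.18.14:
pip install PyMuPDF==1.18.14
After that installation finished, pip warned that pdf2docx requires PyMuPDF>=1.19.0. I tried running anyway and, sure enough, it failed, so I installed 1.19.0 instead:
pip install PyMuPDF==1.19.0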
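To catch this kind of mismatch before running the GUI, the installed version can be compared against the minimum that pdf2docx asks for. The following is only a sketch: the function names are hypothetical, and it assumes plain numeric versions such as 1.18.14 (a suffix like rc1 would need a real version parser).

```python
from importlib import metadata


def version_tuple(version):
    """Turn a plain dotted version such as '1.18.14' into (1, 18, 14)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    # Comparing integer tuples avoids string pitfalls such as "1.9" > "1.19".
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

With these helpers, meets_minimum("1.18.14", "1.19.0") is False, which matches the pip warning above, and installed_version("PyMuPDF") reports what is currently installed in the environment.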
IV. Testing
1. On the first test run, the required models are downloaded.