Extracting Text from PDFs with CNOCR, PaddleOCR, and Tesseract: Personal Notes

Contents

I. PyMuPDF

II. CNOCR

III. PaddleOCR

IV. Tesseract

V. Personal test comparison


I. PyMuPDF

1. Install PyMuPDF

pip install pymupdf

2. Example: PDF to txt

import os
import datetime
import fitz  # fitz is the import name of PyMuPDF (pip install pymupdf)


def pyMuPDF_fitz(pdfPath):
    startTime_pdf2txt = datetime.datetime.now()  # start time

    # Collect the text layer of every page
    pdfDoc = fitz.open(pdfPath)
    text_list = [page.get_text() for page in pdfDoc]
    pdfDoc.close()
    text = "\n".join(text_list)
    try:
        # Append to a single output file; utf-8 keeps Chinese text intact
        with open("/home/bingxing2/ailab/group/ai4agr/wzf/LLM/txt/test.txt", 'a+', encoding='utf-8') as out_file:
            out_file.write(text)
    except IOError as e:
        print("An error occurred while writing the file:", e)

    endTime_pdf2txt = datetime.datetime.now()  # end time
    # total_seconds() keeps the fractional part; .seconds would truncate it
    print('pdf2txt time =', (endTime_pdf2txt - startTime_pdf2txt).total_seconds())


def process_all_pdfs_in_directory(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.pdf'):
            pdf_path = os.path.join(directory, filename)
            pyMuPDF_fitz(pdf_path)


if __name__ == "__main__":
    # Directory containing the PDFs
    pdf_directory = r'/home/bingxing2/ailab/group/ai4agr/wzf/LLM/pdf/'
    process_all_pdfs_in_directory(pdf_directory)
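
A note on timing: a `timedelta`'s `.seconds` attribute holds only whole seconds, so a fast run can print 0. A minimal sketch of an alternative built on the standard library's `time.perf_counter()` follows. The helper name `timed` and the `results` dictionary are hypothetical and not part of the original script.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the wrapped block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start  # fractional seconds
        if results is not None:
            results[label] = elapsed  # optionally keep the figure for later comparison
        print(f"{label}: {elapsed:.3f} s")


if __name__ == "__main__":
    timings = {}
    with timed("pdf2txt", timings):
        sum(range(100_000))  # stand-in for a call such as pyMuPDF_fitz(pdf_path)
```

Collecting the figures in a dictionary makes it easier to compare the OCR engines later on the same set of pages.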

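Notes:

pymupdf's get_text() does not return tables as structured data; pdfplumber can be used for that (recent PyMuPDF releases also provide page.find_tables()).

To list the images on a page, use img = page.get_images() (called page.getImageList() in older PyMuPDF releases).

After extraction, the text came out correctly but the digits did not.

Cause: in this PDF the digits are stored as images rather than as text, so extracting them requires OCR (optical character recognition).

So the PDF is first converted to images, and the text is then extracted from the images (with cnocr, paddleocr, or tesseract).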
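
One way to decide per page whether OCR is worth running is to compare what the text layer returns with whether the page carries any images. The sketch below is a rough heuristic under stated assumptions: `needs_ocr`, `page_needs_ocr`, and the `min_chars` threshold are hypothetical names and values, not part of the original workflow. Only `page.get_text()` and `page.get_images()` are PyMuPDF calls.

```python
def needs_ocr(text, image_count, min_chars=20):
    """Heuristic: flag a page whose text layer looks incomplete.

    A page is flagged when it carries images and the extracted text is
    either very short or contains no digits at all (the symptom seen above).
    """
    stripped = text.strip()
    if image_count == 0:
        # No embedded images; this heuristic has nothing to go on.
        return False
    has_digits = any(ch.isdigit() for ch in stripped)
    return len(stripped) < min_chars or not has_digits


def page_needs_ocr(page, min_chars=20):
    """Apply the heuristic to a PyMuPDF page object."""
    return needs_ocr(page.get_text(), len(page.get_images()), min_chars)
```

This will misjudge pages that legitimately contain no digits, so it is better suited to spot checks than to an automatic switch.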

PDF to images:

import os
import fitz  # fitz is the import name of PyMuPDF (pip install pymupdf)


def pdf_to_images(directory, filename, output_folder):
    """Render every page of one PDF to a PNG file in output_folder."""
    pdf_path = os.path.join(directory, filename)
    stem = os.path.splitext(filename)[0]  # file name without the .pdf extension
    pdf_doc = fitz.open(pdf_path)
    for page_number in range(len(pdf_doc)):
        page = pdf_doc[page_number]
        # Matrix(4, 4) renders the page at 4x zoom
        image = page.get_pixmap(matrix=fitz.Matrix(4, 4), alpha=False)
        image_path = os.path.join(output_folder, f"{stem}_page_{page_number + 1}.png")
        image.save(image_path)
    pdf_doc.close()


def process_all_pdfs_in_directory(directory, output_folder):
    # Convert every PDF in the directory to images
    os.makedirs(output_folder, exist_ok=True)
    for filename in os.listdir(directory):
        if filename.endswith('.pdf'):
            pdf_to_images(directory, filename, output_folder)


if __name__ == "__main__":
    # Directory containing the PDFs
    pdf_directory = r'/home/bingxing2/ailab/group/ai4agr/wzf/LLM/pdf/books/'
    # Directory the images are written to
    output_folder = r'/home/bingxing2/ailab/group/ai4agr/wzf/LLM/images/books/'
    process_all_pdfs_in_directory(pdf_directory, output_folder)
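
`fitz.Matrix(4, 4)` scales the page by 4 in both directions. PDF page sizes are expressed in points (1/72 inch), so a zoom factor corresponds to an effective resolution of 72 × zoom dpi. The sketch below works through that arithmetic. The helper names are hypothetical, and the A4 size of 595 × 842 points is used only as an example.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def zoom_to_dpi(zoom):
    """Effective resolution of a page rendered with the given zoom factor."""
    return PDF_POINTS_PER_INCH * zoom


def rendered_size(width_pt, height_pt, zoom):
    """Approximate pixel size of the rendered page image."""
    return round(width_pt * zoom), round(height_pt * zoom)


if __name__ == "__main__":
    print(zoom_to_dpi(4))               # 288
    print(rendered_size(595, 842, 4))   # (2380, 3368)
```

A higher zoom gives the OCR engine more pixels per character at the cost of larger files and slower recognition, so the factor is worth tuning per document.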

II. CNOCR

1. Install cnocr

pip install cnocr

2. Images to text, saved into a single txt file


import cnocr
import os
import datetime


def recognize_text(txt_path, image_directory):
    # Initialize cnocr
    ocr = cnocr.CnOcr()

    texts = []
    # sorted() gives a deterministic (lexicographic) file order
    for filename in sorted(os.listdir(image_directory)):
        if filename.endswith('.png'):
            start_time = datetime.datetime.now()  # start time

            image_path = os.path.join(image_directory, filename)
            # Read the image and recognize its text
            results = ocr.ocr(image_path)
            text = ''.join(result['text'].replace('\n', '') for result in results)
            texts.append(text)

            # Append each page's text as soon as it is recognized
            with open(txt_path, 'a+', encoding='utf-8') as f:
                f.write(text + '\n')

            elapsed = (datetime.datetime.now() - start_time).total_seconds()
            print('img2txt time =', elapsed, ",", filename, "written")
    return texts  # one string per image


if __name__ == "__main__":
    # Directory containing the images
    image_directory = '/home/bingxing2/ailab/group/ai4agr/wzf/LLM/images/books'
    # Path of the output txt file
    txt_path = "/home/bingxing2/ailab/group/ai4agr/wzf/LLM/txt/test.txt"
    # Recognize the text
    recognize_text(txt_path, image_directory)
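
Because every page is appended to one txt file, the order in which the images are visited matters. `os.listdir()` returns names in arbitrary order, and plain string sorting puts `page_10` before `page_2`. A minimal sketch of a natural sort key that restores page order follows. `natural_key` is a hypothetical helper, and the file names are examples following the `_page_N.png` pattern used earlier.

```python
import re


def natural_key(filename):
    """Split a name into text and integer chunks so 'page_2' sorts before 'page_10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", filename)]


if __name__ == "__main__":
    names = ["book_page_10.png", "book_page_2.png", "book_page_1.png"]
    print(sorted(names))                   # lexicographic order
    print(sorted(names, key=natural_key))  # page order
```

In the loop above this would be used as `sorted(os.listdir(image_directory), key=natural_key)`.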

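III. PaddleOCR

Steps:

1. Install PaddleOCR

2. Prepare the PDF files

3. Convert the PDFs to images, then extract text from the images

Installation:

1. Install PaddleOCR

pip install "paddleocr>=2.0.1"

2. Install paddlepaddle (the CPU build is installed by default; the GPU build does not currently seem to support the arm64 architecture, see the installation guide in the PaddlePaddle documentation)

For the GPU build, see the "Get Started" page on the official PaddlePaddle site.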

pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
or
pip install paddlepaddle -i https://pypi.tuna.tsinghua.edu.cn/simple


pip install pymupdf --upgrade -i https://pypi.tuna.tsinghua.edu.cn/simple

Verify that paddlepaddle is installed correctly by running the following in a Python session:
python
import paddle
paddle.utils.run_check() 

3. Images to text, saved into a single txt file

import paddleocr
import os
import datetime
import fitz


def recognize_text(txt_path, image_directory, pdf_directory):
    # Initialize PaddleOCR
    ocr = paddleocr.PaddleOCR(use_angle_cls=True, lang='ch')

    for filename in os.listdir(pdf_directory):
        if filename.endswith('.pdf'):
            pdf_path = os.path.join(pdf_directory, filename)
            # The PDF is opened only for its page count; the page images must already exist
            pdf_doc = fitz.open(pdf_path)
            page_count = len(pdf_doc)
            pdf_doc.close()
            stem = os.path.splitext(filename)[0]
            for page_number in range(page_count):
                image_name = f"{stem}_page_{page_number + 1}.png"
                image_path = os.path.join(image_directory, image_name)

                start_time = datetime.datetime.now()  # start time
                # Read the image and recognize its text
                results = ocr.ocr(image_path, cls=True)
                # results[0] can be None when no text is detected on the page
                text = ''.join(line[1][0] for line in (results[0] or []))

                # Append the recognized text to the output file
                with open(txt_path, 'a+', encoding='utf-8') as f:
                    f.write(text + '\n')

                elapsed = (datetime.datetime.now() - start_time).total_seconds()
                print('img2txt time =', elapsed, ",", image_name, "written")


if __name__ == "__main__":
    # Directory containing the images
    image_directory = '/home/bingxing2/ailab/group/ai4agr/wzf/LLM/images/books'
    # Path of the output txt file
    txt_path = "/home/bingxing2/ailab/group/ai4agr/wzf/LLM/txt/testpaddlepaddleocr.txt"
    # Directory containing the PDFs
    pdf_directory = '/home/bingxing2/ailab/group/ai4agr/wzf/LLM/pdf/books/'
    # Recognize the text
    recognize_text(txt_path, image_directory, pdf_directory)
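
The indexing `result[1][0]` relies on the nested structure that `ocr.ocr()` returns. The sketch below spells that structure out and adds a confidence filter. It assumes the PaddleOCR 2.x layout (one entry per image, each entry a list of `[box, (text, score)]` items, or `None` when nothing is detected); other releases may differ. `lines_from_result` and the `min_score` threshold are hypothetical.

```python
def lines_from_result(results, min_score=0.5):
    """Collect recognized strings from one ocr.ocr() return value.

    Assumed layout (PaddleOCR 2.x): one entry per image, each entry a
    list of [box, (text, score)] items, or None when nothing was detected.
    """
    lines = []
    for page in results or []:
        for _box, (text, score) in page or []:
            if score >= min_score:  # drop low-confidence detections
                lines.append(text)
    return lines


if __name__ == "__main__":
    # Hand-written sample in the assumed layout, not real OCR output
    sample = [[
        [[[0, 0], [10, 0], [10, 5], [0, 5]], ("hello", 0.98)],
        [[[0, 6], [10, 6], [10, 11], [0, 11]], ("??", 0.20)],
    ]]
    print(lines_from_result(sample))
```

Returning a list of lines rather than one joined string also makes it possible to keep line breaks in the output file.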

4. Error:

from paddleocr import PaddleOCR
import re

ocr = PaddleOCR(lang="ch")  # use Chinese recognition
result = ocr.ocr("/home/bingxing2/ailab/group/ai4agr/wzf/LLM/images/page_1.png")

for line in result:
    print(line)  # print the recognition result

Running the script fails with a segmentation fault:

(myenv)  [scxlab0069@paraai-n32-h-01-ccs-master-1 wzf]$ python /home/bingxing2/ailab/group/ai4agr/wzf/LLM/ocr/paddleocr/img_to_txt.py
download https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar to /home/bingxing2/ailab/scxlab0069/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer/ch_PP-OCRv4_det_infer.tar
100%|███████████████████████████████████████████████████████████████████████| 4.89M/4.89M [00:00<00:00, 13.9MiB/s]


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   inflateReset2

----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1715233076 (unix time) try "date -d @1715233076" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x4ad7b62366b7a28) received by PID 4127064 (TID 0x40000b615370) from PID 913013288 ***]

Segmentation fault

Fix: paddlepaddle 2.6 is too new; reinstalling paddlepaddle 2.5.2 resolves the problem. See GitHub issue #12075 in PaddlePaddle/PaddleOCR, "Error message on the CPU build: `Segmentation fault` is detected by the operating system".
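
Before running OCR it can help to confirm which paddlepaddle release is actually installed in the active environment. The sketch below reads the installed version with the standard library's `importlib.metadata` and compares it against 2.6. The helper names are hypothetical, and the 2.6 boundary is taken from the note above rather than from any official compatibility table.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '2.5.2' into (2, 5, 2), keeping only the leading digits of each part."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(dist_name="paddlepaddle"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("paddlepaddle is not installed")
    elif version_tuple(found) >= (2, 6):
        print(f"paddlepaddle {found}: 2.6 or newer, the range that crashed here")
    else:
        print(f"paddlepaddle {found}: older than 2.6")
```

If the check reports 2.6 or newer, pinning the version with `pip install paddlepaddle==2.5.2` applies the fix described above.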

IV. Tesseract

1. Install the Leptonica dependency

wget https://github.com/DanBloomberg/leptonica/releases/download/1.80.0/leptonica-1.80.0.tar.gz

tar -xzvf leptonica-1.80.0.tar.gz

cd leptonica-1.80.0

./configure --prefix=/home/bingxing2/ailab/scxlab0069/.conda/envs/test_llm  && make && make install
# --prefix is the directory Leptonica will be installed into

3. Add Leptonica to the environment variables

vim ~/.bashrc

Add the following lines:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/bingxing2/ailab/scxlab0069/.conda/envs/test_llm/lib
export LIBLEPT_HEADERSDIR=/home/bingxing2/ailab/scxlab0069/.conda/envs/test_llm/include
export PKG_CONFIG_PATH=/home/bingxing2/ailab/scxlab0069/.conda/envs/test_llm/lib/pkgconfig

After saving and exiting, reload the configuration:

source ~/.bashrc

4. Install Tesseract-OCR

wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/4.1.1.tar.gz

Rename the downloaded archive:

mv 4.1.1.tar.gz tesseract-4.1.1.tar.gz
  
tar -xzvf tesseract-4.1.1.tar.gz 

cd tesseract-4.1.1/

./autogen.sh

./configure --prefix=/home/bingxing2/ailab/scxlab0069/.conda/envs/test_llm  && make && make install

sudo ldconfig

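5. Download the language data for each language you need to recognize, e.g. the Chinese data for Chinese text and the English data for English text

# All language data files are available at: https://github.com/tesseract-ocr/tessdata
Simplified Chinese: chi_sim.traineddata
Traditional Chinese: chi_tra.traineddata
English: eng.traineddata

6. Configure the Tesseract environment variables

vim ~/.bashrc

# TESSDATA_PREFIX is the directory that holds the .traineddata files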
PATH=$PATH:/home/bingxing2/ailab/scxlab0069/.conda/envs/test_llm/bin
export PATH
export TESSDATA_PREFIX=/home/bingxing2/ailab/group/ai4agr/wzf/Tools/Tesseract/tessdata  
export PATH=$PATH:$TESSDATA_PREFIX

source ~/.bashrc
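
A missing or misplaced `.traineddata` file is an easy mistake to make at this step, so it can be worth checking the `TESSDATA_PREFIX` directory before the first run. The sketch below only looks for the three files named above. `missing_languages` is a hypothetical helper, not a Tesseract API.

```python
import os
from pathlib import Path


def missing_languages(tessdata_dir, languages):
    """Return the language codes whose <code>.traineddata file is absent."""
    folder = Path(tessdata_dir)
    return [lang for lang in languages
            if not (folder / f"{lang}.traineddata").is_file()]


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix is None:
        print("TESSDATA_PREFIX is not set")
    else:
        print("missing:", missing_languages(prefix, ["chi_sim", "chi_tra", "eng"]))
```

An empty list means all three language files are in place.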

7. Verify the installation

tesseract --version
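
With the installation verified, the page images produced earlier can be passed to Tesseract. The sketch below calls the command-line tool through `subprocess`, using `stdout` as the output base so the text is printed rather than written to a file. `build_tesseract_command` and `ocr_image` are hypothetical helpers, `page_1.png` is a placeholder path, and `chi_sim` assumes the Simplified Chinese data from step 5 is installed.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="chi_sim"):
    """Command line that prints the recognized text to standard output."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="chi_sim"):
    """Run the tesseract CLI on one image and return its text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    completed = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return completed.stdout


if __name__ == "__main__":
    # Placeholder path; point this at one of the rendered page images
    print(ocr_image("page_1.png"))
```

Calling the CLI avoids an extra Python binding; a wrapper package such as pytesseract would be an alternative if one is already installed.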

V. Personal test comparison

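Tool | Accuracy | Speed | Output format | Issues
--- | --- | --- | --- | ---
CNOCR | - | - | - | -
PaddleOCR | Fairly good | Fairly fast | .txt | Cannot handle multi-column text
Tesseract | Fairly poor | Fairly slow | .txt | Low accuracy
pdfminer | Fairly good | Very fast | .txt | Cannot handle Chinese-language papers; some PDF formats cannot be read
MinerU | Fairly good | Fairly slow | .md/.json | Tables are converted to images; formulas and numbers are converted to LaTeX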
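
The ratings above are qualitative impressions. One way to put a rough number on accuracy is to compare each tool's output for a page against a manually corrected reference text. The sketch below uses `difflib.SequenceMatcher` from the standard library. `similarity` is a hypothetical helper, and a similarity ratio is only a proxy for character accuracy, not a standard OCR metric.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, reference_text):
    """Ratio in [0, 1] of matching characters, ignoring all whitespace."""
    a = "".join(ocr_text.split())
    b = "".join(reference_text.split())
    # autojunk=False avoids the popularity heuristic on long texts
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    reference = "PaddleOCR 2.5.2"
    print(similarity("PaddleOCR 2.5.2", reference))
    print(similarity("Padd1eOCR 2.5.Z", reference))
```

Running this on the same handful of pages for each engine would give the table a comparable figure per tool.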

