Text recognition with Python + PaddleOCR

Possible pitfalls

  • AttributeError: partially initialized module 'numpy' has no attribute 'array

        Fix: change the numpy version. The latest release at the time of writing is 1.24; downgrading to 1.22 resolves the problem.

  • if scores is not None and (scores[i] < drop_score or TypeError: '<' not supp

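        Fix: change the PaddleOCR version. The latest release at the time of writing is 2.6, and in V2.6 the value returned by ocr.ocr is not the plain array of lines that this script expects, so it would have to be converted first. Pinning an older release (2.5.0.3, see step 5 below) avoids the conversion.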
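If you would rather stay on 2.6 than downgrade, the conversion mentioned above could look roughly like the sketch below. It rests on an assumption that I have not checked against every release: newer versions wrap the lines in one extra list per input image, while 2.5.x returns the lines directly. `flatten_ocr_result` is a hypothetical helper name, not part of PaddleOCR.

```python
def flatten_ocr_result(result):
    """Return a flat list of [box, (text, score)] entries.

    Assumption: newer PaddleOCR releases return one list of lines per
    input image (an extra nesting level); 2.5.x returns the lines directly.
    """
    if not result:
        return []
    first = result[0]
    # A line looks like [box, (text, score)]: its second item is a (str, float) pair.
    is_line = (
        isinstance(first, (list, tuple))
        and len(first) == 2
        and isinstance(first[1], (list, tuple))
        and len(first[1]) == 2
        and isinstance(first[1][0], str)
    )
    if is_line:
        return list(result)
    # Otherwise treat each element as the list of lines for one image.
    flat = []
    for page in result:
        if page:  # an image without any text may yield None or an empty list
            flat.extend(page)
    return flat
```

With this in place, the list comprehensions that build boxes, txts and scores in the script would iterate over `flatten_ocr_result(result)` instead of `result`.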

Environment setup

1. Install Python. I used Python 3.8.

2. Install paddlepaddle

pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple

3. Install layoutparser

pip3 install -U https://paddleocr.bj.bcebos.com/whl/layoutparser-0.0.0-py3-none-any.whl

4. Install Shapely

Download page: https://www.lfd.uci.edu/~gohlke/pythonlibs/#shapely

Pick the wheel that matches your Python version. I downloaded Shapely-1.8.2-cp38-cp38-win_amd64.whl
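If you are unsure which wheel matches your interpreter, the tag can be derived from the running Python itself. This is only a small sketch: `wheel_python_tag` and `is_64bit` are hypothetical helper names, and it assumes the usual CPython naming scheme (`cp38` for Python 3.8, `win_amd64` for 64-bit Windows builds).

```python
import struct
import sys


def wheel_python_tag(version_info=sys.version_info):
    """Build the CPython tag used in wheel file names, e.g. cp38 for Python 3.8."""
    return "cp{}{}".format(version_info[0], version_info[1])


def is_64bit():
    """True when the running interpreter is a 64-bit build."""
    return struct.calcsize("P") * 8 == 64


if __name__ == "__main__":
    print(wheel_python_tag(), "64-bit" if is_64bit() else "32-bit")
```

On the setup described here (64-bit Python 3.8 on Windows) the tag is cp38, which is why the file name contains cp38-cp38-win_amd64.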


Put the downloaded .whl file in the Python installation directory and run the following command:

pip install Shapely-1.8.2-cp38-cp38-win_amd64.whl

5. Install a pinned version of paddleocr

pip install paddleocr==2.5.0.3 -i https://mirror.baidu.com/pypi/simple
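Since both pitfalls above come down to version mismatches, it can help to confirm what actually got installed before running anything. A minimal sketch using only the standard library (importlib.metadata ships with Python 3.8 and later); `installed_versions` is a hypothetical helper, and the distribution names in the default list are my assumption about how the packages are registered with pip.

```python
from importlib import metadata  # standard library from Python 3.8 on


def installed_versions(packages=("numpy", "paddlepaddle", "paddleocr", "Shapely")):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(name, version or "not installed")
```

For the setup in this post you would hope to see numpy at 1.22.x and paddleocr at 2.5.0.3.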

6. Run the script

from paddleocr import PaddleOCR, draw_ocr
from PIL import Image
import fitz
import os

# The lang parameter selects the recognition language; PaddleOCR currently
# supports e.g. `ch`, `en`, `fr`, `german`, `korean`, `japan`
def ocrImg(language, img_path, result_img):
    # Creating PaddleOCR downloads the models on first use and loads them into memory
    ocr = PaddleOCR(use_angle_cls=True, use_gpu=False, lang=language)
    result = ocr.ocr(img_path, cls=True)
    # Each line has the form [box, (text, score)]
    for line in result:
        print(line)
    # Draw the boxes, texts and scores onto the image and save the result
    image = Image.open(img_path).convert('RGB')
    boxes = [line[0] for line in result]
    txts = [line[1][0] for line in result]
    scores = [line[1][1] for line in result]
    im_show = draw_ocr(image, boxes, txts, scores, font_path='./fonts/simfang.ttf')
    im_show = Image.fromarray(im_show)
    im_show.save(result_img)
def pdf_to_jpg(name, language):
    # Create the OCR engine once and reuse it for every page
    ocr = PaddleOCR(use_angle_cls=True, use_gpu=False, lang=language)
    pdfdoc = fitz.open(name)
    # Output text file: the PDF's base name with a .txt extension
    filename = os.path.splitext(os.path.basename(name))[0] + '.txt'
    # A zoom factor of 2 on each axis yields an image with four times as many pixels
    trans = fitz.Matrix(2.0, 2.0)
    for pg in range(pdfdoc.page_count):
        page = pdfdoc[pg]
        pm = page.get_pixmap(matrix=trans, alpha=False)
        # Pixmap.save is the public counterpart of the private _writeIMG
        pm.save('temp.png')

        # Run OCR on the rendered page
        result = ocr.ocr('temp.png', cls=True)

        # Append every line recognised with a confidence above 0.5
        with open(filename, mode='a', encoding='utf-8') as f:
            for line in result:
                if line[1][1] > 0.5:
                    print(line[1][0])
                    f.write(line[1][0] + '\n')
        print(pg)
if __name__ == '__main__':
    # To read the arguments from the command line instead (needs `import sys`):
    # language = sys.argv[1]
    # img_path = sys.argv[2]
    # result_img = sys.argv[3]
    # ocrImg(language, img_path, result_img)
    language = 'ch'
    img_path = 'F:/gs.jpg'
    result_img = 'F:/1result.jpg'
    ocrImg(language, img_path, result_img)
    # pdf_to_jpg(r'F:/1docx.pdf', 'ch')
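The 0.5 confidence filter inside pdf_to_jpg can also be pulled out into a small helper, which makes it easy to try on sample data without loading any model. This is just a sketch: `confident_texts` is a hypothetical name, and it assumes the flat result shape that 2.5.0.3 returns, i.e. a list of [box, (text, score)] entries.

```python
def confident_texts(result, min_score=0.5):
    """Keep the text of every line whose confidence exceeds min_score."""
    return [text for _box, (text, score) in result if score > min_score]


# Two lines copied from the recognition output of this post
sample = [
    [[[13.0, 367.0], [210.0, 367.0], [210.0, 380.0], [13.0, 380.0]],
     ('1.《泪船A洲》宋朝王安石', 0.7845669984817505)],
    [[[11.0, 901.0], [177.0, 902.0], [177.0, 916.0], [11.0, 915.0]],
     ('9《小池》宋朝杨万里', 0.9901900291442871)],
]
print(confident_texts(sample, min_score=0.8))  # ['9《小池》宋朝杨万里']
```

Raising the threshold drops the misread first line here, but a high score does not guarantee a correct reading, so treat it as a coarse filter only.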

Test image

[Image 2 of the original post: the test image]

Recognition result

[2023/01/03 15:05:02] ppocr DEBUG: Namespace(alpha=1.0, benchmark=False, beta=1.0, cls_batch_num=6, cls_image_shape='3, 48, 192', cls_model_dir='C:\\Users\\DELL/.paddleocr/whl\\cls\\ch_ppocr_mobile_v2.0_cls_infer', cls_thresh=0.9, cpu_threads=10, crop_res_save_dir='./output', det=True, det_algorithm='DB', det_db_box_thresh=0.6, det_db_score_mode='fast', det_db_thresh=0.3, det_db_unclip_ratio=1.5, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_east_score_thresh=0.8, det_fce_box_type='poly', det_limit_side_len=960, det_limit_type='max', det_model_dir='C:\\Users\\DELL/.paddleocr/whl\\det\\ch\\ch_PP-OCRv3_det_infer', det_pse_box_thresh=0.85, det_pse_box_type='quad', det_pse_min_area=16, det_pse_scale=1, det_pse_thresh=0, det_sast_nms_thresh=0.2, det_sast_polygon=False, det_sast_score_thresh=0.5, draw_img_save_dir='./inference_results', drop_score=0.5, e2e_algorithm='PGNet', e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_limit_side_len=768, e2e_limit_type='max', e2e_model_dir=None, e2e_pgnet_mode='fast', e2e_pgnet_score_thresh=0.5, e2e_pgnet_valid_set='totaltext', enable_mkldnn=False, fourier_degree=5, gpu_mem=500, help='==SUPPRESS==', image_dir=None, ir_optim=True, label_list=['0', '180'], lang='ch', layout=True, layout_label_map=None, layout_path_model='lp://PubLayNet/ppyolov2_r50vd_dcn_365e_publaynet/config', max_batch_size=10, max_text_length=25, min_subgraph_size=15, mode='structure', ocr=True, ocr_version='PP-OCRv3', output='./output', precision='fp32', process_id=0, rec=True, rec_algorithm='SVTR_LCNet', rec_batch_num=6, rec_char_dict_path='E:\\Python\\Python38\\lib\\site-packages\\paddleocr\\ppocr\\utils\\ppocr_keys_v1.txt', rec_image_shape='3, 48, 320', rec_model_dir='C:\\Users\\DELL/.paddleocr/whl\\rec\\ch\\ch_PP-OCRv3_rec_infer', save_crop_res=False, save_log_path='./log_output/', scales=[8, 16, 32], show_log=True, structure_version='PP-STRUCTURE', table=True, table_char_dict_path=None, table_max_len=488, table_model_dir=None, 
total_process_num=1, type='ocr', use_angle_cls=True, use_dilation=False, use_gpu=False, use_mp=False, use_onnx=False, use_pdserving=False, use_space_char=True, use_tensorrt=False, vis_font_path='./doc/fonts/simfang.ttf', warmup=False)
[2023/01/03 15:05:04] ppocr DEBUG: dt_boxes num : 34, elapse : 0.4727060794830322
[2023/01/03 15:05:04] ppocr DEBUG: cls num  : 34, elapse : 0.4208414554595947
[2023/01/03 15:05:22] ppocr DEBUG: rec_res num  : 34, elapse : 17.66183590888977
[[[25.0, 14.0], [298.0, 14.0], [298.0, 31.0], [25.0, 31.0]], ('《古诗三百首》300篇大全集', 0.8422850370407104)]
[[[14.0, 55.0], [217.0, 55.0], [217.0, 67.0], [14.0, 67.0]], ('国学朗读2022-04-07 06:27', 0.9284312129020691)]
[[[22.0, 111.0], [678.0, 111.0], [678.0, 124.0], [22.0, 124.0]], ('古诗二白自》收集了先秦两汉诗、魏普南北朝诗、宋诗、辽金元诗、明诗、清诗/近代诗等', 0.8885766863822937)]
[[[13.0, 134.0], [684.0, 135.0], [684.0, 149.0], [13.0, 148.0]], ('300首经典古诗词,真中既包括《诗经》和《楚辞》单的名篇章,支包括历代天才诗人的杰', 0.8994793891906738)]
[[[15.0, 162.0], [672.0, 162.0], [672.0, 174.0], [15.0, 174.0]], ('作品。在这些诗人中,二国的曹植被誉为“才高八斗”,他的诗不仪辞优美,而且骨气', 0.8431311249732971)]
[[[15.0, 187.0], [671.0, 187.0], [671.0, 201.0], [15.0, 201.0]], ('奇高;东普的陶渊明被称为“古今隐逸诗人之宗”,他的由园诗冲淡平和又韵味隽永;南齐', 0.9045629501342773)]
[[[14.0, 212.0], [672.0, 212.0], [672.0, 226.0], [14.0, 226.0]], ('的谢跳让唐代大诗人李日一生佩服不已,他的诗即使混在唐诗里也显得出类拨萃;北宋的苏', 0.9359258413314819)]
[[[13.0, 237.0], [672.0, 239.0], [672.0, 253.0], [13.0, 251.0]], ('轼和黄庭坚等人开创的宋代诗风,努力在唐诗的基础上推陈出新,对明清两代影响极大;南', 0.8938716053962708)]
[[[14.0, 263.0], [654.0, 263.0], [654.0, 276.0], [14.0, 276.0]], ('宋的陆游一生创作了近首诗,他的诗中饱含看渴望恢复中原和为国牺牲的爱国激情。此', 0.8550498485565186)]
[[[13.0, 289.0], [671.0, 289.0], [671.0, 302.0], [13.0, 302.0]], ('外,对人生的感悟,对家乡、故国的思念,对四季美景的歌咏,对民生疾苦的呼吁它们或慷', 0.8616670370101929)]
[[[14.0, 314.0], [670.0, 314.0], [670.0, 328.0], [14.0, 328.0]], ('既激晶,或生动活泼,或明白晓畅,或含蓄深沉,但都节奏,朗朗上口,处处闪烁着中', 0.963333249092102)]
[[[13.0, 339.0], [181.0, 340.0], [181.0, 354.0], [13.0, 353.0]], ('华经典古诗的智慧之光。', 0.8590850234031677)]
[[[13.0, 367.0], [210.0, 367.0], [210.0, 380.0], [13.0, 380.0]], ('1.《泪船A洲》宋朝王安石', 0.7845669984817505)]
[[[14.0, 392.0], [527.0, 392.0], [527.0, 405.0], [14.0, 405.0]], ('京!洲一水间,钟山只隔数重山。春风绿江南岸,明月何时照浅还', 0.8811277747154236)]
[[[14.0, 417.0], [211.0, 417.0], [211.0, 431.0], [14.0, 431.0]], ('2.《过零丁洋》宋朝文天祥', 0.8679015636444092)]
[[[14.0, 443.0], [657.0, 443.0], [657.0, 457.0], [14.0, 457.0]], ('辛苦遭逢起一经,干戈蓼落四周星。山河破碎风飘絮,身世浮沉雨打萍。煌恐滩头说悍忍', 0.8498225808143616)]
[[[13.0, 468.0], [397.0, 469.0], [397.0, 483.0], [13.0, 482.0]], ('零丁洋里叹零丁。人生自舌古谁无死?留取丹心照汗青。', 0.8797885179519653)]
[[[13.0, 495.0], [178.0, 495.0], [178.0, 508.0], [13.0, 508.0]], ('3.《短歌行》二国·曹操', 0.9484836459159851)]
[[[22.0, 519.0], [672.0, 519.0], [672.0, 533.0], [22.0, 533.0]], ('(短歌行其一)对酒当歌,人生几何!馨如朝露,去日苦多。概当以,忧思难忘。何以解', 0.8763952851295471)]
[[[14.0, 545.0], [684.0, 545.0], [684.0, 559.0], [14.0, 559.0]], ('优?唯有杜康。青青子,悠悠我心。但为君故,沉吟至今。呦呦鹿吗鸣,食之苹。我有嘉.', 0.8525821566581726)]
[[[13.0, 570.0], [178.0, 570.0], [178.0, 584.0], [13.0, 584.0]], ('4.《观沧海》三国·曹操', 0.8883654475212097)]
[[[13.0, 597.0], [661.0, 597.0], [661.0, 610.0], [13.0, 610.0]], ('乐临碣白,以观沧海。水何漳,山岛谏峙。树木丛生,白草丰茂。秋风萧瑟,洪波涌起。', 0.8333690166473389)]
[[[15.0, 623.0], [488.0, 623.0], [488.0, 636.0], [15.0, 636.0]], ('日月之行,若出其中;星汉灿烂,者出其里。莘甚至战!歌以咏..', 0.8013597130775452)]
[[[13.0, 647.0], [82.0, 647.0], [82.0, 661.0], [13.0, 661.0]], ('5.《上邪', 0.9481815099716187)]
[[[14.0, 673.0], [672.0, 673.0], [672.0, 687.0], [14.0, 687.0]], ('上邪!我欲与君相知,长命无绝衰。山无陵,江水为竭,冬雷震震,夏雨雪,天地合,敢', 0.9163383841514587)]
[[[14.0, 698.0], [69.0, 698.0], [69.0, 712.0], [14.0, 712.0]], ('与君绝!', 0.9548820853233337)]
[[[12.0, 723.0], [178.0, 725.0], [178.0, 739.0], [12.0, 737.0]], ('6.《龟星寿》三国·曹操', 0.9079063534736633)]
[[[14.0, 750.0], [660.0, 750.0], [660.0, 764.0], [14.0, 764.0]], ('神龟虽寿,犹有竟时。腾蛇乘雾,终为王灰。老骥伏,志在于里。烈士暮年,壮心不已,', 0.9073294401168823)]
[[[14.0, 775.0], [487.0, 775.0], [487.0, 789.0], [14.0, 789.0]], ('盈缩宿之期,不但在天;养怡之福,可得永年。幸甚至哉,歌以咏..', 0.910293698310852)]
[[[13.0, 801.0], [158.0, 801.0], [158.0, 815.0], [13.0, 815.0]], ('7.《敕勒歌》南北朝', 0.8691929578781128)]
[[[13.0, 827.0], [561.0, 827.0], [561.0, 841.0], [13.0, 841.0]], ('勒川,阴山下。天似弯庐,笼盖四野。天苍苍,野茫茫,风吹草低见牛羊。', 0.9191446900367737)]
[[[13.0, 853.0], [162.0, 853.0], [162.0, 866.0], [13.0, 866.0]], ('8.《所见》清朝意枚', 0.8857056498527527)]
[[[13.0, 878.0], [396.0, 878.0], [396.0, 892.0], [13.0, 892.0]], ('牧童骑黄,歌声振林樾。意欲捕鸣蝉,忽然闭口立。', 0.8716815710067749)]
[[[11.0, 901.0], [177.0, 902.0], [177.0, 916.0], [11.0, 915.0]], ('9《小池》宋朝杨万里', 0.9901900291442871)]
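Each printed line above is a pair: the four corner points of the detected text box, followed by a (text, confidence) tuple. As a rough illustration of how to work with that shape, the sketch below unpacks the first output line and converts its corners into an axis-aligned rectangle; `box_to_rect` is a hypothetical helper, not part of PaddleOCR.

```python
def box_to_rect(box):
    """Convert four corner points into (left, top, width, height)."""
    xs = [point[0] for point in box]
    ys = [point[1] for point in box]
    left, top = min(xs), min(ys)
    return left, top, max(xs) - left, max(ys) - top


# First line of the output above
line = [[[25.0, 14.0], [298.0, 14.0], [298.0, 31.0], [25.0, 31.0]],
        ('《古诗三百首》300篇大全集', 0.8422850370407104)]
box, (text, score) = line
print(box_to_rect(box), text, round(score, 2))
# (25.0, 14.0, 273.0, 17.0) 《古诗三百首》300篇大全集 0.84
```

Taking the minimum and maximum of the coordinates also covers slightly skewed boxes, such as the ones whose corner y values differ by a pixel.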
