gunicorn+flask+PaddleOCR

Preface

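Because our company serves government customers (2G), paid public-network APIs are off the table (they are also a security risk), so we tried several open-source OCR frameworks internally. The first was gosseract, a Go wrapper around an OCR engine. With the English model it was somewhat more accurate on digits and letters, but accuracy still only reached a little over 80%. After that we tried gunicorn+flask+PaddleOCR and built a simple web service.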

gosseract (build your own Ubuntu base image)

Dockerfile

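# Assumption: the original listing has no FROM line; the bionic sources below correspond to Ubuntu 18.04
FROM ubuntu:18.04

RUN echo 'deb http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\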
    deb http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse' > /etc/apt/sources.list

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
    apt-get -y install vim wget net-tools curl sudo make telnet iputils-ping tzdata git gcc libtbb2 zip && \
    ln -sf /usr/share/zoneinfo/Asia/Shanghai  /etc/localtime && echo "Asia/Shanghai" > /etc/timezone

RUN apt-get -y install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
RUN git clone https://github.com/tesseract-ocr/tesseract.git && cd tesseract && ./autogen.sh && ./configure && make && make install && ldconfig
# libleptonica needs a symlink before it can be used
RUN ln -s  /usr/lib/x86_64-linux-gnu/liblept.so /usr/lib/x86_64-linux-gnu/libleptonica.so

Then build a base image from the Dockerfile above; your own Go code builds its production image on top of that base image.

gunicorn+flask+PaddleOCR

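gunicorn is a WSGI server. It sits between incoming HTTP traffic and the application, playing a gateway-like role (compare how PHP applications are served), and it manages the application service with multiple worker processes.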
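To make the WSGI contract concrete, here is a minimal sketch of the kind of callable that gunicorn loads and runs in each worker process. The name `simple_app` is hypothetical; a Flask application object exposes the same interface, which is why it can be handed to gunicorn directly.

```python
def simple_app(environ, start_response):
    """A bare WSGI application: the server calls this once per HTTP request."""
    body = b"Hello World!"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    # Send the status line and headers back through the server-supplied callback.
    start_response("200 OK", headers)
    # The return value is an iterable of byte strings forming the response body.
    return [body]
```

Saved as a hypothetical `simple.py`, it could be served with `gunicorn -w 3 simple:simple_app`.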

Dockerfile (base image)

FROM registry.baidubce.com/paddlepaddle/paddle:2.1.3-gpu-cuda10.2-cudnn7

RUN echo 'deb http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse' > /etc/apt/sources.list
# Otherwise apt loads other sources and reports errors
RUN rm -rf /etc/apt/apt.conf.d/* /etc/apt/sources.list.d/
# For some reason the preinstalled SSL library does not work (version mismatch), so reinstall it
RUN apt update && apt remove -y libssl-dev && apt install -y libssl-dev

RUN python3 -m pip install paddlepaddle "paddleocr>=2.0.1" -i https://mirror.baidu.com/pypi/simple
RUN pip3 install gunicorn gevent flask  -i https://mirror.baidu.com/pypi/simple

RUN echo "from paddleocr import PaddleOCR" >> download.py
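# Preload the English model at build time so it is not downloaded after the service starts; to load the Chinese model as well, just copy the two lines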
RUN echo "PaddleOCR(use_angle_cls=True, lang=\"en\")" >> download.py
RUN python3 download.py

Production image

FROM ****/ocr_base:0.0.2

WORKDIR /workspace

COPY ./app/ocr/app.py /workspace/app.py
# This starts 3 workers; do not use too many, since a worker can consume nearly 2 GB of physical memory once the model is loaded
CMD cd /workspace && gunicorn -b 0.0.0.0:8000 -w 3 -k gevent --access-logfile - app:app

Here app.py is the Flask entry file:

import urllib.error
import urllib.request

from flask import Flask, request
from paddleocr import PaddleOCR


def save_image(url, outputfile):
    """Download url to outputfile; return True on success, False on a URL error."""
    try:
        response = urllib.request.urlopen(url)
        data = response.read()
        with open(outputfile, "wb") as file:
            file.write(data)
        return True
    except urllib.error.URLError as e:
        print("Error occurred while retrieving the URL:", e)
        return False

app = Flask(__name__)
ocr = PaddleOCR(use_angle_cls=True, lang="en")

@app.route("/")
def hello():
    print("-------------------")
    return "Hello World!"

@app.post("/ocr/check")
def check_post():
    # silent=True yields None instead of an error for a missing or invalid JSON body
    req = request.get_json(silent=True) or {}
    print(req, type(req))
    url = req.get("url")
    if url is None:
        return {"code": -1, "msg": "param url lost", "data": []}
    # The image URL is passed straight to PaddleOCR
    result = ocr.ocr(url, cls=True)
    if not result:
        return {"code": -3, "msg": "ocr result empty", "data": []}
    data = []
    # One entry per image; each entry is a list of [box, (text, score)] items
    for res in result:
        if not res:  # some PaddleOCR versions return None when no text is found
            continue
        for line in res:
            text, score = line[-1]
            data.append({"text": text, "score": score})
    return {"code": 0, "msg": "", "data": data}

if __name__ == "__main__":
    app.run()

At this point you have an OCR web service up and running.
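To show how the service might be called, here is a minimal client sketch using only the standard library. It assumes the `/ocr/check` endpoint and the `code`/`msg`/`data` response shape from app.py; the helper names `build_ocr_request`, `extract_texts`, and `recognize`, as well as the `min_score` filter, are hypothetical additions for illustration.

```python
import json
import urllib.request


def build_ocr_request(base_url, image_url):
    """Build a POST request for the /ocr/check endpoint defined in app.py."""
    payload = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/ocr/check",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_texts(response, min_score=0.0):
    """Return recognized strings from a parsed response, dropping low scores."""
    if response.get("code") != 0:
        raise RuntimeError(response.get("msg") or "ocr request failed")
    return [item["text"] for item in response["data"] if item["score"] >= min_score]


def recognize(base_url, image_url, min_score=0.0, timeout=10):
    """Send the image URL to the service and return the recognized strings."""
    req = build_ocr_request(base_url, image_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_texts(json.loads(resp.read().decode("utf-8")), min_score)
```

With the container listening on port 8000, a call would look like `recognize("http://127.0.0.1:8000", image_url, min_score=0.8)`, where `image_url` must be reachable from inside the container because the server downloads the image itself.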

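For ordinary CAPTCHA-style images, the service recognizes about 10 images per second. For higher throughput, raise the worker count in the production image, at the cost of more CPU and memory. (PaddleOCR does support GPU; this setup runs on CPU by default.)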
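The trade-off between throughput and memory can be written down as a small sizing rule. This is a rough sketch, assuming about 2 GB per worker as noted in the production Dockerfile; the function `pick_workers`, the 2 GB reserve for the operating system, and the `OCR_HOST_MEM_GB` environment variable are all hypothetical and not measured values. The names `bind`, `workers`, `worker_class`, and `accesslog` are standard gunicorn settings that mirror the `-b`, `-w`, `-k`, and `--access-logfile` flags, so the file could serve as a `gunicorn.conf.py`.

```python
import os


def pick_workers(total_mem_gb, cpu_count, per_worker_gb=2.0, reserve_gb=2.0):
    """Cap the worker count by both CPU cores and available memory."""
    by_memory = int((total_mem_gb - reserve_gb) // per_worker_gb)
    # Always run at least one worker, even on a very small host.
    return max(1, min(cpu_count, by_memory))


# Assumed host memory in GB; override via the hypothetical OCR_HOST_MEM_GB variable.
_mem_gb = float(os.environ.get("OCR_HOST_MEM_GB", "8"))

bind = "0.0.0.0:8000"
workers = pick_workers(_mem_gb, os.cpu_count() or 1)
worker_class = "gevent"
accesslog = "-"
```

Such a file would be passed with `gunicorn -c gunicorn.conf.py app:app` in place of the individual flags. On an 8 GB host with 4 cores the rule yields 3 workers, which matches the count used in the production image.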
