Recognizing in-memory images with tesseract + fixing tesseract not recognizing the leftmost character

Further reading: [tesseract configuration notes 1](http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version)
Further reading: [tesseract configuration notes 2](https://stackoverflow.com/questions/13007245/how-to-find-parameters-supported-in-tesseract-ocr-config-file)

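This post covers two problems:

  1. How to recognize an image scraped from a web page directly in memory, without saving it to disk
    Wrap the response body in a stream with image = BytesIO(response.content)
  2. How to fix tesseract failing to recognize the leftmost character
    Pass config="--psm 6 --oem 3 -c tessedit_char_whitelist=0123456789" (--psm 6 treats the image as a single uniform block of text; the whitelist restricts output to digits)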
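
The two points above can be tried without any network access. The sketch below is only an illustration: load_image_from_bytes and build_tesseract_config are hypothetical helper names that do not appear in the original code, and a blank PNG encoded in memory stands in for response.content.

```python
from io import BytesIO

from PIL import Image


def load_image_from_bytes(data: bytes) -> Image.Image:
    """Open an image directly from raw bytes, without touching the disk."""
    image = Image.open(BytesIO(data))
    image.load()  # decode now, so the image no longer depends on the buffer
    return image


def build_tesseract_config(psm: int = 6, oem: int = 3,
                           whitelist: str = "0123456789") -> str:
    """Assemble the config string passed to pytesseract (hypothetical helper)."""
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={whitelist}"


if __name__ == "__main__":
    # Stand-in for response.content: encode a blank PNG entirely in memory.
    buffer = BytesIO()
    Image.new("RGB", (120, 40), "white").save(buffer, format="PNG")

    image = load_image_from_bytes(buffer.getvalue())
    print(image.size, image.format)
    print(build_tesseract_config())
```

Keeping the config in one helper makes it easy to try a different --psm value or whitelist if the default combination does not suit another image.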

Here is the full code:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import requests
import pytesseract
from PIL import Image
from io import BytesIO


headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36"
}
url = "http://static8.ziroom.com/phoenix/pc/images/price/e72ac241b410eac63a652dc1349521fd.png"

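# Download the image; fail early on network errors or a non-2xx status.
response = requests.get(url=url, headers=headers, timeout=10)
response.raise_for_status()

# Optional: keep a copy on disk for visual debugging. OCR below does not need it.
with open("test.png", "wb") as f:
    f.write(response.content)

# Wrap the raw bytes in an in-memory stream so PIL can open them like a file.
image = BytesIO(response.content)
im = Image.open(image)
# --psm 6: treat the image as a single uniform block of text.
# --oem 3: default OCR engine mode.
# tessedit_char_whitelist: only allow digits in the output.
text = pytesseract.image_to_string(im, lang="eng", config="--psm 6 --oem 3 -c tessedit_char_whitelist=0123456789")
print(text)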
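
If the leftmost character is still dropped with this config, one commonly suggested complementary step is to pad the image with a white margin before OCR, since tesseract can struggle with text that touches the image edge. This is not part of the original fix, and whether it helps for this particular image is an assumption. A minimal sketch, where add_white_border and recognize_digits are hypothetical helper names:

```python
from PIL import Image, ImageOps


def add_white_border(image: Image.Image, border: int = 10) -> Image.Image:
    """Return a copy of the image with a white margin on every side."""
    # Convert first so the fill colour is valid regardless of the source mode.
    rgb = image.convert("RGB")
    return ImageOps.expand(rgb, border=border, fill="white")


def recognize_digits(image: Image.Image) -> str:
    """Run OCR on the padded image with the config used in this post."""
    import pytesseract  # imported lazily; needs the tesseract binary installed

    config = "--psm 6 --oem 3 -c tessedit_char_whitelist=0123456789"
    padded = add_white_border(image)
    return pytesseract.image_to_string(padded, lang="eng", config=config).strip()
```

With the code above, recognize_digits(im) would replace the direct pytesseract.image_to_string call. Note that an image with a transparent background turns black when converted to RGB, so such images may need to be composited onto white first.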
