Python Web Scraping: Recognizing CAPTCHAs with Tesseract

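I recently had a task that involved handling CAPTCHAs. My previous solution was to hook into a paid CAPTCHA-solving service, but shouldn't a good scraper cost nothing to run? So I made myself spend a day studying Tesseract properly, and the steps below summarize what I learned.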

  1. Convert the image to grayscale
from PIL import Image
image = Image.open(image_path).convert('L')  # 'L' = 8-bit grayscale
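  1. Binarize the image
# 1. Plot a histogram of the gray-level distribution and read a threshold off it
from collections import defaultdict
import matplotlib.pyplot as plt

w, h = image.size
gray_dict = defaultdict(int)
for x in range(w):
    for y in range(h):
        pixel = image.getpixel((x, y))
        gray_dict[pixel] += 1
plt.bar(list(gray_dict.keys()), list(gray_dict.values()))
plt.show()
# 2. Binarize: pixels darker than the threshold become black, the rest white.
# `threshold` is the gray level you picked from the histogram above.
# point() returns a new image, so the result must be assigned.
image = image.point(lambda x: 0 if x < threshold else 255)
# Alternative that produces a 1-bit image:
# image = image.point(lambda x: 0 if x < threshold else 1, '1')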
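If reading the threshold off the histogram by eye gets tedious, it can also be estimated automatically. The sketch below uses Otsu's method, which is my substitution and not something from the steps above; the function names `otsu_threshold` and `binarize` are made up for illustration, and it assumes the image is already in 'L' mode.

```python
from PIL import Image


def otsu_threshold(image):
    """Estimate a gray level that separates dark and light pixels (Otsu's method)."""
    hist = image.histogram()  # 256 counts for an 'L' mode image
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level, count in enumerate(hist):
        weight_dark += count
        sum_dark += level * count
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (up to a constant factor).
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    # Levels <= best_level form the dark class, so cut just above it.
    return best_level + 1


def binarize(image, threshold=None):
    """Return a black/white copy; pixels below the threshold become black."""
    if threshold is None:
        threshold = otsu_threshold(image)
    return image.point(lambda x: 0 if x < threshold else 255)
```

Calling `binarize(image)` with no threshold falls back to the estimate. For CAPTCHAs with colored noise lines a hand-picked value may still work better, so treat the estimate as a starting point.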
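  1. Remove noise
Iterate over the pixels and use the color values of each pixel's eight neighbors to decide whether it is noise; if it is, set it to white. The exact rule depends on the image.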
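One possible version of the eight-neighbor check looks like this. It is only a sketch: it assumes a binarized 'L' image where text is black (0) and background is white (255), and both the function name `remove_isolated_pixels` and the `min_neighbors` cutoff are my own choices that you would tune per CAPTCHA.

```python
from PIL import Image

WHITE, BLACK = 255, 0


def remove_isolated_pixels(image, min_neighbors=2):
    """Return a copy in which black pixels with too few black neighbors are whitened."""
    w, h = image.size
    src = image.load()
    result = image.copy()  # write to a copy so earlier changes don't affect later checks
    dst = result.load()
    for x in range(w):
        for y in range(h):
            if src[x, y] != BLACK:
                continue
            black_neighbors = 0
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    if dx == 0 and dy == 0:
                        continue  # skip the pixel itself
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < w and 0 <= ny < h and src[nx, ny] == BLACK:
                        black_neighbors += 1
            if black_neighbors < min_neighbors:
                dst[x, y] = WHITE
    return result
```

Reading from the original and writing to a copy keeps the result independent of scan order. Thicker noise (lines, blobs) will survive this filter and needs a different rule.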
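  1. Split the image
If the characters in the image sit right next to each other, the recognition rate drops sharply. In that case it is best to cut the image into individual characters first.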
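A simple way to cut is a vertical projection: find the columns that contain no black pixels and split there. The sketch below assumes a binarized 'L' image with black text, and `split_by_columns` is a hypothetical helper name. It only works when at least one blank column separates neighboring characters; characters that actually touch need another strategy, such as fixed-width cuts.

```python
from PIL import Image


def split_by_columns(image, black=0):
    """Split a binarized image into pieces separated by blank columns."""
    w, h = image.size
    pixels = image.load()
    # True for every column that contains at least one black pixel.
    has_ink = [any(pixels[x, y] == black for y in range(h)) for x in range(w)]

    pieces, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x  # a new character begins
        elif not ink and start is not None:
            pieces.append(image.crop((start, 0, x, h)))  # character ended at x - 1
            start = None
    if start is not None:  # last character touches the right edge
        pieces.append(image.crop((start, 0, w, h)))
    return pieces
```

Each piece can then be passed to the recognition step on its own, typically with a single-character page segmentation mode.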
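  1. Recognize the CAPTCHA
import pytesseract
text = pytesseract.image_to_string(image, lang=language_pack)  # language_pack: name of the language pack to use, e.g. 'eng'
# If no characters are recognized, try adding a config argument as appropriate, e.g. text = pytesseract.image_to_string(image, lang=language_pack, config='--psm 7')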
'''
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Here is a sample usage of image_to_string with multiple parameters.
target = pytesseract.image_to_string(image, lang='eng',
    config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
'''
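Since the right page segmentation mode is usually found by trial and error, the retry can be wrapped in a small helper. This is a rough sketch under my own assumptions: `build_config` and `recognize` are hypothetical names, the candidate modes 7, 8 and 13 are just a plausible starting set, and depending on the Tesseract version and engine the character whitelist may not be honored.

```python
def build_config(psm=7, whitelist=None):
    """Build a Tesseract config string such as '--psm 7'."""
    parts = ['--psm {}'.format(psm)]
    if whitelist:
        parts.append('-c tessedit_char_whitelist={}'.format(whitelist))
    return ' '.join(parts)


def recognize(image, lang='eng', psm_candidates=(7, 8, 13), whitelist=None):
    """Try several page segmentation modes and return the first non-empty result."""
    # Imported here so build_config stays usable without Tesseract installed.
    import pytesseract

    for psm in psm_candidates:
        text = pytesseract.image_to_string(
            image, lang=lang, config=build_config(psm, whitelist)
        ).strip()
        if text:
            return text
    return ''  # nothing recognized with any of the candidate modes
```

For a digits-only CAPTCHA, a call might look like `recognize(image, whitelist='0123456789')`.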
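  1. Train your own data
If the recognition accuracy in the previous step is low, use jTessBoxEditor to train on your own samples and produce a language pack. I personally prefer to do the whole process inside jTessBoxEditor rather than deal with the complicated command line.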

The steps above can handle most simple CAPTCHAs, but the exact processing always depends on the images you are dealing with.
