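I have recently been looking into CAPTCHA recognition, mainly from the OCR angle, and these are my summary notes. The reference articles listed at the end already explain CAPTCHA recognition in detail. Doing OCR is not hard; the hard part is raising the recognition rate. The basic pipeline is:
1. Original image
2. Preprocessing (noise removal)
3. Standardization (grayscale conversion, binarization, normalization)
4. Image segmentation (personally I find this the hardest step; there are many algorithms, such as the vertical projection histogram, KNN, and color filling)
5. Feature extraction
6. Machine learning
7. Recognition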
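As a rough illustration of step 4, here is a minimal sketch of segmentation by vertical projection. It assumes the image has already been binarized into rows of 0 (ink) and 255 (background), and that the characters do not touch. The names `column_projection`, `split_columns`, and `toy` are made up for this example.

```python
def column_projection(pixels):
    """Count ink pixels (value 0) in each column of a binarized image.

    `pixels` is a list of rows; each row is a list of 0 (ink) or 255 (background).
    """
    if not pixels:
        return []
    width = len(pixels[0])
    return [sum(1 for row in pixels if row[x] == 0) for x in range(width)]


def split_columns(projection, min_ink=1):
    """Return (start, end) column ranges, end exclusive, that contain ink.

    A column belongs to a character when its ink count is >= min_ink
    (min_ink is an assumed tuning knob, not something from the articles).
    """
    segments = []
    start = None
    for x, count in enumerate(projection):
        if count >= min_ink and start is None:
            start = x                      # a character begins here
        elif count < min_ink and start is not None:
            segments.append((start, x))    # a blank column ends the character
            start = None
    if start is not None:                  # character runs to the right edge
        segments.append((start, len(projection)))
    return segments


# A toy 3x7 "image": two glyphs separated by one blank column.
toy = [
    [255, 0, 0, 255, 0, 255, 255],
    [255, 0, 255, 255, 0, 0, 255],
    [255, 0, 0, 255, 0, 255, 255],
]
projection = column_projection(toy)
segments = split_columns(projection)
print(projection)
print(segments)
```

Each returned range can then be used as the left and right bounds of a crop box. Characters that touch or overlap leave no blank column between them, so this simple approach will not separate them.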
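In short, OCR is a very interesting research topic. It involves a lot of work in computer graphics and image processing, machine learning, and neural networks, and it makes a good entry point for studying machine learning. A ready-made handwritten-digit sample set, MNIST, is available online to play with. The attachments also include a paper on the vector space model (VSM), which clearly explains how to compute the similarity between two objects.
0. Basic use of the PIL API: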
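The similarity measure usually associated with the vector space model is the cosine of the angle between two feature vectors. The sketch below assumes that measure; the attached paper may define similarity differently. The function name `cosine_similarity` is my own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors.

    Returns 0.0 when either vector is all zeros (an assumed convention).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1; orthogonal vectors score 0.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```

For character recognition, the vectors could be, for example, the flattened pixel values of a segmented glyph and of a reference glyph of the same size.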
# -*- coding: utf-8 -*-
from PIL import Image, ImageDraw

path = "/home/yunpeng/test4/data/4399/simple/8.png"

im = Image.open(path)
print('img info:', im.format, im.size, im.mode)
im = im.convert('L')  # convert to grayscale

# Binarize: pixels brighter than 90 become white, the rest black
width, height = im.size
for x in range(width):
    for y in range(height):
        if im.getpixel((x, y)) > 90:
            im.putpixel((x, y), 255)
        else:
            im.putpixel((x, y), 0)

# Trim the blank margins on the left and right
dark_columns = [x for x in range(width)
                if any(im.getpixel((x, y)) < 200 for y in range(height))]
left, right = dark_columns[0], dark_columns[-1]
# The right edge of a crop box is exclusive, so add 1 to keep the last dark column
im = im.crop((left, 0, right + 1, height))

# Vertical projection: count black pixels per column (scaled by 4 for drawing)
width, height = im.size
ps = [0] * width
for x in range(width):
    for y in range(height):
        if im.getpixel((x, y)) == 0:
            ps[x] += 4

# Draw the projection as a bar chart on a 200x200 white canvas
image = Image.new('RGB', (200, 200), (255, 255, 255))
draw = ImageDraw.Draw(image)
for k in range(len(ps)):
    source = (k, 199)           # bar starts at the bottom edge, y = 199
    target = (k, 199 - ps[k])   # bar height is ps[k]; anything above the canvas is clipped
    draw.line([source, target], (100, 100, 100), 1)
image.show()
im.show()
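1. What is a grayscale transformation?
In Photoshop, a grayscale transformation can boost the R, G, and B channels in any proportion and then blend them. Transforming the tones of a black-and-white image is called a grayscale transformation, and so is transforming the colors of a color image.
Take the linear transformation as an example.
It can be written as a linear function: g(x,y) = a' + (b'-a')/(b-a) × (f(x,y) - a)
Here f(x,y) is the gray value of the pixel at (x,y) in the original image, and g(x,y) is its value after the transformation.
[a,b] is the gray range of the original image, and [a',b'] is the gray range of the new image.
Applying this function to the R, G, and B components separately enhances individual colors; the components are then blended for output.
If b'-a' > b-a, the gray range widens, that is, contrast increases and the image looks sharper.
If b'-a' < b-a, the gray range narrows, that is, contrast decreases.
PS: in PIL, an image can be converted to grayscale with im.convert('L').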
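To make the linear transformation concrete, here is a minimal sketch that builds a 256-entry lookup table mapping [a, b] onto [a', b']. The function name `linear_stretch_table` and the choice to clamp values outside [a, b] are my own assumptions.

```python
def linear_stretch_table(a, b, a_new, b_new):
    """Lookup table for g = a_new + (b_new - a_new) / (b - a) * (f - a).

    Input values outside [a, b] are clamped to the nearest end of the
    range first (an assumed choice; the formula itself does not say).
    """
    if b <= a:
        raise ValueError("need a < b")
    scale = (b_new - a_new) / float(b - a)
    table = []
    for value in range(256):
        clamped = min(max(value, a), b)
        table.append(int(round(a_new + scale * (clamped - a))))
    return table


# Stretch the range [50, 150] to the full range [0, 255] to raise contrast.
table = linear_stretch_table(50, 150, 0, 255)
print(table[50], table[150])
```

For a grayscale PIL image, a table like this can be applied with `im.point(table)`, which takes a 256-entry lookup table for single-band images.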
2. What is a histogram?
A histogram counts, for each color value, how many pixels in the image have that value.
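A minimal sketch of that counting for a grayscale image follows, assuming pixel values are integers from 0 to 255 stored as rows of a nested list. The name `gray_histogram` is made up for this example.

```python
def gray_histogram(pixels):
    """Return a 256-entry list: counts[v] is the number of pixels with value v."""
    counts = [0] * 256
    for row in pixels:
        for value in row:
            counts[value] += 1
    return counts


hist = gray_histogram([[0, 255], [255, 90]])
print(hist[0], hist[90], hist[255])
```

PIL provides the same information through `im.histogram()`, which for an 'L' mode image returns a list of 256 counts.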
Reference:
3. How do I install tesseract?
Reference:
4. Reference material
Python Imaging Library (PIL): basic concepts and library overview
http://www.cnblogs.com/wei-li/archive/2012/04/19/2443281.html
http://www.cnblogs.com/wei-li/archive/2012/04/19/2456725.html
http://iysm.net/?tag=pil
Image processing with Python:
http://blog.csdn.net/gzlaiyonghao/article/details/1852726
Computing image similarity ("Python can do it too" series, part 1)
http://blog.csdn.net/gzlaiyonghao/article/details/2325027
Detecting pornographic images in 10 lines of code ("Python can do it too" series, part 2)
http://blog.csdn.net/gzlaiyonghao/article/details/3166735
Recognizing handwritten digits with a BP artificial neural network ("Python can do it too" series, part 3)
http://blog.csdn.net/gzlaiyonghao/article/details/7109898
A discussion of algorithms for recognizing similar images at scale (fairly introductory)
http://caocao.iteye.com/blog/149776
Implementing filters with PIL (1): sketch and pencil-drawing effects
http://blog.sina.com.cn/s/blog_5eeb1e2f0101axvi.html
Image processing: the Hough transform (line detection algorithm)
http://blog.csdn.net/jia20003/article/details/7724530
Simple image processing in Python (the most detailed; parts 1-16, including thinning and the Fourier transform)
http://www.cnblogs.com/xianglan/category/272764.html
A worked example of image CAPTCHA recognition with ImageMagick + tesseract-ocr (fairly high recognition rate):
http://blog.csdn.net/mlks_2008/article/details/8052782
Training tesseract-ocr:
http://www.lixin.me/blog/2012/05/26/29536
Learning OCR, and some tests of tesseract:
http://blog.csdn.net/viewcode/article/details/7784600
Notes on recognizing one website's CAPTCHA (removing the background color):
http://blog.csdn.net/bh20077/article/details/7041280
Cracking simple CAPTCHAs with imagemagick and tesseract-ocr (ruby):
http://hooopo.iteye.com/blog/993538
Building a neural network in Python (IBM; a Hopfield network can reconstruct distorted patterns and remove noise)
http://www.ibm.com/developerworks/cn/linux/l-neurnet/
Weaknesses of common CAPTCHAs, and CAPTCHA recognition
http://drops.wooyun.org/tips/141
A general algorithm for removing interference lines from text images:
http://wenku.baidu.com/view/63bac64f2b160b4e767fcfed.html
Decoding CAPTCHA’s:
http://www.boyter.org/decoding-captchas/
===================================================================
Tesseract OCR training and recognition summary:
http://miphol.com/muse/2013/06/tesseract-ocr-1.html
http://miphol.com/muse/2013/05/tesseract-ocr.html
Tesseract-OCR character recognition: sample training (uses the jTessBoxEditor tool; fairly detailed)
http://blog.csdn.net/firehood_/article/details/8433077
Getting started with the Tesseract-OCR engine
http://blog.csdn.net/xiaochunyong/article/details/7193744
Official Tesseract configuration
http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html
Recognizing image CAPTCHAs with touching characters
http://wenku.baidu.com/view/343c200c581b6bd97f19ead9.html
Research on recognizing CAPTCHAs with distorted and touching characters
http://wenku.baidu.com/view/45896630580216fc700afd16.html
-----------------------------------------------------------------------
wiki:
http://zh.wikipedia.org/zh-cn/captcha
http://en.wikipedia.org/wiki/Image_segmentation
Python Module for Mean Shift Image Segmentation:
http://code.google.com/p/pymeanshift/
Taobao CAPTCHA:
http://pin.aliyun.com/get_img?identity=taoquan.taobao.com&sessionid=1381293634479
CAPTCHA recognition tool: tesseract (the most detailed)
http://hilojack.sinaapp.com/?p=866
How to recognize advanced CAPTCHAs, 鬼仔's Blog (the most advanced)
http://huaidan.org/archives/2085.html
A brief look at OCR with Tesseract:
http://www.cnblogs.com/brooks-dotnet/archive/2010/10/05/1844203.html
Summary of tesseract-ocr usage:
http://hyhx2008.github.io/tesseract-ocrshi-yong-fang-fa-zong-jie.html
The open-source OCR engine Tesseract
http://hi.baidu.com/lifulinghan/item/b59af9eb1d92282d5a7cfb69
Cracking website CAPTCHAs with tesseract-ocr
http://grunt1223.iteye.com/blog/904313
breaking weak captcha in slightly more than 26 lines of groovy-code
http://www.kellyrob99.com/blog/2010/03/14/breaking-weak-captcha-in-slightly-more-than-26-lines-of-groovy-code/
tesseract-ocr 3.02 usage in detail (training the language data)
http://www.cnblogs.com/huyulin/p/3305563.html
Training and using tesseract-ocr3
http://www.cnblogs.com/zcsor/archive/2011/02/21/1959555.html
tesseract java api
http://stackoverflow.com/questions/13974645/using-tesseract-from-java
tesseract python api
http://code.google.com/p/pytesser/
https://github.com/rosarior/pytesser
https://code.google.com/p/pytesser/wiki/README
Recognizing CAPTCHAs: what is your success rate?
http://aoingl.iteye.com/blog/1389232
http://ptlogin.4399.com/ptlogin/captcha.do?captchaId=captchaReq011404b815f6235726
http://www.andrew.cmu.edu/user/ericwu/parch/finalreport.html
[1] L. von Ahn, M. Blum and J. Langford. Telling Humans and Computers Apart Automatically[J], Comm. of the ACM, 46(Aug. 2003), 57-60.
[2] K. Chellapilla, K. Larson, P. Simard and M. Czerwinski. Building Segmentation Based Human-friendly Human Interaction Proofs[C], 2nd Int’l Workshop on Human Interaction Proofs, Springer-Verlag, LNCS 3517, 2005.
[3] J. Yan and A. S. El Ahmad. Usability of CAPTCHAs - Or, Usability issues in CAPTCHA design[C], the fourth Symposium on Usable Privacy and Security, Pittsburgh, USA, July 2008.
[4] K. Chellapilla, K. Larson, P. Simard and M. Czerwinski. Computers beat humans at single character recognition in reading-based Human Interaction Proofs[C], In 2nd Conference on Email and Anti-Spam (CEAS’05), 2005.
[5] J. Yan and A. S. El Ahmad. A Low-cost Attack on a Microsoft CAPTCHA[C], 15th ACM Conference on Computer and Communications Security (CCS’08). Virginia, USA, Oct 27-31, 2008. ACM Press. 543-554.
[6] Microsoft Corporation. Human Interaction Proof (HIP) - Technical and Market Overview[J], 2006. Accessed in Jan 2011.
[7] J. Yan and A. S. El Ahmad. Breaking Visual CAPTCHAs with Naive Pattern Recognition Algorithms[C], in Proc. of the 23rd Annual Computer Security Applications Conference (ACSAC’07). FL, USA, Dec 2007. IEEE computer society. 279-291.
[8] G. Mori and J. Malik. Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA[C], IEEE Conference on Computer Vision and Pattern Recognition (CVPR'03), Vol 1, June 2003, 134-141.
[9] G. Moy, N. Jones, C. Harkless and R. Potter. Distortion estimation techniques in solving visual CAPTCHAs[C], IEEE CVPR, 2004.
[10] K. Chellapilla and P. Simard. Using Machine Learning to Break Visual Human Interaction Proofs[M], Neural Information Processing Systems (NIPS), MIT Press, 2004.
[11] L. von Ahn, M. Blum, N. J. Hopper and J. Langford. CAPTCHA: Using hard AI problems for security[C]. Eurocrypt’2003.
[12] W. Zhang, J. Sun and X. Tang. Cat head detection - how to effectively exploit shape and texture features[C]. In Proc. ECCV 2008, Part IV, LNCS 5305 (2008), 802-816.
[13] P. Golle. Machine learning attacks against the Asirra CAPTCHA[C]. In ACM CCS’2008, 535-542.
[14] http://recaptcha.net/learnmore.html, 2012-10-19.
[15] Elie Bursztein, Matthieu Martin and John C. Mitchell. Text-based CAPTCHA strengths and weaknesses[C]. 18th ACM conference, 2011.
[16] Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham and Manuel Blum. reCAPTCHA: Human-Based Character Recognition via Web Security Measures[J]. Science, 2008, 321(5895): 1465-1468.
[17] Li Ying. Web CAPTCHA generation and recognition[D]. Nanjing University of Science and Technology, graduate thesis, 2008.
[18] Matthew Zeidenberg. Neural Networks in Artificial Intelligence[M]. Ellis Horwood Limited, 1990. ISBN 0-13-612185-3.
[19] Zhang Shuya, Zhao Yiming, Zhao Xiaoyu, et al. Research on recognition methods for authentication-code characters[J]. Journal of Ningbo University (Natural Science and Engineering Edition), 2007, 12(4): 429-433.
[20] Pan Dafu, Wang Bo. A digit CAPTCHA recognition method based on the outer contour[J]. Microcomputer Information (Control and Automation), 2007, 23(9-1): 0256-0258.
[21] Jia Leilei, Chen Xihua, Xiong Chuan. Fuzzy recognition of CAPTCHAs[J]. Journal of Xichang College (Natural Science Edition), 2010, 24(1): 60-62.