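Pytesser: OCR in Python using the Tesseract engine from Google
pytesser is a Python wrapper module for Tesseract, Google's open-source OCR project. Import it in Python and you can convert the text in an image into a string.
Link: https://code.google.com/p/pytesser/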
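pytesser calls tesseract: your Python code calls the pytesser module, and pytesser in turn runs tesseract to recognize the text in the image.
The steps below walk through the whole process: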
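1. First, download pytesser from code.google.com: https://code.google.com/p/pytesser/downloads/detail?name=pytesser_v0.0.1.zip
It needs no installation; you can place it under \Lib\site-packages\ in your Python installation folder and use it directly.
pytesser ships with tesseract.exe and the English data pack (by default it recognizes English only), plus some sample images, so it is ready to use once unzipped.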
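If you are not sure where the site-packages folder is, the interpreter can report it. The following is a minimal sketch for Python 3 using the standard sysconfig module; the helper name site_packages_dir is made up for this example and is not part of pytesser.

```python
import os
import sysconfig


def site_packages_dir():
    """Return the directory where pure-Python third-party modules are installed."""
    # "purelib" is the sysconfig path name for the site-packages directory.
    return sysconfig.get_paths()["purelib"]


if __name__ == "__main__":
    print("site-packages folder:", site_packages_dir())
```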
You can test it with the following code:
- >>> from pytesser import *
- >>> image = Image.open('fnord.tif')
- >>> print image_to_string(image)
- fnord
- >>> print image_file_to_string('fnord.tif')
- fnord
The same kind of test as a script:
- from pytesser import *
-
- im = Image.open('fonts_test.png')
- text = image_to_string(im)
- print text
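Note: this module requires the PIL library.
2. Improving a low recognition rate
You can enhance the image, for example by raising its contrast, or convert it to black and white. Either can improve the recognition rate considerably: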
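- import ImageEnhance
-
- enhancer = ImageEnhance.Contrast(image1)  # image1: an image opened with Image.open
- image2 = enhancer.enhance(4)  # a factor above 1 raises the contrast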
You can then call image_to_string on image2 to recognize it.
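For the black-and-white conversion, one common approach is a fixed threshold: every gray level below a cutoff becomes black and everything else becomes white. The sketch below is written for Python 3 and only builds the 256-entry lookup table; the helper name threshold_table and the cutoff of 128 are assumptions for illustration, not something pytesser provides.

```python
def threshold_table(cutoff=128):
    """Build a 256-entry lookup table mapping gray levels to pure black or white."""
    if not 0 <= cutoff <= 256:
        raise ValueError("cutoff must be between 0 and 256")
    # Gray levels below the cutoff become 0 (black), the rest 255 (white).
    return [0 if level < cutoff else 255 for level in range(256)]


if __name__ == "__main__":
    table = threshold_table(128)
    print(table[100], table[200])  # prints: 0 255
```

With PIL, a table like this can be applied to a grayscale image through image.convert('L').point(table) before passing the result to image_to_string. The best cutoff depends on the image, so it is worth trying a few values.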
3. Recognizing other languages
tesseract is a command-line program with the following usage:
tesseract imagename outbase [-l lang] [-psm N] [configfile...]
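imagename is the name of the input image.
outbase is the base name of the output text file; tesseract appends .txt, so the result is written to outbase.txt.
-l lang sets the language to recognize; the default is English.
For details, see http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html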
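The synopsis above can be mirrored in a small helper that assembles the command line. This is a Python 3 sketch under stated assumptions: build_tesseract_args and run_tesseract are hypothetical names, the tesseract executable is assumed to be on the PATH, and the -psm spelling follows the synopsis quoted in this post (other tesseract versions may spell that flag differently).

```python
import subprocess


def build_tesseract_args(imagename, outbase, lang=None, psm=None, exe="tesseract"):
    """Assemble the command line: tesseract imagename outbase [-l lang] [-psm N]."""
    args = [exe, imagename, outbase]
    if lang:
        args += ["-l", lang]
    if psm is not None:
        args += ["-psm", str(psm)]
    return args


def run_tesseract(imagename, outbase, lang=None, psm=None):
    """Run tesseract and return the recognized text read from outbase.txt."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_args(imagename, outbase, lang, psm), check=True)
    # tesseract appends ".txt" to the output base name.
    with open(outbase + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    print(build_tesseract_args("fnord.tif", "temp", lang="eng", psm=4))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.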
Follow these steps to recognize other languages:
(1) Download the data packs for the other languages:
https://code.google.com/p/tesseract-ocr/downloads/list
Put the language pack in pytesser's tessdata folder.
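After copying the files, it can help to check which languages are actually present. The following Python 3 sketch assumes the language packs are XXX.traineddata files, the naming this post uses for the Chinese packs; language_codes and available_languages are hypothetical helpers, and the tessdata path is wherever you unpacked pytesser.

```python
import os

SUFFIX = ".traineddata"


def language_codes(filenames):
    """Return the sorted language codes for the *.traineddata names in filenames."""
    return sorted(name[: -len(SUFFIX)] for name in filenames if name.endswith(SUFFIX))


def available_languages(tessdata_dir):
    """List the language codes installed in the given tessdata folder."""
    return language_codes(os.listdir(tessdata_dir))


if __name__ == "__main__":
    folder = "tessdata"  # assumed location: pytesser's tessdata folder
    if os.path.isdir(folder):
        print(available_languages(folder))
```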
Next, change the parameters in pytesser.py. Here is an example:
- # pytesser.py as provided in the source (Python 2)
- import Image
- import subprocess
- import os
- import StringIO
-
- import util
- import errors
-
- # Name of the tesseract executable, as set in this source copy
- tesseract_exe_name = 'dlltest'
- scratch_image_name = "temp.bmp"  # scratch image handed to tesseract
- scratch_text_name_root = "temp"  # tesseract writes its output to temp.txt
- _cleanup_scratch_flag = True  # delete the scratch files afterwards
- _language = ""  # empty string: use tesseract's default language
- _pagesegmode = ""  # empty string: use tesseract's default page segmentation mode
-
- _working_dir = os.getcwd()
-
- def call_tesseract(input_filename, output_filename, language, pagesegmode):
-     """Run tesseract on input_filename and write the text to output_filename.txt."""
-     current_dir = os.getcwd()
-     error_stream = StringIO.StringIO()
-     try:
-         os.chdir(_working_dir)
-         args = [tesseract_exe_name, input_filename, output_filename]
-         # Only pass -l and -psm when a value was given
-         if len(language) > 0:
-             args.append("-l")
-             args.append(language)
-         if len(str(pagesegmode)) > 0:
-             args.append("-psm")
-             args.append(str(pagesegmode))
-         try:
-             proc = subprocess.Popen(args)
-         except (TypeError, AttributeError):
-             proc = subprocess.Popen(args, shell=True)
-         retcode = proc.wait()
-         if retcode!=0:
-             error_text = error_stream.getvalue()
-             errors.check_for_errors(error_stream_text = error_text)
-     finally:
-         # Restore the previous state even if tesseract failed
-         error_stream.close()
-         os.chdir(current_dir)
-
- def image_to_string(im, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag):
-     """Convert a PIL image to text by saving it to a scratch file and running tesseract on it."""
-     try:
-         util.image_to_scratch(im, scratch_image_name)
-         call_tesseract(scratch_image_name, scratch_text_name_root, lang, psm)
-         result = util.retrieve_result(scratch_text_name_root)
-     finally:
-         if cleanup:
-             util.perform_cleanup(scratch_image_name, scratch_text_name_root)
-     return result
-
- def image_file_to_string(filename, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag, graceful_errors=True):
-     """Run tesseract directly on an image file.
-
-     If tesseract cannot read the file and graceful_errors is true,
-     open the file with PIL and retry through image_to_string.
-     """
-     try:
-         try:
-             call_tesseract(filename, scratch_text_name_root, lang, psm)
-             result = util.retrieve_result(scratch_text_name_root)
-         except errors.Tesser_General_Exception:
-             if graceful_errors:
-                 im = Image.open(filename)
-                 # Forward lang and psm too; passing cleanup as the second
-                 # positional argument would be taken as lang
-                 result = image_to_string(im, lang, psm, cleanup)
-             else:
-                 raise
-     finally:
-         if cleanup:
-             util.perform_cleanup(scratch_image_name, scratch_text_name_root)
-     return result
-
-
- if __name__=='__main__':
-     im = Image.open('phototest.tif')
-     text = image_to_string(im, cleanup=False)
-     print text
-     text = image_to_string(im, psm=2, cleanup=False)
-     print text
-     try:
-         text = image_file_to_string('fnord.tif', graceful_errors=False)
-     except errors.Tesser_General_Exception as value:
-         print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
-
-     text = image_file_to_string('fnord.tif', graceful_errors=True, cleanup=False)
-     print "fnord.tif contents:", text
-     text = image_file_to_string('fonts_test.png', graceful_errors=True)
-     print text
-     text = image_file_to_string('fonts_test.png', lang="eng", psm=4, graceful_errors=True)
-     print text
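That is the version provided in the source. If all you need is to recognize other languages, adding a single language parameter is enough. Here is my version: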
- # Simplified pytesser.py with an added language parameter (Python 2)
- import Image
- import subprocess
- import util
- import errors
-
- tesseract_exe_name = 'tesseract'  # name of the tesseract executable
- scratch_image_name = "temp.bmp"  # scratch image handed to tesseract
- scratch_text_name_root = "temp"  # tesseract writes its output to temp.txt
- cleanup_scratch_flag = True  # delete the scratch files afterwards
-
- def call_tesseract(input_filename, output_filename, language):
-     """Run tesseract on input_filename with the given language; the text goes to output_filename.txt."""
-     args = [tesseract_exe_name, input_filename, output_filename, "-l", language]
-     proc = subprocess.Popen(args)
-     retcode = proc.wait()
-     if retcode!=0:
-         errors.check_for_errors()
-
- def image_to_string(im, cleanup = cleanup_scratch_flag, language = "eng"):
-     """Convert a PIL image to text by saving it to a scratch file and running tesseract on it."""
-     try:
-         util.image_to_scratch(im, scratch_image_name)
-         call_tesseract(scratch_image_name, scratch_text_name_root, language)
-         text = util.retrieve_text(scratch_text_name_root)
-     finally:
-         if cleanup:
-             util.perform_cleanup(scratch_image_name, scratch_text_name_root)
-     return text
-
- def image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True, language = "eng"):
-     """Run tesseract directly on an image file; if that fails and graceful_errors
-     is true, open the file with PIL and retry through image_to_string."""
-     try:
-         try:
-             call_tesseract(filename, scratch_text_name_root, language)
-             text = util.retrieve_text(scratch_text_name_root)
-         except errors.Tesser_General_Exception:
-             if graceful_errors:
-                 im = Image.open(filename)
-                 # Forward the language so the fallback does not silently use "eng"
-                 text = image_to_string(im, cleanup, language)
-             else:
-                 raise
-     finally:
-         if cleanup:
-             util.perform_cleanup(scratch_image_name, scratch_text_name_root)
-     return text
-
-
- if __name__=='__main__':
-     im = Image.open('phototest.tif')
-     text = image_to_string(im)
-     print text
-     try:
-         text = image_file_to_string('fnord.tif', graceful_errors=False)
-     except errors.Tesser_General_Exception as value:
-         print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
-         print value
-     text = image_file_to_string('fnord.tif', graceful_errors=True)
-     print "fnord.tif contents:", text
-     text = image_file_to_string('fonts_test.png', graceful_errors=True)
-     print text
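When calling image_to_string, just pass the matching language argument as the last parameter: chi_sim for Simplified Chinese, chi_tra for Traditional Chinese.
The value is the XXX in the name of the downloaded XXX.traineddata file. For example, if the Chinese pack you downloaded is chi_sim.traineddata, the argument is chi_sim: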
- text = image_to_string(self.im, language = 'chi_sim')
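That completes the image recognition.
One more note: the Chinese text may be recognized correctly but still come out garbled. tesseract writes its output as UTF-8, so decode it first and then, if needed, re-encode it to the Chinese encoding you use, for example:
text.decode("utf8")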
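The decoding step can be sketched as follows in Python 3, where the raw output is bytes and the decoded result is str. The helper name to_text is an assumption for this example, and it presumes the recognized text arrives as UTF-8 bytes.

```python
def to_text(raw, encoding="utf-8"):
    """Decode OCR output bytes to str; values that are already str pass through."""
    if isinstance(raw, bytes):
        # errors="replace" keeps going if a byte sequence is damaged.
        return raw.decode(encoding, errors="replace")
    return raw


if __name__ == "__main__":
    raw = "中文".encode("utf-8")  # stand-in for bytes read from tesseract's output
    text = to_text(raw)
    print(text == "中文")
    # Re-encode only if a consumer needs a specific encoding, e.g. GBK.
    gbk_bytes = text.encode("gbk")
```

Garbled output usually means the bytes were decoded or displayed with the wrong encoding, so decoding once from UTF-8 and encoding only at the point of output keeps the two concerns separate.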