Recognizing text in images from wxPython with the pytesser module

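pytesser is a module from Google's open-source OCR project. Importing it in Python lets you convert the text in an image into a string.

pytesser is a wrapper around tesseract: your Python code calls pytesser, and pytesser in turn runs tesseract to recognize the text in the image.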

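The steps below walk through the whole process:

pytesser needs no installer: unpack it into the \Lib\site-packages\ folder of your Python installation and it can be used directly.

The pytesser archive bundles tesseract.exe and the English data pack (so only English is recognized by default), plus a few sample images, so it works as soon as it is unpacked.
You can test it with the following code: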
>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord

The same test as a script, with the other sample images commented out:

from pytesser import *

#im = Image.open('fnord.tif')
#im = Image.open('phototest.tif')
#im = Image.open('eurotext.tif')
im = Image.open('fonts_test.png')
text = image_to_string(im)
print text

Note: this module depends on the PIL library.

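2. Improving a low recognition rate
Boosting the image's contrast, or converting it to black and white, can raise the recognition rate considerably:

import ImageEnhance  # PIL module; not pulled in by "from pytesser import *"

enhancer = ImageEnhance.Contrast(image1)  # image1 is the image opened with Image.open
image2 = enhancer.enhance(4)

Then call image_to_string on image2 to recognize it.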
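
To make the effect of the contrast step concrete, the sketch below reproduces the arithmetic on a plain list of 8-bit gray values instead of a PIL image. It assumes that enhancing contrast by a factor f moves each pixel away from the image's mean gray level, mean + f * (p - mean), clipped to 0..255, which is my understanding of what ImageEnhance.Contrast does. The names stretch_contrast and to_black_and_white are hypothetical helpers, not part of PIL or pytesser.

```python
def stretch_contrast(pixels, factor):
    """Push every gray value away from the mean by `factor`, clipped to 0..255."""
    mean = sum(pixels) / float(len(pixels))
    stretched = []
    for p in pixels:
        value = mean + factor * (p - mean)
        stretched.append(int(min(255, max(0, round(value)))))
    return stretched


def to_black_and_white(pixels, threshold=128):
    """Map every gray value to pure black (0) or pure white (255)."""
    return [255 if p >= threshold else 0 for p in pixels]


# A washed-out scan line: grayish text (140) on a light background (170).
row = [170, 170, 140, 140, 170, 170]

# Thresholding the raw row loses the text: every pixel is above 128.
print(to_black_and_white(row))

# After stretching the contrast by 4, text and background separate cleanly.
enhanced = stretch_contrast(row, 4)
print(enhanced)
print(to_black_and_white(enhanced))
```

On this toy row the text only survives the black-and-white conversion once the contrast has been stretched, which is one plausible reason the enhancement helps tesseract on faint scans.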

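3. Recognizing other languages
tesseract is a command-line program with the following usage:

tesseract  imagename outbase [-l  lang]  [-psm N]  [configfile...]

imagename is the name of the input image
outbase is the base name of the output text file; the result is written to outbase.txt
-l  lang  sets the language to recognize; the default is English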
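
The usage line above maps naturally onto an argument list. The following is a minimal sketch of that mapping; build_tesseract_args is a hypothetical helper name, and the option spellings are copied from the usage line above, so they may differ in other Tesseract versions.

```python
def build_tesseract_args(imagename, outbase, lang="", psm="", exe="tesseract"):
    """Return the argument list for: tesseract imagename outbase [-l lang] [-psm N]."""
    args = [exe, imagename, outbase]
    if lang:
        args += ["-l", lang]
    # Compare against "" rather than testing truthiness, so that psm=0 is kept.
    if str(psm) != "":
        args += ["-psm", str(psm)]
    return args


# English is the default, so no -l option is needed.
print(build_tesseract_args("fnord.tif", "out"))

# Simplified Chinese, treating the image as a single text line (psm 7).
print(build_tesseract_args("temp.bmp", "temp", lang="chi_sim", psm=7))
```

A list like this can be handed to subprocess.Popen without going through a shell, so file names containing spaces need no extra quoting.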

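Other languages can be recognized with the following steps:

(1) Download the data pack for the language:
Put the language pack into pytesser's tessdata folder.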
Next, modify the parameters in pytesser.py. Here is an example:

"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.2, 5/26/08"""

import Image
import subprocess
import os
import StringIO

import util
import errors


tesseract_exe_name = 'dlltest' # Name of executable to be called at command line
scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp" # Leave out the .txt extension
_cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation
_language = "" # Tesseract uses English if language is not given
_pagesegmode = "" # Tesseract uses fully automatic page segmentation if psm is not given (psm is available in v3.01)

_working_dir = os.getcwd()



def call_tesseract(input_filename, output_filename, language, pagesegmode):
    """Calls external tesseract.exe on input file (restrictions on types),
    outputting output_filename+'txt'"""
    current_dir = os.getcwd()
    error_stream = StringIO.StringIO()
    try:
        os.chdir(_working_dir)
        args = [tesseract_exe_name, input_filename, output_filename]
        if len(language) > 0:
            args.append("-l")
            args.append(language)
        if len(str(pagesegmode)) > 0:
            args.append("-psm")
            args.append(str(pagesegmode))
        try:
            proc = subprocess.Popen(args)
        except (TypeError, AttributeError):
            proc = subprocess.Popen(args, shell=True)
        retcode = proc.wait()
        if retcode!=0:
            error_text = error_stream.getvalue()
            errors.check_for_errors(error_stream_text = error_text)
    finally:  # Guarantee that we return to the original directory
        error_stream.close()
        os.chdir(current_dir)



def image_to_string(im, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag):
    """Converts im to file, applies tesseract, and fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        util.image_to_scratch(im, scratch_image_name)
        call_tesseract(scratch_image_name, scratch_text_name_root, lang, psm)
        result = util.retrieve_result(scratch_text_name_root)
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return result



def image_file_to_string(filename, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag, graceful_errors=True):
    """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
    converts to compatible format and then applies tesseract.  Fetches resulting text.
    If cleanup=True, delete scratch files after operation. Parameter lang specifies used language.
    If lang is empty, English is used. Page segmentation mode parameter psm is available in Tesseract 3.01.
    psm values are:
    0 = Orientation and script detection (OSD) only.
    1 = Automatic page segmentation with OSD.
    2 = Automatic page segmentation, but no OSD, or OCR
    3 = Fully automatic page segmentation, but no OSD. (Default)
    4 = Assume a single column of text of variable sizes.
    5 = Assume a single uniform block of vertically aligned text.
    6 = Assume a single uniform block of text.
    7 = Treat the image as a single text line.
    8 = Treat the image as a single word.
    9 = Treat the image as a single word in a circle.
    10 = Treat the image as a single character."""
    try:
        try:
            call_tesseract(filename, scratch_text_name_root, lang, psm)
            result = util.retrieve_result(scratch_text_name_root)
        except errors.Tesser_General_Exception:
            if graceful_errors:
                im = Image.open(filename)
                # Pass lang and psm through; passing only cleanup here would
                # bind it to the lang parameter of image_to_string.
                result = image_to_string(im, lang, psm, cleanup)
            else:
                raise
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return result

        



if __name__=='__main__':
    im = Image.open('phototest.tif')
    text = image_to_string(im, cleanup=False)
    print text
    text = image_to_string(im, psm=2, cleanup=False)
    print text
    try:
        text = image_file_to_string('fnord.tif', graceful_errors=False)
    except errors.Tesser_General_Exception, value:
        print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
        #print value
    text = image_file_to_string('fnord.tif', graceful_errors=True, cleanup=False)
    print "fnord.tif contents:", text
    text = image_file_to_string('fonts_test.png', graceful_errors=True)
    print text
    text = image_file_to_string('fonts_test.png', lang="eng", psm=4, graceful_errors=True)
    print text





The listing above is the one shipped with the source. If all you need is to recognize other languages, adding a single language parameter is enough. Here is my version:

"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.1, 3/10/07"""

import Image
import subprocess
import util
import errors

tesseract_exe_name = 'tesseract' # Name of executable to be called at command line
scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp" # Leave out the .txt extension
cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation



def call_tesseract(input_filename, output_filename, language):
    """Calls external tesseract.exe on input file (restrictions on types),
    outputting output_filename+'txt'"""
    args = [tesseract_exe_name, input_filename, output_filename, "-l", language]
    proc = subprocess.Popen(args)
    retcode = proc.wait()
    if retcode!=0:
        errors.check_for_errors()



def image_to_string(im, cleanup = cleanup_scratch_flag, language = "eng"):
    """Converts im to file, applies tesseract, and fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        util.image_to_scratch(im, scratch_image_name)
        call_tesseract(scratch_image_name, scratch_text_name_root, language)
        text = util.retrieve_text(scratch_text_name_root)
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text



def image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True, language = "eng"):
    """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
    converts to compatible format and then applies tesseract.  Fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        try:
            call_tesseract(filename, scratch_text_name_root, language)
            text = util.retrieve_text(scratch_text_name_root)
        except errors.Tesser_General_Exception:
            if graceful_errors:
                im = Image.open(filename)
                # Forward language too, otherwise the fallback path always uses "eng".
                text = image_to_string(im, cleanup, language)
            else:
                raise
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text

	



if __name__=='__main__':
    im = Image.open('phototest.tif')
    text = image_to_string(im)
    print text
    try:
        text = image_file_to_string('fnord.tif', graceful_errors=False)
    except errors.Tesser_General_Exception, value:
        print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
        print value
    text = image_file_to_string('fnord.tif', graceful_errors=True)
    print "fnord.tif contents:", text
    text = image_file_to_string('fonts_test.png', graceful_errors=True)
    print text




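When calling image_to_string, just add the matching language argument as the last parameter: chi_sim for Simplified Chinese, chi_tra for Traditional Chinese.
The value is the XXX part of the downloaded XXX.traineddata file name. For example, if the downloaded Chinese pack is chi_sim.traineddata, the argument is chi_sim: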
text = image_to_string(self.im, language = 'chi_sim')
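
Since the language argument is just the file name without the .traineddata suffix, the installed packs can be listed before calling image_to_string. This is a minimal sketch under that assumption; available_languages and check_language are hypothetical helper names, and the tessdata path is whatever folder the language packs were copied into.

```python
import os


def available_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    suffix = ".traineddata"
    codes = []
    for name in os.listdir(tessdata_dir):
        if name.endswith(suffix):
            codes.append(name[:-len(suffix)])
    return sorted(codes)


def check_language(tessdata_dir, lang):
    """Raise ValueError early if the pack for `lang` has not been installed."""
    if lang not in available_languages(tessdata_dir):
        raise ValueError("no %s.traineddata in %s" % (lang, tessdata_dir))
    return lang
```

With these helpers, a call such as image_to_string(im, language=check_language("tessdata", "chi_sim")) fails with a readable message when the pack is missing, instead of an error from tesseract itself.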

That completes the image recognition.

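One extra note: the Chinese text may be recognized correctly but still come out garbled. In that case, convert text to the Chinese encoding you are using, for example:
text.decode("utf8")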
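
If the encoding is not known for certain, one option is to try a short list of candidates in order. The sketch below assumes the recognized text arrives as a byte string. decode_ocr_text is a hypothetical helper name, and the GBK fallback is my assumption for Chinese Windows setups, not something pytesser does.

```python
def decode_ocr_text(raw, encodings=("utf-8", "gbk")):
    """Decode a byte string of recognized text, trying each encoding in turn."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going and mark undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], "replace")


# Stand-in for the bytes read back from the OCR output ("中文" in UTF-8).
raw = u"\u4e2d\u6587".encode("utf-8")
text = decode_ocr_text(raw)
print(text == u"\u4e2d\u6587")
```

Trying UTF-8 first is deliberate: GBK-encoded Chinese is rarely valid UTF-8, so a wrong guess tends to fail loudly and fall through to the next candidate.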

 
