Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0.

  • ReadMe - Installation and usage information.

  • Compiling - How to build Tesseract on a variety of platforms.

  • FAQ - Common questions and problems. Please check before filing a bug or consulting the forum.

  • Too many errors? - See the guidance on getting the best out of Tesseract.


Supported Platforms

Tesseract works on Linux, Windows (with VC++ Express or Cygwin) and Mac OS X. See the ReadMe for more details and install instructions. It can also be compiled for other platforms, including Android and the iPhone, though these are not as well tested. See also the AddOns page for other projects using Tesseract on various platforms.

If you're interested in supporting other platforms or languages, please get in touch with Ray Smith or the Developers.

A Note about Downloads

With the discontinuation of downloads at code.google.com, new source downloads will be posted to Google Drive. Other download folders will be set up as new files are uploaded, and the original Downloads page will go away. During the transition, other downloads can still be found at the Old Downloads page.

Roadmap

The Version 3.03 release candidate is now available for download (source only so far) and contains many new features. (See the ReleaseNotes for a full list.) Please check out the ReadMe before going to Downloads, as you need more than one file. Even the Windows executables tarball is incomplete, as language files are required. Most notable new features:

  • PDF output.

  • New Renderer for extracting detailed recognition information at a document level.
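As a rough illustration of the PDF output feature, the sketch below builds a Tesseract command line from Python and runs it only if the executable is on the PATH. It assumes the classic usage form `tesseract imagename outputbase [-l lang] [configfile]` and that PDF output is selected by passing a config named `pdf`. The helper names `build_tesseract_command` and `run_ocr` and the file names are hypothetical; check the ReadMe for the exact options of your version.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", config=None):
    """Return the argument list for a single Tesseract run.

    Assumed usage form: tesseract imagename outputbase [-l lang] [configfile]
    """
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if config:
        # Assumption: a config named "pdf" requests PDF output.
        cmd.append(config)
    return cmd


def run_ocr(image_path, output_base, lang="eng", config=None):
    """Run Tesseract, failing early if the executable is not installed."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_command(image_path, output_base, lang, config)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical input file; with the pdf config the result would be page.pdf.
    run_ocr("page.png", "page", lang="eng", config="pdf")
```

Without the config argument, the same call writes plain text to the output base instead. The language data for the chosen `lang` must be installed separately, as noted above.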

Version 3.03 ships with recent Linux distributions such as Ubuntu 14.04.

Version 3.02 ships with Ubuntu 12.04.

Core Developers

The core developer on the project is Ray Smith (theraysmith).

In related work, Thomas Breuel (tmbdev) and Ilya Mezhirov (mezhirov) work on the OCRopus project, which also provides layout analysis and statistical language modeling.

Most of the work on Tesseract is sponsored by Google.


Download:


Filename | Summary | Labels | Uploaded | Release Date | Size | Download Count
---------|---------|--------|----------|--------------|------|---------------
tesseract-ocr-3.02.grc.tar.gz | Ancient Greek Language data for Tesseract 3.02.02 | | Apr 2013 | Apr 2013 | 3.3 MB | 20922
tesseract-ocr-3.02.epo_alt.tar.gz | Esperanto alternative language data for Tesseract 3.02 | | Nov 2012 | Nov 2012 | 1.4 MB | 6196
tesseract-3.02.02-win32-lib-include-dirs.zip | VC++ libraries of Tesseract OCR 3.02.02 (32bit) | Featured | Nov 2012 | Nov 2012 | 28.0 MB | 55968
tesseract-ocr-setup-3.02.02.exe | Windows installer of tesseract-ocr 3.02.02 (including English language data) | Featured | Nov 2012 | Nov 2012 | 12.9 MB | 138286
tesseract-ocr-3.02.02.tar.gz | Tesseract OCR 3.02.02 Source | Featured | Nov 2012 | Nov 2012 | 3.7 MB | 100480
tesseract-ocr-3.02.02-doc-html.tar.gz | Tesseract 3.02.02 html doc | Featured | Nov 2012 | Nov 2012 | 10.1 MB | 51840
tesseract-ocr-3.02.eng.tar.gz | English language data for Tesseract 3.02 | | Oct 2012 | Oct 2012 | 12.1 MB | 81418