[Repost] Installing and using tesseract-ocr on a Mac

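0. Introduction

Tesseract is an open-source OCR engine that can recognize more than 100 languages (Chinese, English, Korean, Japanese, German, French, and so on). Its recognition of handwriting, however, is poor.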

1. Installation

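# Note: --with-training-tools and --all-languages are formula options, which only older Homebrew releases accept. Current Homebrew rejects them; there, plain "brew install tesseract" is used and the language data ships in the separate tesseract-lang formula.

# Install the dependency libraries first: libpng, jpeg, libtiff, leptonica
brew install leptonica

# Install tesseract together with the training tools
brew install --with-training-tools tesseract

# Install tesseract together with all languages. The language packs are large and take a long time to install, so this is not recommended; choose languages as needed instead
brew install --all-languages tesseract

# Install tesseract with both the training tools and all languages
brew install --all-languages --with-training-tools tesseract

# Install tesseract only, without the training tools
brew install tesseract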
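
After installing, it can help to confirm that the binary is actually on the PATH before going any further. The following is a minimal sketch; check_cmd is a helper name made up for this example and is not part of tesseract or Homebrew.

```shell
# Report whether a command is available on the PATH.
# check_cmd is a hypothetical helper, introduced only for illustration.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found"
        return 1
    fi
}

# Print the installed version only when tesseract is present.
if check_cmd tesseract; then
    tesseract --version
fi
```

If the check reports "not found", the install step most likely failed or the Homebrew bin directory is missing from the PATH.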

2. Downloading language data

Download from: https://github.com/tesseract-ocr/tessdata

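Pick the language data files you need. This guide targets Simplified Chinese, so the files used here are chi_sim.traineddata and eng.traineddata.
Copy the files into the /usr/local/Cellar/tesseract/3.04.01_2/share/tessdata directory. The version segment of this path depends on the release that is installed, and the data files should match that release as well.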
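
Because the Cellar path contains a version number, it changes with every upgrade. One way around that, sketched here on the assumption that tesseract was installed through Homebrew, is to ask Homebrew for the formula prefix and append share/tessdata. The names tessdata_dir and install_lang are hypothetical helpers, and the Downloads path in the commented usage line is only an example.

```shell
# Build the tessdata path from an install prefix (a trailing slash is tolerated).
tessdata_dir() {
    echo "${1%/}/share/tessdata"
}

# Copy one downloaded .traineddata file into a tessdata directory.
install_lang() {
    cp "$1" "$2/" && echo "installed $(basename "$1")"
}

# Only touch a real installation when Homebrew is available.
if command -v brew >/dev/null 2>&1 && brew --prefix tesseract >/dev/null 2>&1; then
    dest=$(tessdata_dir "$(brew --prefix tesseract)")
    echo "tessdata directory: $dest"
    # Example usage (assumed download location):
    # install_lang "$HOME/Downloads/chi_sim.traineddata" "$dest"
fi
```

Afterwards, tesseract --list-langs should show the newly copied languages.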

3. Using Tesseract
In a terminal, run: tesseract --help

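Basic usage:

# Uses the eng language data by default. imgName is the path to the image; result is the output base name, and the recognized text is written to result.txt

tesseract imgName result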
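
To apply the basic command to a whole folder, the output base name can be derived from each image name. This is a rough sketch rather than a recommended workflow; out_base and ocr_dir are hypothetical helper names, and the code assumes every image file name has an extension.

```shell
# Strip the last extension from an image path to get an output base name.
# Assumes the file name contains an extension (e.g. page1.png).
out_base() {
    echo "${1%.*}"
}

# Run OCR on every .png in a directory; the text lands next to each image as NAME.txt.
ocr_dir() {
    for img in "$1"/*.png; do
        [ -e "$img" ] || continue   # skip when the glob matched nothing
        tesseract "$img" "$(out_base "$img")"
    done
}
```

For example, ocr_dir scans would turn scans/page1.png into scans/page1.txt, using the default eng language data.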

Specifying a language:

# Use Simplified Chinese
tesseract -l chi_sim imgName result
# List the language data installed locally
tesseract --list-langs

Specifying multiple languages:

# Join multiple languages with a + sign
tesseract -l chi_sim+eng imgName result
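
When the language list comes from a script rather than being typed by hand, the + separated value can be assembled from individual codes. The sketch below only prints the command instead of running it; join_langs and ocr_cmd are hypothetical helper names.

```shell
# Join language codes with '+', the separator the -l option expects.
join_langs() {
    ( IFS='+'; echo "$*" )
}

# Print (not run) the tesseract command for an image, an output base, and languages.
ocr_cmd() {
    img=$1; out=$2; shift 2
    echo "tesseract -l $(join_langs "$@") $img $out"
}

ocr_cmd imgName result chi_sim eng
# prints: tesseract -l chi_sim+eng imgName result
```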

One thing deserves special attention: the psm (page segmentation mode) parameter.

# Run this command to list the psm values
tesseract --help-psm

  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.

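 In plain terms (this paraphrase may be imprecise, so prefer the original output above):
 0 Detect orientation and script only (OSD); no text is recognized.
 1 Automatic page segmentation, with OSD.
 2 Automatic page segmentation, without OSD and without OCR (optical character recognition).
 3 Fully automatic page segmentation, without OSD (the default).
 4 Assume a single column of text with varying sizes.
 5 Assume a single uniform block of vertically aligned text.
 6 Assume a single uniform block of text.
 7 Treat the image as a single line of text.
 8 Treat the image as a single word.
 9 Treat the image as a single word arranged in a circle.
 10 Treat the image as a single character.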

Choose the psm value to suit the image. This matters: an unsuitable value can cause recognition to fail.
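
The listing does not show how a psm value is passed on the command line. To the best of my knowledge, Tesseract 3.x spells the option -psm, while 4.0 and later use --psm. The sketch below picks the spelling from a version string and prints the resulting command without running it; psm_flag is a hypothetical helper, and mode 7 (single text line) is just an example choice.

```shell
# Pick the option spelling for a given tesseract version string.
# Assumption: 3.x uses -psm, everything newer uses --psm.
psm_flag() {
    case "$1" in
        3.*) echo "-psm" ;;
        *)   echo "--psm" ;;
    esac
}

# Print (not run) commands that treat the image as a single text line (mode 7).
echo "tesseract imgName result -l eng $(psm_flag 3.04.01) 7"
echo "tesseract imgName result -l eng $(psm_flag 4.1.1) 7"
```

A single-line crop such as a captcha or a serial number is a typical case where mode 7 or 8 tends to work better than the default mode 3.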
