Extracting Text from Images with Tesseract (CentOS + Java)

Download links

Tesseract official site

https://github.com/tesseract-ocr/tesseract

Leptonica

http://www.leptonica.org/

Installing Tesseract on CentOS 7.2

https://github.com/tesseract-ocr/tesseract/wiki

Installation steps

https://github.com/tesseract-ocr/tesseract/wiki/Compiling#linux
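The official wiki describes the installation steps for Ubuntu; CentOS differs in a few places. The base dependencies must be installed on either system. For example, where the Ubuntu instructions use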

sudo apt-get install libpng-dev

the equivalent command on CentOS is

sudo yum install libpng-devel

Use exactly the Leptonica version that the official site specifies for your Tesseract release. Install Leptonica first, then Tesseract.


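I originally chose the 3.05.01 release together with leptonica-1.74.4.tar.gz. Even after repeatedly confirming that Leptonica had been installed correctly, configure failed with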

configure: error: Leptonica 1.7.4 or higher is required. Try to install libleptonica-dev package.

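Image recognition finally succeeded after I switched to tesseract-3.0.5.Release and leptonica-1.7.0.
If the three dependency packages libpng, libjpeg, and libtiff are not installed, you will also run into the following error: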

Tesseract Open Source OCR Engine v3.05.00 with Leptonica
Error in pixReadStreamJpeg: function not present
Error in pixReadStream: jpeg: no pix returned
Error in pixRead: pix not read
Error during processing.

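After installing the three dependency packages, Leptonica must be recompiled, because it detects these image libraries when it is built.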
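One way to catch this problem before running OCR is to look at the output of `tesseract --version`, which lists the image libraries Leptonica was built with. The following is a minimal sketch of such a check. The class name `LeptonicaSupportCheck`, the method name, and the sample text are illustrative assumptions, and the exact output format may differ between versions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Tesseract or the original post.
class LeptonicaSupportCheck {

    // The three image libraries the article says must be present when Leptonica is built.
    static final List<String> REQUIRED = Arrays.asList("libjpeg", "libpng", "libtiff");

    /** Returns the required libraries that are not mentioned in the given version output. */
    static List<String> missingLibraries(String versionOutput) {
        String text = versionOutput.toLowerCase(Locale.ROOT);
        List<String> missing = new ArrayList<>();
        for (String lib : REQUIRED) {
            if (!text.contains(lib)) {
                missing.add(lib);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample text shaped like `tesseract --version` output (an assumption;
        // the library version numbers are made up for the example).
        String sample = "tesseract 3.05.01\n leptonica-1.74.4\n  libpng 1.5.13 : zlib 1.2.7\n";
        System.out.println("Missing image libraries: " + missingLibraries(sample));
    }
}
```

If the returned list is not empty, install the corresponding -devel packages and rebuild Leptonica before rebuilding Tesseract.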

Common errors

https://github.com/tesseract-ocr/tesseract/wiki/Compiling

Installing Tesseract on Windows

https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-for-windows
Just run the installer. Be sure to select the Simplified and Traditional Chinese language packs.
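To confirm that the Chinese language packs were actually installed, the output of `tesseract --list-langs` can be checked for `chi_sim` and `chi_tra`. The sketch below only parses text that has already been captured from that command. The class and method names are hypothetical, and the sample output is an assumption about the general shape (a header line followed by one language code per line).

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Tesseract or the original post.
class LanguagePackCheck {

    /** Extracts language codes from text shaped like `tesseract --list-langs` output. */
    static Set<String> parseLanguages(String listLangsOutput) {
        Set<String> langs = new LinkedHashSet<>();
        for (String line : listLangsOutput.split("\\R")) {
            String trimmed = line.trim();
            // Skip blank lines and the header line, which contains spaces.
            if (trimmed.isEmpty() || trimmed.contains(" ")) {
                continue;
            }
            langs.add(trimmed);
        }
        return langs;
    }

    public static void main(String[] args) {
        // Assumed sample output; a real installation will list its own languages.
        String sample = "List of available languages (3):\nchi_sim\nchi_tra\neng\n";
        Set<String> langs = parseLanguages(sample);
        System.out.println("Simplified Chinese installed: " + langs.contains("chi_sim"));
        System.out.println("Traditional Chinese installed: " + langs.contains("chi_tra"));
    }
}
```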

Calling Tesseract from Java to recognize text in images


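    // Imports needed by the two methods below (IOUtils comes from Apache Commons IO).
    // OS_LINUX, TESSERACT_PATH, UTF8, and logger are members of the enclosing class
    // and are not shown in this excerpt.
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.List;
    import org.apache.commons.io.IOUtils;

    /**
     * Checks whether the current operating system is Linux (as opposed to Windows).
     * OS_LINUX must be an upper-case string, because os.name is upper-cased
     * before the comparison.
     * @author eko.zhan, 2017-12-19 18:20:54
     * @return true if running on Linux
     */
    private boolean isLinux(){
        String defaultOS = System.getProperty("os.name").toUpperCase();
        return defaultOS.contains(OS_LINUX);
    }
    /**
     * Runs tesseract to recognize the text in an image.
     * @author eko.zhan, 2017-12-19 18:22:49
     * @param inputPath path of the image to recognize
     * @param outputPath output base path; tesseract appends ".txt" to it
     * @throws IOException
     * @throws InterruptedException
     */
    private void runCmd(String inputPath, String outputPath) throws IOException, InterruptedException{
        String command = null;
        Process process = null;
        if (isLinux()){
            // On Linux, tesseract is expected to be on the PATH and is run through the shell.
            command = "tesseract " + inputPath + " " + outputPath + " -l chi_sim";
            process = Runtime.getRuntime().exec(new String[]{"/bin/sh", "-c", command});
        }else{
            // On Windows, tesseract is run from its installation directory.
            command = TESSERACT_PATH + "/tesseract " + inputPath + " " + outputPath + " -l chi_sim";
            process = Runtime.getRuntime().exec(command);
        }

        // Read whatever tesseract prints to standard output and log it.
        InputStream inputStream = process.getInputStream();
        List<String> list = IOUtils.readLines(inputStream, UTF8);
        StringBuilder result = new StringBuilder();
        for (String s : list) {
            result.append(s);
        }
        logger.debug(result.toString());
        process.waitFor();
    }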
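Because the commands above are assembled by string concatenation, a path containing spaces would be split into several arguments. As a possible alternative, the sketch below passes the arguments as a list through `ProcessBuilder` and then reads the text file that tesseract writes (the output base plus ".txt"). The class name `TesseractRunner`, its method names, and the file names in `main` are assumptions made for illustration, not the original author's code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical alternative for illustration; not the original author's code.
class TesseractRunner {

    /** Builds the argument list; each element reaches the process as one argument, so no shell quoting is needed. */
    static List<String> buildCommand(String executable, String inputPath, String outputBase, String lang) {
        return Arrays.asList(executable, inputPath, outputBase, "-l", lang);
    }

    /** Runs tesseract and returns the recognized text read from outputBase + ".txt". */
    static String recognize(String executable, String inputPath, String outputBase, String lang)
            throws IOException, InterruptedException {
        // inheritIO() sends tesseract's own messages to this process's console,
        // so they cannot fill a pipe buffer and stall the child process.
        Process process = new ProcessBuilder(buildCommand(executable, inputPath, outputBase, lang))
                .inheritIO()
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        Path textFile = Paths.get(outputBase + ".txt");
        return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            // Usage: java TesseractRunner <image> <outputBase>; requires tesseract on the PATH.
            System.out.println(recognize("tesseract", args[0], args[1], "chi_sim"));
        } else {
            // Without arguments, only show the command that would be run.
            System.out.println(buildCommand("tesseract", "input.png", "output", "chi_sim"));
        }
    }
}
```

On Windows, the first argument would be the full path to the tesseract executable instead of the bare command name.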
