Solr: Tika with an OCR engine

I want to parse the content of a JPG image, not just its metadata.

The following test class does this with Tika's TesseractOCRParser:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.ocr.TesseractOCRParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class JpegParse {

    public static void main(final String[] args) throws IOException, SAXException, TikaException, InterruptedException {
        File file = new File("/path/to/menu.jpg");

        // Collects the text that OCR extracts from the image.
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        TesseractOCRConfig config = new TesseractOCRConfig();
        // Must name a <language>.traineddata file in tessdata. Tesseract's stock
        // Chinese models are chi_sim (Simplified) and chi_tra (Traditional).
        config.setLanguage("chi_sim");

        // Directory that holds both the tesseract command and the tessdata directory.
        config.setTesseractPath("/path/to/tesseract-ocr");
        pcontext.set(TesseractOCRConfig.class, config);

        TesseractOCRParser jpegParser = new TesseractOCRParser();
        pcontext.set(TesseractOCRParser.class, jpegParser);

        // try-with-resources closes the stream even if parsing fails.
        try (FileInputStream inputstream = new FileInputStream(file)) {
            jpegParser.parse(inputstream, handler, metadata, pcontext);
        }

        System.out.println("Metadata of the document:");
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
        System.out.println("Contents of the document:" + handler.toString());
    }
}

 

Note:

config.setTesseractPath("/path/to/tesseract-ocr");

The path passed here must be the parent directory that contains the tessdata directory.

The tesseract command must also be linked into this directory:

# ln -s /usr/local/bin/tesseract /path/to/tesseract-ocr
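The layout described in the note can be checked before handing the path to Tika. The following is only a minimal sketch that uses the Java standard library. The class TesseractDirCheck and its method findProblems are hypothetical names and not part of Tika. It assumes a Unix-like system where the command is called tesseract, and the default language chi_sim is also an assumption.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tika): checks the directory layout from the note above.
final class TesseractDirCheck {

    // Returns one message per missing item; an empty list means the layout looks usable.
    static List<String> findProblems(Path tesseractDir, String language) {
        List<String> problems = new ArrayList<>();

        // The tesseract command (or a symlink to it) must sit directly in the directory.
        // Files.exists follows symlinks, so a dangling link is reported as missing.
        if (!Files.exists(tesseractDir.resolve("tesseract"))) {
            problems.add("no tesseract command in " + tesseractDir);
        }

        // The tessdata directory must be a direct child of the same directory.
        Path tessdata = tesseractDir.resolve("tessdata");
        if (!Files.isDirectory(tessdata)) {
            problems.add("no tessdata directory in " + tesseractDir);
        } else if (!Files.isRegularFile(tessdata.resolve(language + ".traineddata"))) {
            // Tesseract loads a language from <language>.traineddata.
            problems.add("no " + language + ".traineddata in " + tessdata);
        }
        return problems;
    }

    public static void main(String[] args) {
        // Both defaults are placeholders; pass the real directory and language as arguments.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/path/to/tesseract-ocr");
        String language = args.length > 1 ? args[1] : "chi_sim";

        List<String> problems = findProblems(dir, language);
        if (problems.isEmpty()) {
            System.out.println("Layout looks usable: " + dir);
        } else {
            problems.forEach(p -> System.out.println("Problem: " + p));
        }
    }
}
```

Running it with the directory and the language code as arguments lists whatever is missing. An empty result only means the files are in place, not that Tesseract itself runs correctly.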

 

 

References

 

https://wiki.apache.org/tika/TikaOCR

http://www.kaiyuanba.cn/html/1/131/227/7891.htm
