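A few years ago an expert wrote the series "深入浅出 jackrabbit" (an in-depth yet accessible introduction to Jackrabbit), available at http://ahuaxuan.iteye.com/category/65829
I benefited greatly from reading it (without its help, working out how Jackrabbit works would probably have taken me much longer). That series is old, though: it was written against Jackrabbit 1.7, which differs somewhat from the current release. So, despite my own shallow understanding, I could not resist writing down my reading of the Apache Jackrabbit source code, in the hope of deepening my understanding of programming and perhaps helping those who come after.
(Note: this article may still be under revision; reposting it in its current state could mislead both you and your readers.)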
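In current versions, Jackrabbit extracts text from rich documents through Apache Tika, which differs from earlier versions.
The feature is implemented mainly by the LazyTextExtractorField class, which extends Lucene's abstract class AbstractField.
The source of LazyTextExtractorField is as follows: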
/**
 * <code>LazyTextExtractorField</code> implements a Lucene field with a String
 * value that is lazily initialized from a given {@link Reader}. In addition
 * this class provides a method to find out whether the purpose of the reader
 * is to extract text and whether the extraction process is already finished.
 *
 * @see #isExtractorFinished()
 */
public class LazyTextExtractorField extends AbstractField {

    /**
     * The logger instance for this class.
     */
    private static final Logger log =
        LoggerFactory.getLogger(LazyTextExtractorField.class);

    /**
     * The exception used to forcibly terminate the extraction process
     * when the maximum field length is reached.
     */
    private static final SAXException STOP =
        new SAXException("max field length reached");

    /**
     * The extracted text content of the given binary value.
     * Set to non-null when the text extraction task finishes.
     */
    private volatile String extract = null;

    /**
     * Creates a new <code>LazyTextExtractorField</code> with the given
     * <code>name</code>.
     *
     * @param name the name of the field.
     * @param reader the reader where to obtain the string from.
     * @param highlighting set to <code>true</code> to
     *                     enable result highlighting support
     */
    public LazyTextExtractorField(
            Parser parser, InternalValue value, Metadata metadata,
            Executor executor, boolean highlighting, int maxFieldLength) {
        super(FieldNames.FULLTEXT,
                highlighting ? Store.YES : Store.NO,
                Field.Index.ANALYZED,
                highlighting ? TermVector.WITH_OFFSETS : TermVector.NO);
        executor.execute(
                new ParsingTask(parser, value, metadata, maxFieldLength));
    }

    /**
     * Returns the extracted text. This method blocks until the text
     * extraction task has been completed.
     *
     * @return the string value of this field
     */
    public synchronized String stringValue() {
        try {
            while (!isExtractorFinished()) {
                wait();
            }
            return extract;
        } catch (InterruptedException e) {
            log.error("Text extraction thread was interrupted", e);
            return "";
        }
    }

    /**
     * @return always <code>null</code>
     */
    public Reader readerValue() {
        return null;
    }

    /**
     * @return always <code>null</code>
     */
    public byte[] binaryValue() {
        return null;
    }

    /**
     * @return always <code>null</code>
     */
    public TokenStream tokenStreamValue() {
        return null;
    }

    /**
     * Checks whether the text extraction task has finished.
     *
     * @return <code>true</code> if the extracted text is available
     */
    public boolean isExtractorFinished() {
        return extract != null;
    }

    private synchronized void setExtractedText(String value) {
        extract = value;
        notify();
    }

    /**
     * Releases all resources associated with this field.
     */
    public void dispose() {
        // TODO: Cause the ContentHandler below to throw an exception
    }

    /**
     * The background task for extracting text from a binary value.
     */
    private class ParsingTask extends DefaultHandler implements Runnable {

        private final Parser parser;

        private final InternalValue value;

        private final Metadata metadata;

        private final int maxFieldLength;

        private final StringBuilder builder = new StringBuilder();

        public ParsingTask(
                Parser parser, InternalValue value, Metadata metadata,
                int maxFieldLength) {
            this.parser = parser;
            this.value = value;
            this.metadata = metadata;
            this.maxFieldLength = maxFieldLength;
        }

        public void run() {
            try {
                InputStream stream = value.getStream();
                try {
                    parser.parse(stream, this, metadata, new ParseContext());
                } finally {
                    stream.close();
                }
            } catch (Throwable t) {
                if (t != STOP) {
                    log.warn("Failed to extract text from a binary property", t);
                }
            } finally {
                value.discard();
            }
            setExtractedText(builder.toString());
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            builder.append(
                    ch, start,
                    Math.min(length, maxFieldLength - builder.length()));
            if (builder.length() >= maxFieldLength) {
                throw STOP;
            }
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length)
                throws SAXException {
            characters(ch, start, length);
        }

    }

}
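As the code shows, text extraction for rich documents is handled by ParsingTask, a Runnable that the constructor hands to an Executor, so extraction runs asynchronously; stringValue() blocks until the extracted text is available.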
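The asynchronous hand-off can be illustrated with a small standalone sketch. It is not Jackrabbit code: the names LazyValueSketch, LazyValue, publish and demo are hypothetical, and the "extraction" is replaced by a trivial upper-casing step. It only mirrors the shape of the field class: a task is submitted to an Executor in the constructor, and the reader blocks with wait() until the result has been published.

```java
import java.util.Locale;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class LazyValueSketch {

    /** Hypothetical stand-in for a field whose value is computed in the background. */
    static class LazyValue {

        // Non-null once the background task has finished.
        private volatile String value = null;

        LazyValue(Executor executor, String input) {
            // Placeholder for the expensive extraction step.
            executor.execute(() -> publish(input.toUpperCase(Locale.ROOT)));
        }

        boolean isFinished() {
            return value != null;
        }

        /** Blocks until the background task has published its result. */
        synchronized String get() {
            try {
                while (!isFinished()) {
                    wait();
                }
                return value;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return "";
            }
        }

        private synchronized void publish(String result) {
            value = result;
            notifyAll();
        }
    }

    /** Runs one background computation and waits for its result. */
    static String demo(String input) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            return new LazyValue(executor, input).get();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("hello"));
    }
}
```

The sketch uses notifyAll() rather than notify() so that several waiting readers would all be woken; that is a choice made for the sketch, not a statement about the Jackrabbit class.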
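ParsingTask also extends DefaultHandler, which implements the four interfaces EntityResolver, DTDHandler, ContentHandler, and ErrorHandler. This is the default adapter pattern: DefaultHandler supplies no-op implementations, so a subclass only has to override the callbacks of the target interface that it actually needs.
In JAXP, SAX parsing of XML is event-driven. The most important of the interfaces above is ContentHandler; ParsingTask implements it indirectly and appends the text it receives, piece by piece, to the field private final StringBuilder builder = new StringBuilder().
At the end of run(), the task publishes the collected text by calling setExtractedText(builder.toString()).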
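To make the callback-and-accumulate idea concrete, here is a minimal sketch that uses only the JDK's own SAX parser instead of Tika. The class names BoundedTextHandlerSketch and BoundedTextHandler and the method extract are hypothetical. It extends DefaultHandler, overrides only characters(), and stops parsing early by throwing a shared sentinel exception once a length limit is reached, in the same spirit as STOP in the class above. It assumes the JDK parser rethrows the handler's SAXException instance unchanged.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BoundedTextHandlerSketch {

    /** Sentinel used to abort parsing; recognised by identity, not by message. */
    private static final SAXException STOP = new SAXException("limit reached");

    /** Collects character data up to maxLength characters. */
    static class BoundedTextHandler extends DefaultHandler {

        private final int maxLength;
        private final StringBuilder builder = new StringBuilder();

        BoundedTextHandler(int maxLength) {
            this.maxLength = maxLength;
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            // Append at most the number of characters still allowed.
            builder.append(ch, start,
                    Math.min(length, maxLength - builder.length()));
            if (builder.length() >= maxLength) {
                throw STOP;
            }
        }

        String text() {
            return builder.toString();
        }
    }

    /** Parses the XML string and returns at most maxLength characters of text. */
    static String extract(String xml, int maxLength) {
        BoundedTextHandler handler = new BoundedTextHandler(maxLength);
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (e != STOP) {
                throw new IllegalStateException("XML parsing failed", e);
            }
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.println(extract("<doc><p>Hello</p><p>World</p></doc>", 7));
    }
}
```

Throwing a preallocated exception avoids building a stack trace for what is really a normal exit path, which is presumably why the original class keeps STOP as a static constant.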
Note that the parser object used here is not one of Apache Tika's own classes: Jackrabbit wraps Tika in its own JackrabbitParser class.
The source of JackrabbitParser is as follows:
/**
 * Jackrabbit wrapper for Tika parsers. Uses a Tika {@link AutoDetectParser}
 * for all parsing requests, but sets it up with Jackrabbit-specific
 * configuration and implements backwards compatibility support for old
 * <code>textExtractorClasses</code> configurations.
 *
 * @since Apache Jackrabbit 2.0
 */
class JackrabbitParser implements Parser {

    /**
     * Logger instance.
     */
    private static final Logger logger =
        LoggerFactory.getLogger(JackrabbitParser.class);

    /**
     * Flag for blocking all text extraction. Used by the Jackrabbit test suite.
     */
    private static volatile boolean blocked = false;

    /**
     * The configured Tika parser.
     */
    private final AutoDetectParser parser;

    /**
     * Creates a parser using the default Jackrabbit-specific configuration
     * settings.
     */
    public JackrabbitParser() {
        InputStream stream =
            JackrabbitParser.class.getResourceAsStream("tika-config.xml");
        try {
            if (stream != null) {
                try {
                    parser = new AutoDetectParser(new TikaConfig(stream));
                } finally {
                    stream.close();
                }
            } else {
                parser = new AutoDetectParser();
            }
        } catch (Exception e) {
            // Should never happen
            throw new RuntimeException(
                    "Unable to load embedded Tika configuration", e);
        }
    }

    /**
     * Backwards compatibility method to support old Jackrabbit 1.x
     * <code>textExtractorClasses</code> configurations. Implements a best
     * effort mapping from the old-style text extractor classes to
     * corresponding Tika parsers.
     *
     * @param classes configured list of text extractor classes
     */
    public void setTextFilterClasses(String classes) {
        Map<MediaType, Parser> parsers = new HashMap<MediaType, Parser>();

        StringTokenizer tokenizer = new StringTokenizer(classes, ", \t\n\r\f");
        while (tokenizer.hasMoreTokens()) {
            String name = tokenizer.nextToken();
            if (name.equals(
                    "org.apache.jackrabbit.extractor.HTMLTextExtractor")) {
                parsers.put(MediaType.text("html"), new HtmlParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.MsExcelTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-excel"), parser);
                parsers.put(MediaType.application("msexcel"), parser);
                parsers.put(MediaType.application("excel"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsOutlookTextExtractor")) {
                parsers.put(MediaType.application("vnd.ms-outlook"), new OfficeParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.MsPowerPointExtractor")
                    || name.equals("org.apache.jackrabbit.extractor.MsPowerPointTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                parsers.put(MediaType.application("mspowerpoint"), parser);
                parsers.put(MediaType.application("powerpoint"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsWordTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-word"), parser);
                parsers.put(MediaType.application("msword"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-word"), parser);
                parsers.put(MediaType.application("msword"), parser);
                parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                parsers.put(MediaType.application("mspowerpoint"), parser);
                parsers.put(MediaType.application("vnd.ms-excel"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.document"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.presentationml.presentation"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.OpenOfficeTextExtractor")) {
                Parser parser = new OpenDocumentParser();
                parsers.put(MediaType.application("vnd.oasis.opendocument.database"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.formula"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.graphics"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.presentation"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.spreadsheet"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.text"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.calc"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.draw"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.impress"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.writer"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.PdfTextExtractor")) {
                parsers.put(MediaType.application("pdf"), new PDFParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.PlainTextExtractor")) {
                parsers.put(MediaType.TEXT_PLAIN, new TXTParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.PngTextExtractor")) {
                Parser parser = new ImageParser();
                parsers.put(MediaType.image("png"), parser);
                parsers.put(MediaType.image("apng"), parser);
                parsers.put(MediaType.image("mng"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.RTFTextExtractor")) {
                Parser parser = new RTFParser();
                parsers.put(MediaType.application("rtf"), parser);
                parsers.put(MediaType.text("rtf"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.XMLTextExtractor")) {
                Parser parser = new XMLParser();
                parsers.put(MediaType.APPLICATION_XML, parser);
                parsers.put(MediaType.text("xml"), parser);
            } else {
                logger.warn("Ignoring unknown text extractor class: {}", name);
            }
        }

        parser.setParsers(parsers);
    }

    /**
     * Delegates the call to the configured {@link AutoDetectParser}.
     */
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return parser.getSupportedTypes(context);
    }

    /**
     * Delegates the call to the configured {@link AutoDetectParser}.
     */
    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        waitIfBlocked();
        parser.parse(stream, handler, metadata, context);
    }

    public void parse(
            InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException, TikaException {
        parse(stream, handler, metadata, new ParseContext());
    }

    /**
     * Waits until text extraction is no longer blocked. The block is only
     * ever activated in the Jackrabbit test suite when testing delayed
     * text extraction.
     *
     * @throws TikaException if the block was interrupted
     */
    private synchronized static void waitIfBlocked() throws TikaException {
        try {
            while (blocked) {
                JackrabbitParser.class.wait();
            }
        } catch (InterruptedException e) {
            throw new TikaException("Text extraction block interrupted", e);
        }
    }

    /**
     * Blocks all text extraction tasks.
     */
    static synchronized void block() {
        blocked = true;
    }

    /**
     * Unblocks all text extraction tasks.
     */
    static synchronized void unblock() {
        blocked = false;
        JackrabbitParser.class.notifyAll();
    }

}
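The actual parsing is delegated to AutoDetectParser. As described in my earlier notes on the Apache Tika source code, AutoDetectParser extends CompositeParser, and CompositeParser does the concrete parsing by calling into the collection of Parser instances it holds. This is the composite pattern (the top-down, "safe" variant of it).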
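The delegation can be sketched without Tika. Everything in the block below is hypothetical and simplified: MiniParser stands in for a parser interface, and CompositeMiniParser picks a child by a plain media type string. The child-management method register() exists only on the composite, not on the component interface, which is what makes this the "safe" variant of the pattern. The fallback parser is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

class CompositeParserSketch {

    /** Hypothetical component interface; much simpler than Tika's Parser. */
    interface MiniParser {
        String parse(String mediaType, String content);
    }

    /** Composite: is itself a MiniParser and delegates to a registered child. */
    static class CompositeMiniParser implements MiniParser {

        private final Map<String, MiniParser> parsers = new HashMap<>();
        private final MiniParser fallback;

        CompositeMiniParser(MiniParser fallback) {
            this.fallback = fallback;
        }

        // Child management lives only on the composite ("safe" composite).
        void register(String mediaType, MiniParser parser) {
            parsers.put(mediaType, parser);
        }

        @Override
        public String parse(String mediaType, String content) {
            MiniParser delegate = parsers.getOrDefault(mediaType, fallback);
            return delegate.parse(mediaType, content);
        }
    }

    /** Builds a small composite with two leaf parsers and an empty fallback. */
    static MiniParser newDemoParser() {
        CompositeMiniParser composite =
                new CompositeMiniParser((type, content) -> "");
        composite.register("text/plain", (type, content) -> content.trim());
        composite.register("text/html",
                (type, content) -> content.replaceAll("<[^>]*>", ""));
        return composite;
    }

    public static void main(String[] args) {
        MiniParser parser = newDemoParser();
        System.out.println(parser.parse("text/html", "<b>hi</b>"));
        System.out.println(parser.parse("text/plain", "  hi  "));
    }
}
```

Callers only see the MiniParser interface, so a single leaf parser and a whole tree of parsers can be used interchangeably.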
---------------------------------------------------------------------------
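This Apache Jackrabbit source code study series is my own original work.
When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯
Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/03/2997156.html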