Apache Jackrabbit Source Code Study (Part 1)

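A few years ago an expert wrote the series "深入浅出 jackrabbit" (Jackrabbit in Depth, Made Simple), available at http://ahuaxuan.iteye.com/category/65829

I learned a great deal from it; without its help I would probably have spent much longer groping toward an understanding of Jackrabbit. That series is now quite old, though: it covered Jackrabbit 1.7, which differs in places from the current release. So, despite my own shallow understanding, I could not resist writing down my reading of the Apache Jackrabbit source, hoping to deepen my grasp of programming and perhaps help those who come after.

(Note: this article may still be under revision; reposting it in its current state would harm both you and your readers.)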

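In the current version, Jackrabbit extracts text from rich documents through Apache Tika, which differs from earlier versions.

The feature is implemented mainly by the LazyTextExtractorField class, which extends Lucene's abstract class AbstractField.

The source of LazyTextExtractorField is as follows: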

/**
 * <code>LazyTextExtractorField</code> implements a Lucene field with a String
 * value that is lazily initialized from a given {@link Reader}. In addition
 * this class provides a method to find out whether the purpose of the reader
 * is to extract text and whether the extraction process is already finished.
 *
 * @see #isExtractorFinished()
 */
public class LazyTextExtractorField extends AbstractField {

    /**
     * The logger instance for this class.
     */
    private static final Logger log =
        LoggerFactory.getLogger(LazyTextExtractorField.class);

    /**
     * The exception used to forcibly terminate the extraction process
     * when the maximum field length is reached.
     */
    private static final SAXException STOP =
        new SAXException("max field length reached");

    /**
     * The extracted text content of the given binary value.
     * Set to non-null when the text extraction task finishes.
     */
    private volatile String extract = null;

    /**
     * Creates a new <code>LazyTextExtractorField</code> with the given
     * <code>name</code>.
     *
     * @param name the name of the field.
     * @param reader the reader where to obtain the string from.
     * @param highlighting set to <code>true</code> to
     *                     enable result highlighting support
     */
    public LazyTextExtractorField(
            Parser parser, InternalValue value, Metadata metadata,
            Executor executor, boolean highlighting, int maxFieldLength) {
        super(FieldNames.FULLTEXT,
                highlighting ? Store.YES : Store.NO,
                Field.Index.ANALYZED,
                highlighting ? TermVector.WITH_OFFSETS : TermVector.NO);
        executor.execute(
                new ParsingTask(parser, value, metadata, maxFieldLength));
    }

    /**
     * Returns the extracted text. This method blocks until the text
     * extraction task has been completed.
     *
     * @return the string value of this field
     */
    public synchronized String stringValue() {
        try {
            while (!isExtractorFinished()) {
                wait();
            }
            return extract;
        } catch (InterruptedException e) {
            log.error("Text extraction thread was interrupted", e);
            return "";
        }
    }

    /**
     * @return always <code>null</code>
     */
    public Reader readerValue() {
        return null;
    }

    /**
     * @return always <code>null</code>
     */
    public byte[] binaryValue() {
        return null;
    }

    /**
     * @return always <code>null</code>
     */
    public TokenStream tokenStreamValue() {
        return null;
    }

    /**
     * Checks whether the text extraction task has finished.
     *
     * @return <code>true</code> if the extracted text is available
     */
    public boolean isExtractorFinished() {
        return extract != null;
    }

    private synchronized void setExtractedText(String value) {
        extract = value;
        notify();
    }

    /**
     * Releases all resources associated with this field.
     */
    public void dispose() {
        // TODO: Cause the ContentHandler below to throw an exception
    }

    /**
     * The background task for extracting text from a binary value.
     */
    private class ParsingTask extends DefaultHandler implements Runnable {

        private final Parser parser;

        private final InternalValue value;

        private final Metadata metadata;

        private final int maxFieldLength;

        private final StringBuilder builder = new StringBuilder();

        public ParsingTask(
                Parser parser, InternalValue value, Metadata metadata,
                int maxFieldLength) {
            this.parser = parser;
            this.value = value;
            this.metadata = metadata;
            this.maxFieldLength = maxFieldLength;
        }

        public void run() {
            try {
                InputStream stream = value.getStream();
                try {
                    parser.parse(stream, this, metadata, new ParseContext());
                } finally {
                    stream.close();
                }
            } catch (Throwable t) {
                if (t != STOP) {
                    log.warn("Failed to extract text from a binary property", t);
                }
            } finally {
                value.discard();
            }
            setExtractedText(builder.toString());
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            builder.append(
                    ch, start,
                    Math.min(length, maxFieldLength - builder.length()));
            if (builder.length() >= maxFieldLength) {
                throw STOP;
            }
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length)
                throws SAXException {
            characters(ch, start, length);
        }

    }

}

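As the code shows, text extraction for rich documents is handled by the ParsingTask class, a Runnable that the constructor hands to an Executor, so extraction runs asynchronously.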
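The asynchronous hand-off can be illustrated with a minimal sketch. It is not Jackrabbit code: LazyText and LazyTextDemo are hypothetical names, and the "extraction" is just a trim() on a string. It only shows the shape of the mechanism: the constructor submits the work to an Executor and returns at once, while a reader blocks in wait() until the background task publishes a non-null result.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for a lazily extracted field value. */
class LazyText {

    // Becomes non-null once the background task has finished.
    private volatile String text = null;

    LazyText(Executor executor, final String rawInput) {
        // Submit the slow work; the constructor itself returns immediately.
        executor.execute(new Runnable() {
            public void run() {
                String result = "";
                try {
                    result = rawInput.trim(); // placeholder for real text extraction
                } finally {
                    setText(result); // always publish, so waiting readers are released
                }
            }
        });
    }

    boolean isFinished() {
        return text != null;
    }

    /** Blocks until the background task has published its result. */
    synchronized String get() {
        try {
            while (!isFinished()) {
                wait();
            }
            return text;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return "";
        }
    }

    private synchronized void setText(String value) {
        text = value;
        notifyAll(); // wake every waiting reader
    }
}

class LazyTextDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            LazyText lazy = new LazyText(pool, "  hello jackrabbit ");
            System.out.println(lazy.get());
        } finally {
            pool.shutdown();
        }
    }
}
```

Two choices here are my own rather than the original's: the sketch calls notifyAll() where LazyTextExtractorField calls notify(), and it restores the interrupt flag when the wait is interrupted.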

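This task class also extends DefaultHandler. DefaultHandler implements four interfaces (EntityResolver, DTDHandler, ContentHandler, and ErrorHandler) with empty methods; this is the default adapter pattern, which lets us implement the target interface by overriding only the methods we care about.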
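A small sketch of the default adapter idea, using only the JDK's built-in SAX parser rather than Tika. TextCollector is a hypothetical class: it overrides the single callback it needs, characters(), and inherits no-op implementations of everything else from DefaultHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that collects character data and ignores all other events. */
class TextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    // The only callback overridden; startElement, endElement, error, etc.
    // keep the empty defaults inherited from DefaultHandler.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    /** Parses the given XML string and returns its concatenated character data. */
    static String extract(String xml) {
        TextCollector collector = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return collector.text.toString();
    }
}
```

For example, TextCollector.extract("<doc><title>Hello</title><body>, world</body></doc>") returns the element text joined together, without the class having to implement any of the other handler methods.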

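XML parsing through SAX, as specified in JAXP, follows an event-listener model. The most important of the interfaces above is ContentHandler, which ParsingTask implements indirectly; it appends each piece of text it receives to the private final StringBuilder builder = new StringBuilder() field.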
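The length cap in characters() can be tried out in isolation. The sketch below is an assumption-laden miniature, not the Jackrabbit class: CappedTextHandler is a hypothetical name, and it drives the JDK SAX parser directly instead of a Tika parser. Like the original, it throws one shared SAXException to abort parsing once the limit is reached and recognizes that exception by identity afterwards.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that stops parsing after maxLength characters. */
class CappedTextHandler extends DefaultHandler {

    // Shared marker exception used only to abort parsing early.
    private static final SAXException STOP = new SAXException("max length reached");

    private final int maxLength;
    private final StringBuilder builder = new StringBuilder();

    CappedTextHandler(int maxLength) {
        this.maxLength = maxLength;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Append at most the number of characters still allowed.
        builder.append(ch, start, Math.min(length, maxLength - builder.length()));
        if (builder.length() >= maxLength) {
            throw STOP;
        }
    }

    /** Returns at most maxLength characters of text from the given XML string. */
    static String extract(String xml, int maxLength) {
        CappedTextHandler handler = new CappedTextHandler(maxLength);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            if (e != STOP) { // anything other than our own marker is a real failure
                throw new IllegalStateException("XML parsing failed", e);
            }
        }
        return handler.builder.toString();
    }
}
```

Throwing an exception is the usual way to stop a SAX parse from inside a callback, since the callbacks have no return value to signal "enough".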

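At the end of run(), the task publishes the collected text by calling setExtractedText(builder.toString()).

Note that the parser object here is not a class taken directly from Apache Tika; Jackrabbit wraps Tika in its own JackrabbitParser class.

The source of JackrabbitParser is as follows: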

/**
 * Jackrabbit wrapper for Tika parsers. Uses a Tika {@link AutoDetectParser}
 * for all parsing requests, but sets it up with Jackrabbit-specific
 * configuration and implements backwards compatibility support for old
 * <code>textExtractorClasses</code> configurations.
 *
 * @since Apache Jackrabbit 2.0
 */
class JackrabbitParser implements Parser {

    /**
     * Logger instance.
     */
    private static final Logger logger =
        LoggerFactory.getLogger(JackrabbitParser.class);

    /**
     * Flag for blocking all text extraction. Used by the Jackrabbit test suite.
     */
    private static volatile boolean blocked = false;

    /**
     * The configured Tika parser.
     */
    private final AutoDetectParser parser;

    /**
     * Creates a parser using the default Jackrabbit-specific configuration
     * settings.
     */
    public JackrabbitParser() {
        InputStream stream =
            JackrabbitParser.class.getResourceAsStream("tika-config.xml");
        try {
            if (stream != null) {
                try {
                    parser = new AutoDetectParser(new TikaConfig(stream));
                } finally {
                    stream.close();
                }
            } else {
                parser = new AutoDetectParser();
            }
        } catch (Exception e) {
            // Should never happen
            throw new RuntimeException(
                    "Unable to load embedded Tika configuration", e);
        }
    }

    /**
     * Backwards compatibility method to support old Jackrabbit 1.x
     * <code>textExtractorClasses</code> configurations. Implements a best
     * effort mapping from the old-style text extractor classes to
     * corresponding Tika parsers.
     *
     * @param classes configured list of text extractor classes
     */
    public void setTextFilterClasses(String classes) {
        Map<MediaType, Parser> parsers = new HashMap<MediaType, Parser>();

        StringTokenizer tokenizer = new StringTokenizer(classes, ", \t\n\r\f");
        while (tokenizer.hasMoreTokens()) {
            String name = tokenizer.nextToken();
            if (name.equals(
                    "org.apache.jackrabbit.extractor.HTMLTextExtractor")) {
                parsers.put(MediaType.text("html"), new HtmlParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.MsExcelTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-excel"), parser);
                parsers.put(MediaType.application("msexcel"), parser);
                parsers.put(MediaType.application("excel"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsOutlookTextExtractor")) {
                parsers.put(MediaType.application("vnd.ms-outlook"), new OfficeParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.MsPowerPointExtractor")
                    || name.equals("org.apache.jackrabbit.extractor.MsPowerPointTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                parsers.put(MediaType.application("mspowerpoint"), parser);
                parsers.put(MediaType.application("powerpoint"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsWordTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-word"), parser);
                parsers.put(MediaType.application("msword"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-word"), parser);
                parsers.put(MediaType.application("msword"), parser);
                parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                parsers.put(MediaType.application("mspowerpoint"), parser);
                parsers.put(MediaType.application("vnd.ms-excel"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.document"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.presentationml.presentation"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.OpenOfficeTextExtractor")) {
                Parser parser = new OpenDocumentParser();
                parsers.put(MediaType.application("vnd.oasis.opendocument.database"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.formula"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.graphics"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.presentation"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.spreadsheet"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.text"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.calc"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.draw"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.impress"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.writer"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.PdfTextExtractor")) {
                parsers.put(MediaType.application("pdf"), new PDFParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.PlainTextExtractor")) {
                parsers.put(MediaType.TEXT_PLAIN, new TXTParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.PngTextExtractor")) {
                Parser parser = new ImageParser();
                parsers.put(MediaType.image("png"), parser);
                parsers.put(MediaType.image("apng"), parser);
                parsers.put(MediaType.image("mng"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.RTFTextExtractor")) {
                Parser parser = new RTFParser();
                parsers.put(MediaType.application("rtf"), parser);
                parsers.put(MediaType.text("rtf"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.XMLTextExtractor")) {
                Parser parser = new XMLParser();
                parsers.put(MediaType.APPLICATION_XML, parser);
                parsers.put(MediaType.text("xml"), parser);
            } else {
                logger.warn("Ignoring unknown text extractor class: {}", name);
            }
        }

        parser.setParsers(parsers);
    }

    /**
     * Delegates the call to the configured {@link AutoDetectParser}.
     */
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return parser.getSupportedTypes(context);
    }

    /**
     * Delegates the call to the configured {@link AutoDetectParser}.
     */
    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        waitIfBlocked();
        parser.parse(stream, handler, metadata, context);
    }

    public void parse(
            InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException, TikaException {
        parse(stream, handler, metadata, new ParseContext());
    }

    /**
     * Waits until text extraction is no longer blocked. The block is only
     * ever activated in the Jackrabbit test suite when testing delayed
     * text extraction.
     *
     * @throws TikaException if the block was interrupted
     */
    private synchronized static void waitIfBlocked() throws TikaException {
        try {
            while (blocked) {
                JackrabbitParser.class.wait();
            }
        } catch (InterruptedException e) {
            throw new TikaException("Text extraction block interrupted", e);
        }
    }

    /**
     * Blocks all text extraction tasks.
     */
    static synchronized void block() {
        blocked = true;
    }

    /**
     * Unblocks all text extraction tasks.
     */
    static synchronized void unblock() {
        blocked = false;
        JackrabbitParser.class.notifyAll();
    }

}

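The actual parsing is delegated to the AutoDetectParser class. As described in my earlier Apache Tika source code study, AutoDetectParser extends CompositeParser, and CompositeParser does the concrete parsing by calling the parsers in its collection of Parser instances. This is the Composite pattern, specifically the top-down "safe" variant, in which only the composite exposes the methods for managing children.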
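The "safe" Composite idea can be sketched in a few lines. Everything below is hypothetical and far simpler than Tika: MiniParser stands in for the component interface, and media types are plain strings. The point is that child management (register) lives only on the composite, while leaves and composite share the same parse operation, so a caller can treat the composite as just another parser.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical component interface, much simpler than Tika's Parser. */
interface MiniParser {
    String parse(String mediaType, String content);
}

/** Leaf: returns plain text unchanged. */
class PlainTextMiniParser implements MiniParser {
    public String parse(String mediaType, String content) {
        return content;
    }
}

/** Leaf: strips anything that looks like a tag (a crude illustration only). */
class TagStrippingMiniParser implements MiniParser {
    public String parse(String mediaType, String content) {
        return content.replaceAll("<[^>]*>", "");
    }
}

/** Leaf used when no parser is registered for a type. */
class EmptyMiniParser implements MiniParser {
    public String parse(String mediaType, String content) {
        return "";
    }
}

/** Composite: looks up a child by media type and delegates to it. */
class MiniCompositeParser implements MiniParser {

    private final Map<String, MiniParser> parsers = new HashMap<String, MiniParser>();
    private final MiniParser fallback;

    MiniCompositeParser(MiniParser fallback) {
        this.fallback = fallback;
    }

    // "Safe" variant: only the composite, not the MiniParser interface,
    // declares the child-management method.
    void register(String mediaType, MiniParser parser) {
        parsers.put(mediaType, parser);
    }

    public String parse(String mediaType, String content) {
        MiniParser parser = parsers.get(mediaType);
        return (parser != null ? parser : fallback).parse(mediaType, content);
    }
}

class MiniCompositeDemo {
    public static void main(String[] args) {
        MiniCompositeParser composite = new MiniCompositeParser(new EmptyMiniParser());
        composite.register("text/plain", new PlainTextMiniParser());
        composite.register("text/html", new TagStrippingMiniParser());
        System.out.println(composite.parse("text/html", "<p>Hi</p>"));
    }
}
```

In this sketch the caller supplies the media type; an auto-detecting parser would work it out from the content before delegating, which is the part this example leaves out.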

---------------------------------------------------------------------------

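This Apache Jackrabbit source code study series is my own original work.

When reposting, please credit the source: cnblogs (博客园), 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/03/2997156.html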
