Apache Tika Source Code Study (Part 2)

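The previous post analyzed the interfaces and implementation classes that Apache Tika uses for encoding detection.

This post moves on to ParseContext, a key class in Apache Tika. It helps to first understand how Tika parses documents: Tika turns every file into an XHTML document and then handles that XHTML through SAX's event-based model. Here is the source of the ParseContext class: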

package org.apache.tika.parser;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.apache.tika.exception.TikaException;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

public class ParseContext implements Serializable {

    /** Serial version UID. */
    private static final long serialVersionUID = -5921436862145826534L;

    /** Map of objects in this context */
    private final Map<String, Object> context = new HashMap<String, Object>();

    /**
     * Adds the given value to the context as an implementation of the given
     * interface.
     *
     * @param key the interface implemented by the given value
     * @param value the value to be added, or <code>null</code> to remove
     */
    public <T> void set(Class<T> key, T value) {
        if (value != null) {
            context.put(key.getName(), value);
        } else {
            context.remove(key.getName());
        }
    }

    /**
     * Returns the object in this context that implements the given interface.
     *
     * @param key the interface implemented by the requested object
     * @return the object that implements the given interface,
     *         or <code>null</code> if not found
     */
    @SuppressWarnings("unchecked")
    public <T> T get(Class<T> key) {
        return (T) context.get(key.getName());
    }

    /**
     * Returns the object in this context that implements the given interface,
     * or the given default value if such an object is not found.
     *
     * @param key the interface implemented by the requested object
     * @param defaultValue value to return if the requested object is not found
     * @return the object that implements the given interface,
     *         or the given default value if not found
     */
    public <T> T get(Class<T> key, T defaultValue) {
        T value = get(key);
        if (value != null) {
            return value;
        } else {
            return defaultValue;
        }
    }

    /**
     * Returns the SAX parser specified in this parsing context. If a parser
     * is not explicitly specified, then one is created using the specified
     * or the default SAX parser factory.
     *
     * @see #getSAXParserFactory()
     * @since Apache Tika 0.8
     * @return SAX parser
     * @throws TikaException if a SAX parser could not be created
     */
    public SAXParser getSAXParser() throws TikaException {
        SAXParser parser = get(SAXParser.class);
        if (parser != null) {
            return parser;
        } else {
            try {
                return getSAXParserFactory().newSAXParser();
            } catch (ParserConfigurationException e) {
                throw new TikaException("Unable to configure a SAX parser", e);
            } catch (SAXException e) {
                throw new TikaException("Unable to create a SAX parser", e);
            }
        }
    }

    /**
     * Returns the SAX parser factory specified in this parsing context.
     * If a factory is not explicitly specified, then a default factory
     * instance is created and returned. The default factory instance is
     * configured to be namespace-aware and to use
     * {@link XMLConstants#FEATURE_SECURE_PROCESSING secure XML processing}.
     *
     * @since Apache Tika 0.8
     * @return SAX parser factory
     */
    public SAXParserFactory getSAXParserFactory() {
        SAXParserFactory factory = get(SAXParserFactory.class);
        if (factory == null) {
            factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            try {
                factory.setFeature(
                        XMLConstants.FEATURE_SECURE_PROCESSING, true);
            } catch (ParserConfigurationException e) {
            } catch (SAXNotSupportedException e) {
            } catch (SAXNotRecognizedException e) {
                // TIKA-271: Some XML parsers do not support the
                // secure-processing feature, even though it's required by
                // JAXP in Java 5. Ignoring the exception is fine here, as
                // deployments without this feature are inherently vulnerable
                // to XML denial-of-service attacks.
            }
        }
        return factory;
    }

}

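As the source shows, ParseContext is a small map of context objects keyed by class name, and its main job is to hand out the SAX parser class for XML, SAXParser.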
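To make the set/get behavior concrete, here is a minimal sketch. MiniContext is a hypothetical stand-in that copies only the map logic from the listing above, so the example runs without Tika on the classpath; it is not the Tika class itself. It shows that lookups use the exact class name given as the key, and that setting null removes the entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in mirroring ParseContext's set/get logic (not the Tika class).
class MiniContext {
    private final Map<String, Object> context = new HashMap<>();

    // Stores the value under the key's class name; a null value removes the entry.
    <T> void set(Class<T> key, T value) {
        if (value != null) {
            context.put(key.getName(), value);
        } else {
            context.remove(key.getName());
        }
    }

    @SuppressWarnings("unchecked")
    <T> T get(Class<T> key) {
        return (T) context.get(key.getName());
    }

    // Falls back to defaultValue when nothing is registered for the key.
    <T> T get(Class<T> key, T defaultValue) {
        T value = get(key);
        return value != null ? value : defaultValue;
    }
}

class MiniContextDemo {
    public static void main(String[] args) {
        MiniContext ctx = new MiniContext();

        // Register a String under the CharSequence interface.
        ctx.set(CharSequence.class, "hello");
        System.out.println(ctx.get(CharSequence.class));       // hello

        // The key is the class name, so asking for String finds nothing.
        System.out.println(ctx.get(String.class));             // null
        System.out.println(ctx.get(String.class, "fallback")); // fallback

        // Setting null removes the entry again.
        ctx.set(CharSequence.class, null);
        System.out.println(ctx.get(CharSequence.class));       // null
    }
}
```

Because the key is the class name rather than the runtime type of the value, a caller has to look an object up under the same interface it was registered with.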

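If you know JAXP, the source above is easy to follow. Tika processes XML-formatted documents the SAX way. SAXParserFactory is an abstract class; which concrete implementation is actually used remains to be analyzed.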
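One way to check which implementation is in play is to ask the factory for its class at runtime. The sketch below uses only standard JAXP calls and configures the factory the same way as the default branch of getSAXParserFactory(). The class name SaxFactoryDemo, its helper methods, and the sample XHTML string are made up for illustration. SAXParserFactory.newInstance() resolves the concrete class through the JAXP lookup mechanism (for example the javax.xml.parsers.SAXParserFactory system property, a service provider on the classpath, or the platform default), so the printed name depends on the JDK and classpath in use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxFactoryDemo {

    // Configured like the default branch of getSAXParserFactory():
    // namespace-aware, with secure processing enabled when supported.
    static SAXParserFactory newFactory() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        } catch (Exception e) {
            // This implementation does not support the feature; continue without it.
        }
        return factory;
    }

    // Parses the XHTML with SAX and records each element's local name
    // in the order the startElement events arrive.
    static List<String> elementNames(String xhtml) {
        final List<String> names = new ArrayList<>();
        try {
            SAXParser parser = newFactory().newSAXParser();
            parser.parse(new InputSource(new StringReader(xhtml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    names.add(localName);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // The concrete factory class varies by JDK and classpath.
        System.out.println(newFactory().getClass().getName());

        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>hi</p></body></html>";
        System.out.println(elementNames(xhtml)); // [html, body, p]
    }
}
```

The handler never sees a document tree; it only receives callbacks as elements open, which is the event-based style mentioned earlier.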

 
