There are many kinds of XML Parsers in Java:
- DOM (JDK embedded DOM implementation)
- SAX
- JDOM (It is an alternative to DOM and SAX)
- Digester (Jakarta commons Digester)
- JAXB (OXM; JDK 1.6 embeds a JAXB 2.0 implementation)
- dom4j
- Xerces
- KXML
- ...
In fact, you can find more by googling "xmlparser java". I only list the ones I know or have used in my previous projects. I will talk about them one by one, and in this post I would like to talk only about DOM + XPath.
As usual, why do I want to talk about such an "old" question? The fact is that I am refactoring a platform that parses OMADM DDF DTD with DOM, and I could not answer this question for myself:
- What's the difference between DOM Node and DOM Element?
Node Types
After studying some documents that I am 100% sure I read several years ago, I got the old answer back:
- The Node object represents a single node in the document tree.
- There are many types of Node, each representing a specific part of the structure of an XML document.
| NodeType | Description | Children Nodes |
| --- | --- | --- |
| Element | Represents an element | Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference |
| Attr | Represents an attribute | Text, EntityReference |
| Text | Represents textual content in an element or attribute | None |
| CDATASection | Represents a CDATA section in a document (text that will NOT be parsed by a parser) | None |
| Document | Represents the entire document (the root-node of the DOM tree) | Element (max. one), ProcessingInstruction, Comment, DocumentType |
Node Types table (description and relationship with each other)
| NodeType | Named Constant | NodeType Constant | getNodeName() return | getNodeValue() return |
| --- | --- | --- | --- | --- |
| Element | ELEMENT_NODE | 1 | Element name / tagName | null |
| Attr | ATTRIBUTE_NODE | 2 | Attribute name | Attribute value |
| Text | TEXT_NODE | 3 | #text | content of node |
| CDATASection | CDATA_SECTION_NODE | 4 | #cdata-section | content of node |
| Document | DOCUMENT_NODE | 9 | #document | null |
Node Types table (basic properties)
- An Element is a kind of Node; from the Java point of view, Element is a sub-interface of the Node interface.
- If a Node has NodeType == 1 (ELEMENT_NODE), we can say it is an Element.
- Element.getTagName() returns the same value as Element.getNodeName().
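The points above and the basic-properties table can be checked with a small program. This is only a minimal sketch: the class name `NodeTypeDemo`, the helper methods and the sample XML are made up for illustration and are not part of the platform mentioned earlier.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeTypeDemo {

    // Hypothetical sample document, used only for illustration.
    static final String XML = "<book id=\"42\"><title>Java</title></book>";

    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Formats node type, node name and node value, i.e. the columns of the table.
    static String describe(Node node) {
        return node.getNodeType() + " " + node.getNodeName() + " " + node.getNodeValue();
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        Element root = doc.getDocumentElement();
        Node title = root.getFirstChild();

        System.out.println(describe(doc));                         // Document node
        System.out.println(describe(root));                        // Element node
        System.out.println(describe(root.getAttributeNode("id"))); // Attr node
        System.out.println(describe(title.getFirstChild()));       // Text node

        // An Element is a Node whose type is ELEMENT_NODE, so the cast is safe after the check.
        if (title.getNodeType() == Node.ELEMENT_NODE) {
            Element asElement = (Element) title;
            System.out.println(asElement.getTagName().equals(title.getNodeName()));
        }
    }
}
```

Checking `getNodeType()` before casting to Element is the usual guard, because the children of an element can also be Text, Comment or other node types.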
JDK embedded DOM Parser
Normally we can get the entire XML Document object with the following Java code:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setNamespaceAware(false);
// factory.setSchema (myschema);
DocumentBuilder parser = factory.newDocumentBuilder();
// parser.setEntityResolver (new MyEntityResolver ());
// parser.setErrorHandler (new MyParseErrorHandler ());
// xmlInput is a placeholder: parse() accepts an InputStream, a File, an InputSource, or a String holding a URI
org.w3c.dom.Document document = parser.parse(xmlInput);
Then you can use org.w3c.dom.Document.getDocumentElement() to get the root element. Why? From the Document node description, we know that Document represents the entire document and contains exactly one Element node, so Document.getDocumentElement() gives us the root element directly.
Once we have the root element, we can introspect it to parse the XML data.
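For reference, here is a self-contained version of the loading code that reads XML from an in-memory string. It is a sketch under assumptions: the class name `LoadDocumentDemo`, the `load` helper and the sample XML are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class LoadDocumentDemo {

    // Hypothetical sample input, not taken from any real project.
    static final String XML =
            "<?xml version=\"1.0\"?><!-- catalog --><bookstore><book/></bookstore>";

    // Builds a non-validating, non-namespace-aware parser and loads the XML text.
    static Document load(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            factory.setNamespaceAware(false);
            DocumentBuilder parser = factory.newDocumentBuilder();
            // parse(String) would treat its argument as a URI,
            // so XML text has to be wrapped in an InputSource.
            return parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document document = load(XML);
        Element root = document.getDocumentElement();
        System.out.println(root.getTagName());

        // The comment before the root is a child of the Document, not of the
        // root element, so the Document has two children here.
        System.out.println(document.getChildNodes().getLength());
    }
}
```

This also illustrates why getDocumentElement() is more convenient than walking the Document's children: comments or processing instructions may sit next to the root element.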
Useful interfaces for DOM parser
Only the key, frequently used methods are listed here; for the other methods, please refer to the Javadoc.
Node.getNodeType(): short
Gets the node type (see the Node Types tables above). It is very useful in a parser, because each type of Node exposes its data through its own dedicated methods.
Node.getChildNodes(): NodeList
Get a NodeList that contains all children of this node.
Node.getAttributes(): NamedNodeMap
Get a NamedNodeMap containing the attributes of this node (if it is an Element) or null otherwise.
Node.getNodeName(): String
See the Node Types tables above.
Node.getNodeValue(): String
See the Node Types tables above.
Node.getTextContent(): String
Returns the text content of this node and its descendants.
If the current Node is the root element, getTextContent() returns all the text content inside the document (text in comments and processing instructions is excluded).
Element.getElementsByTagName(String name): NodeList
Returns a NodeList of all descendant Elements with a given tag name, in document order.
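The methods listed above can be tried together in one short program. This is a rough sketch: the class name `NodeApiDemo`, the `parseRoot` helper and the bookstore data are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeApiDemo {

    // Hypothetical sample data, written without whitespace between tags
    // so that no whitespace-only Text nodes appear among the children.
    static final String XML =
            "<bookstore>"
          + "<book lang=\"en\"><title>Java</title><author>Ann</author></book>"
          + "<book lang=\"fr\"><title>XML</title><author>Bob</author></book>"
          + "</bookstore>";

    // Parses the XML text and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot(XML);

        // getChildNodes() returns direct children only (the two book elements).
        NodeList children = root.getChildNodes();
        System.out.println(children.getLength());

        // getAttributes() returns a NamedNodeMap for an Element...
        NamedNodeMap attrs = children.item(0).getAttributes();
        System.out.println(attrs.getNamedItem("lang").getNodeValue());

        // ...and null for other node types, such as the Text node inside title.
        Node titleText = children.item(0).getFirstChild().getFirstChild();
        System.out.println(titleText.getAttributes() == null);

        // getTextContent() concatenates the text of all descendants.
        System.out.println(root.getTextContent());

        // getElementsByTagName() searches all descendants, in document order.
        NodeList authors = root.getElementsByTagName("author");
        for (int i = 0; i < authors.getLength(); i++) {
            System.out.println(authors.item(i).getTextContent());
        }
    }
}
```

Note that with indented XML the child list would also contain whitespace-only Text nodes, which is one more reason to check getNodeType() while looping.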
Introspect XML with DOM defined methods
We can now parse the XML document node by node with the interfaces above:
Step 1: Use DocumentBuilder to load the XML as a Document object.
Step 2: Get the root element and its list of child nodes.
Step 3: Loop over each Node in the child list (upper level), check its node type and handle it with your business logic. If the current Node has child nodes of its own (lower level), pause the processing of the current Node, loop over the lower-level child nodes until all of them are processed, then resume the upper-level Node.
Step 4: Once all the child nodes are processed, we have reached the end of the XML document.
This is just a draft summary; it will be updated later.
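The four steps above can be sketched as a recursive method. This is only one possible shape, not the actual parser of the platform: the class name `DomWalker`, the `dump` and `walk` helpers and the nested sample XML are hypothetical, and the "business logic" is reduced to appending an indented line per node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomWalker {

    // Hypothetical self-referencing sample: a Node element that contains another Node.
    static final String XML =
            "<Node><NodeName>A</NodeName><Node><NodeName>B</NodeName></Node></Node>";

    // Steps 1 and 2: load the document and start from the root element.
    static String dump(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Step 3: handle the current node, then descend into its children.
    // The recursive call "pauses" the upper level until the lower level is done.
    static void walk(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        short type = node.getNodeType();
        if (type == Node.ELEMENT_NODE) {
            out.append(indent).append(node.getNodeName()).append('\n');
        } else if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) { // skip whitespace-only text between tags
                out.append(indent).append('"').append(text).append('"').append('\n');
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    public static void main(String[] args) {
        // Step 4: when the outermost call returns, the whole tree has been visited.
        System.out.print(dump(XML));
    }
}
```

In a real parser the two `out.append(...)` branches are where the business logic for each node type would go.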
Locate Node with XPath
With the previous design, we have to loop over a lot of Nodes even if we only want the text of a single element such as:
/bookstore/category/country/book/author
Thanks to XPath, we can locate an Element easily by specifying it as a path, much like a file path in a file system.
You can create an XPath instance as follows:
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath(); // Create a new XPath instance
Then you can get a NodeList of all the author elements:
XPathExpression xexpr = xpath.compile("/bookstore/category/country/book/author");
NodeList nodes = (NodeList) xexpr.evaluate(document, XPathConstants.NODESET);
You can also get a single node as follows:
Node node = (Node) xexpr.evaluate(document, XPathConstants.NODE);
But if you are not sure whether the target Node is unique, ask for a NodeList instead of a single Node.
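Putting the XPath snippets together gives the following runnable sketch. The class name `XPathDemo`, its helper methods and the sample bookstore data are assumptions made for the example; only the path expression comes from the text above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathDemo {

    static final String PATH = "/bookstore/category/country/book/author";

    // Hypothetical sample data matching the path above.
    static final String XML =
            "<bookstore><category><country>"
          + "<book><author>Ann</author></book>"
          + "<book><author>Bob</author></book>"
          + "</country></category></bookstore>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    static XPathExpression compile(String path) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.compile(path);
    }

    // NODESET: every matching node, in document order.
    static List<String> authors(Document document) {
        try {
            NodeList nodes = (NodeList) compile(PATH).evaluate(document, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("bad XPath expression", e);
        }
    }

    // NODE: a single node, or null when nothing matches.
    static String firstAuthor(Document document) {
        try {
            Node node = (Node) compile(PATH).evaluate(document, XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("bad XPath expression", e);
        }
    }

    public static void main(String[] args) {
        Document document = parse(XML);
        System.out.println(authors(document));
        System.out.println(firstAuthor(document));
    }
}
```

With two matching author elements, the NODE variant silently returns only the first one, which is why the NodeList form is the safer default when uniqueness is not guaranteed.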
Advantages and Disadvantages:
As we can see from the DOM parsing methods, DOM loads the whole document into memory so that you can loop over the different nodes easily. This can become an issue when you design a system that exchanges big XML document files. In that case another XML parser, such as Digester or other SAX-based parsers, could be an alternative.
But I still think DOM provides a flexible way to parse an XML definition with a lot of self-referencing elements, like the OMADM DDF node: Node(Node+). Using DOM, we can write our own recursive parser like the one I described in the chapter "Introspect XML with DOM defined methods".
Maybe some other parser has a better solution, with a specific path expression that I do not know about. We will see.