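Parsing XML efficiently is something that comes up all the time when writing code. In the Java world there are three ways to process XML: DOM, SAX, and StAX. Plenty has already been written online about these three parsing models, but what is each of them actually like to use? Let's compare them side by side with three examples.
First, create an XML file named data.xml: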
<?xml version="1.0" encoding="UTF-8"?>
<greetings>
<greeting id="g1">Hello DOM</greeting>
<greeting id="g2">Hello SAX</greeting>
<greeting id="g3">Hello StAX</greeting>
</greetings>
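Let's start by parsing this XML with DOM. DOM reads the whole XML document into memory in one go and maps the XML data onto Java objects that follow the DOM structure. The code below uses org.w3c.dom.* to parse the file: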
package org.bluedash;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TryDom {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Parse the whole file into an in-memory DOM tree.
        Document doc = builder.parse("data.xml");
        Element elem = doc.getDocumentElement();
        // The children include whitespace text nodes as well as the greeting elements.
        NodeList list = elem.getChildNodes();
        for (int i = 0; i < list.getLength(); i++) {
            Node node = list.item(i);
            // Only element nodes carry attributes; for text nodes this is null.
            NamedNodeMap attributes = node.getAttributes();
            if (attributes != null) {
                for (int j = 0; j < attributes.getLength(); j++) {
                    Node attr = attributes.item(j);
                    System.out.println("attr name: " + attr.getNodeName());
                    System.out.println("attr value: " + attr.getNodeValue());
                }
            }
            System.out.println("node name: " + node.getNodeName());
            System.out.println("node type: " + node.getNodeType());
            // getNodeValue() is null for elements and the text itself for text nodes.
            System.out.println("node value: " + node.getNodeValue());
            System.out.println("content: " + node.getTextContent());
            System.out.println("---------------------");
        }
    }
}
The output is as follows (the line breaks held by the whitespace-only text nodes are omitted):
node name: #text
node type: 3
node value:
content:
---------------------
attr name: id
attr value: g1
node name: greeting
node type: 1
node value: null
content: Hello DOM
---------------------
node name: #text
node type: 3
node value:
content:
---------------------
attr name: id
attr value: g2
node name: greeting
node type: 1
node value: null
content: Hello SAX
---------------------
node name: #text
node type: 3
node value:
content:
---------------------
attr name: id
attr value: g3
node name: greeting
node type: 1
node value: null
content: Hello StAX
---------------------
node name: #text
node type: 3
node value:
content:
---------------------
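Note that node type 3 is a text node (Node.TEXT_NODE); the text nodes here contain nothing but whitespace. DOM treats the whitespace between the greeting elements as content in its own right. As the example shows, DOM reads the XML data into memory all at once and creates instances that follow the structure DOM prescribes. For small XML files, parsing with DOM is very convenient. For large files, though, the whole tree has to be held in memory, so parsing becomes inefficient and wastes a lot of memory. In that case we need to parse the XML as a "stream" [1], and SAX and StAX are tools for exactly that.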
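If the whitespace text nodes are just noise for your purposes, one common way to deal with them is to check the node type and keep only the elements. The following is a minimal sketch, not part of the original sample project: the class name DomElementsOnly and the method readGreetings are made up for illustration, and the XML is passed in as a string so the snippet does not depend on data.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, named here only for illustration.
class DomElementsOnly {

    // Returns "id=text" for every element child of the root, skipping text nodes.
    static List<String> readGreetings(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> result = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                // Whitespace between elements shows up as TEXT_NODE (type 3); ignore it.
                if (node.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                Element element = (Element) node;
                result.add(element.getAttribute("id") + "=" + element.getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<greetings>\n"
                + "<greeting id=\"g1\">Hello DOM</greeting>\n"
                + "<greeting id=\"g2\">Hello SAX</greeting>\n"
                + "<greeting id=\"g3\">Hello StAX</greeting>\n"
                + "</greetings>";
        // Prints: [g1=Hello DOM, g2=Hello SAX, g3=Hello StAX]
        System.out.println(readGreetings(xml));
    }
}
```

The whole document is still loaded into memory first, so this only tidies up the traversal; it does nothing for the large-file problem.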
Let's look at SAX first:
package org.bluedash;

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// DefaultHandler (SAX2) is used in place of the deprecated SAX1 HandlerBase.
public class TrySAX extends DefaultHandler {
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String value = new String(ch, start, length);
        // Skip the whitespace-only text between elements.
        if (!value.trim().isEmpty()) {
            System.out.println("Text: " + value);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        System.out.println("End Document");
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        System.out.println("End Element:" + qName);
    }

    @Override
    public void startDocument() throws SAXException {
        System.out.println("Start Document.");
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
        System.out.println("Start Element: " + qName);
        for (int i = 0, n = attributes.getLength(); i < n; ++i) {
            System.out.println("Attribute: " + attributes.getQName(i) + "="
                    + attributes.getValue(i));
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();
        // Passing the File lets the parser detect the encoding from the XML declaration.
        parser.parse(new File("data.xml"), new TrySAX());
    }
}
Running the program produces the following output:
Start Document.
Start Element: greetings
Start Element: greeting
Attribute: id=g1
Text: Hello DOM
End Element:greeting
Start Element: greeting
Attribute: id=g2
Text: Hello SAX
End Element:greeting
Start Element: greeting
Attribute: id=g3
Text: Hello StAX
End Element:greeting
End Element:greetings
End Document
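SAX reads the XML as a stream and breaks it down into a series of events. When the parser encounters an event, it calls the method that corresponds to that event, so all we have to do is write our business logic inside those event methods. This is a slightly odd way to write a program, though: the business logic has to be packed into the handler class that receives the events, rather than into the business classes where we would like it to live. This style is known as the "push" [2] approach.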
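To make the point about push parsing concrete, here is a small sketch of what happens when the "business logic" is simply collecting the greetings into a list. It is an illustration under assumptions rather than code from the sample project: the class GreetingCollector and its collect method are invented names, and it extends the SAX2 DefaultHandler. Notice how the state (the current id, the text buffer, the result list) has to live inside the handler, because the parser decides when each method is called.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, named here only for illustration.
class GreetingCollector extends DefaultHandler {
    private final List<String> greetings = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String currentId;

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        if ("greeting".equals(qName)) {
            // Remember the id and start a fresh text buffer for this element.
            currentId = attributes.getValue("id");
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks, so accumulate it.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("greeting".equals(qName)) {
            greetings.add(currentId + "=" + text.toString().trim());
        }
    }

    // Parses the given XML string and returns one "id=text" entry per greeting.
    static List<String> collect(String xml) {
        try {
            GreetingCollector handler = new GreetingCollector();
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.greetings;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<greetings>\n"
                + "<greeting id=\"g1\">Hello DOM</greeting>\n"
                + "<greeting id=\"g2\">Hello SAX</greeting>\n"
                + "</greetings>";
        // Prints: [g1=Hello DOM, g2=Hello SAX]
        System.out.println(collect(xml));
    }
}
```

Even for a task this small, the logic is spread across three callbacks, which is the awkwardness described above.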
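Is it possible, then, to keep the business logic out of the event methods and have our own code drive the parsing instead? The answer is yes, and that is how StAX works. Unlike SAX, StAX uses a "pull" [3] approach to process XML. Again, an example shows how it is used: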
package org.bluedash;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class TryCursorMode {
    private void parseXML() throws IOException, XMLStreamException {
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        // XMLStreamReader.close() does not close the underlying stream,
        // so the file is closed by try-with-resources.
        try (InputStream in = new FileInputStream("data.xml")) {
            XMLStreamReader r = inFactory.createXMLStreamReader(in);
            try {
                // The reader starts out positioned on START_DOCUMENT.
                int event = r.getEventType();
                while (true) {
                    switch (event) {
                    case XMLStreamConstants.START_DOCUMENT:
                        System.out.println("Start Document.");
                        break;
                    case XMLStreamConstants.START_ELEMENT:
                        System.out.println("Start Element: " + r.getName());
                        for (int i = 0, n = r.getAttributeCount(); i < n; ++i) {
                            System.out.println("Attribute: "
                                    + r.getAttributeName(i) + "="
                                    + r.getAttributeValue(i));
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        // Skip the whitespace-only text between elements.
                        if (r.isWhiteSpace()) {
                            break;
                        }
                        System.out.println("Text: " + r.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        System.out.println("End Element:" + r.getName());
                        break;
                    case XMLStreamConstants.END_DOCUMENT:
                        System.out.println("End Document.");
                        break;
                    }
                    if (!r.hasNext()) {
                        break;
                    }
                    // Our code asks for the next event; nothing is pushed to us.
                    event = r.next();
                }
            } finally {
                r.close();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        TryCursorMode demo = new TryCursorMode();
        demo.parseXML();
    }
}
Running this program gives the following result:
Start Document.
Start Element: greetings
Start Element: greeting
Attribute: id=g1
Text: Hello DOM
End Element:greeting
Start Element: greeting
Attribute: id=g2
Text: Hello SAX
End Element:greeting
Start Element: greeting
Attribute: id=g3
Text: Hello StAX
End Element:greeting
End Element:greetings
End Document.
As you can see, the design of the StAX API is very different from that of SAX: with StAX, the logic for processing the XML moves into our own main code.
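One practical consequence of the pull style is that the calling code decides how far to read. The sketch below is illustrative only and not part of the sample project; the class GreetingFinder and the method findGreeting are hypothetical names. It looks up a single greeting by id and returns as soon as it is found, without asking the reader for the rest of the document.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, named here only for illustration.
class GreetingFinder {

    // Returns the text of the greeting with the given id, or null if there is none.
    static String findGreeting(String xml, String id) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (r.hasNext()) {
                    // We pull the next event ourselves and decide what to do with it.
                    if (r.next() == XMLStreamConstants.START_ELEMENT
                            && "greeting".equals(r.getLocalName())
                            && id.equals(r.getAttributeValue(null, "id"))) {
                        // Found it: read the element text and stop pulling events.
                        return r.getElementText();
                    }
                }
                return null;
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<greetings>\n"
                + "<greeting id=\"g1\">Hello DOM</greeting>\n"
                + "<greeting id=\"g2\">Hello SAX</greeting>\n"
                + "<greeting id=\"g3\">Hello StAX</greeting>\n"
                + "</greetings>";
        // Prints: Hello SAX
        System.out.println(findGreeting(xml, "g2"));
    }
}
```

The lookup reads like an ordinary loop in a business method, which is the shift in control that the pull approach is about.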
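These three pieces of code show how the three ways of processing XML differ. There is a lot more to say about each of them in depth; if you want to learn more, have a look at the material listed in the references.
I have put the code used in this article on GitHub. If you are interested, clone a copy and play with it: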
git clone git://github.com/liweinan/try-xml.git
Once the code has been downloaded, build it first:
mvn install
Then use the following three commands to run the DOM, SAX, and StAX examples respectively:
mvn exec:java -Dexec.mainClass="net.bluedash.xml.TryDom"
mvn exec:java -Dexec.mainClass="net.bluedash.xml.TrySax"
mvn exec:java -Dexec.mainClass="net.bluedash.xml.TryCursorMode"
References:
[1] http://onjava.com/pub/a/onjava/2001/02/08/dom.html
[2] http://developerlife.com/tutorials/?p=29
[3] http://www.javacommerce.com/displaypage.jsp?name=saxparser1.sql&id=18232
[4] http://www.ibm.com/developerworks/xml/library/x-stax1.html
[5] http://www.ibm.com/developerworks/xml/library/x-tipstx2/
[6] http://www.xml.com/pub/a/2003/09/17/stax.html
Notes:
[1] Stream
[2] push
[3] pull