Three Ways to Parse XML

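Parsing XML efficiently is something that comes up often when writing code. In the Java world there are three ways to process XML: DOM, SAX, and StAX. Plenty has been written online about these three models. So what characterizes each of them in actual use? Let's compare them side by side using three examples.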

First, create an XML file named data.xml:

<?xml version="1.0" encoding="UTF-8"?>
<greetings>
	<greeting id="g1">Hello DOM</greeting>
	<greeting id="g2">Hello SAX</greeting>
	<greeting id="g3">Hello StAX</greeting>
</greetings>


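Let's start by parsing this XML with DOM. DOM reads the whole XML document into memory at once and maps the data to Java objects that follow the DOM structure. The code below uses the org.w3c.dom.* interfaces to parse the file: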


package org.bluedash;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TryDom {
	public static void main(String args[]) throws Exception {
		DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
		DocumentBuilder builder = factory.newDocumentBuilder();
		Document doc = builder.parse("data.xml");
		Element elem = doc.getDocumentElement();
		NodeList list = elem.getChildNodes();
		for (int i = 0; i < list.getLength(); i++) {
			Node node = list.item(i);
			NamedNodeMap attributes = node.getAttributes();
			if (attributes != null) {
				for (int j = 0; j < attributes.getLength(); j++) {
					Node attr = attributes.item(j);
					System.out.println("attr name: " + attr.getNodeName());
					System.out.println("attr value: " + attr.getNodeValue());
				}
			}
			System.out.println("node name: " + node.getNodeName());
			System.out.println("node type: " + node.getNodeType());
			// getNodeValue() is null for element nodes and the text itself for text nodes
			System.out.println("node value: " + node.getNodeValue());
			System.out.println("content: " + node.getTextContent());
			System.out.println("---------------------");
		}
	}
}


The code prints the following:


node name: #text
node type: 3
node value: 
	
content: 
	
---------------------
attr name: id
attr value: g1
node name: greeting
node type: 1
node value: null
content: Hello DOM
---------------------
node name: #text
node type: 3
node value: 
	
content: 
	
---------------------
attr name: id
attr value: g2
node name: greeting
node type: 1
node value: null
content: Hello SAX
---------------------
node name: #text
node type: 3
node value: 
	
content: 
	
---------------------
attr name: id
attr value: g3
node name: greeting
node type: 1
node value: null
content: Hello StAX
---------------------
node name: #text
node type: 3
node value: 

content: 

---------------------


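Note that node type 3 denotes a text node (Node.TEXT_NODE); the text nodes printed here hold only whitespace. DOM treats the whitespace between the greeting elements as separate content too. The example shows what characterizes DOM: it reads all the XML data into memory at once and creates instances following the structure that DOM defines. For small XML files, parsing with DOM is very convenient, but for large files it becomes inefficient and wastes memory, so we need to parse the XML as a "stream" [1]. SAX and StAX are tools that do exactly that.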
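If the whitespace text nodes are not wanted, one option is to check the node type and keep only element nodes. The following is a minimal sketch of that idea, not part of the original project: the class name DomElementsOnly and the method readGreetings are made up for illustration, and the XML is parsed from a string so the example runs without data.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, used only for this illustration.
public class DomElementsOnly {

    // Returns "id=text" for every element child of the root, skipping text nodes.
    public static List<String> readGreetings(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList children = doc.getDocumentElement().getChildNodes();
        List<String> result = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // whitespace between elements shows up as TEXT_NODE (type 3)
            }
            Element greeting = (Element) node;
            result.add(greeting.getAttribute("id") + "=" + greeting.getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<greetings>\n"
                + "\t<greeting id=\"g1\">Hello DOM</greeting>\n"
                + "\t<greeting id=\"g2\">Hello SAX</greeting>\n"
                + "\t<greeting id=\"g3\">Hello StAX</greeting>\n"
                + "</greetings>\n";
        for (String line : readGreetings(xml)) {
            System.out.println(line);
        }
    }
}
```

With the sample document this prints one line per greeting element (g1=Hello DOM and so on) and none for the whitespace in between.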

Let's look at SAX first:


package org.bluedash;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class TrySAX extends HandlerBase {

	@Override
	public void characters(char[] ch, int start, int length)
			throws SAXException {
		String value = new String(ch, start, length);
		if (!value.trim().equals("")) {
			System.out.println("Text: " + value);
		}
	}

	@Override
	public void endDocument() throws SAXException {
		System.out.println("End Document");
		super.endDocument();
	}

	@Override
	public void endElement(String name) throws SAXException {
		System.out.println("End Element:" + name);
		super.endElement(name);
	}

	@Override
	public void startDocument() throws SAXException {
		System.out.println("Start Document.");
		super.startDocument();
	}

	@Override
	public void startElement(String name, AttributeList attributes)
			throws SAXException {
		System.out.println("Start Element: " + name);
		for (int i = 0, n = attributes.getLength(); i < n; ++i)
			System.out.println("Attribute: " + attributes.getName(i) + "="
					+ attributes.getValue(i));
		super.startElement(name, attributes);
	}

	public static void main(String args[]) throws Exception {
		InputStreamReader reader = new InputStreamReader(new FileInputStream(
				new File("data.xml")));

		InputSource source = new InputSource(reader);
		HandlerBase handler = new TrySAX();

		SAXParserFactory factory = SAXParserFactory.newInstance();
		SAXParser parser = factory.newSAXParser();

		parser.parse(source, handler);

	}
}


Running the program above prints the following:

Start Document.
Start Element: greetings
Start Element: greeting
Attribute: id=g1
Text: Hello DOM
End Element:greeting
Start Element: greeting
Attribute: id=g2
Text: Hello SAX
End Element:greeting
Start Element: greeting
Attribute: id=g3
Text: Hello StAX
End Element:greeting
End Element:greetings
End Document


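SAX reads the XML as a stream and breaks the document down into a series of events. When an event occurs, the method that corresponds to it is invoked, so we write whatever business logic we need inside those methods. Programming this way feels a bit odd, though: our business logic has to live in the handler class that receives the events (HandlerBase here), rather than in the business-logic class where we would expect it. This style is called the "push" [2] approach.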
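HandlerBase and AttributeList belong to the original SAX1 API and are deprecated; SAX2 replaced them with org.xml.sax.helpers.DefaultHandler and org.xml.sax.Attributes. The following is a minimal sketch of the same push-style handler on the SAX2 API. The class name Sax2Sketch and its parse method are assumptions made for illustration, and the events are collected in a list instead of being printed directly so they are easy to inspect.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX2 counterpart of the handler above, for illustration only.
public class Sax2Sketch extends DefaultHandler {

    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        events.add("Start Element: " + qName);
        for (int i = 0; i < attributes.getLength(); i++) {
            events.add("Attribute: " + attributes.getQName(i) + "=" + attributes.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String value = new String(ch, start, length);
        if (!value.trim().isEmpty()) { // ignore whitespace-only text
            events.add("Text: " + value);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("End Element: " + qName);
    }

    // The parser pushes events into the handler; we only read the result afterwards.
    public static List<String> parse(String xml) throws Exception {
        Sax2Sketch handler = new Sax2Sketch();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<greetings>\n\t<greeting id=\"g2\">Hello SAX</greeting>\n</greetings>";
        for (String event : parse(xml)) {
            System.out.println(event);
        }
    }
}
```

The control flow is unchanged: the parser still decides when each method is called, which is what makes this a push model.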

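So is it possible, instead of putting the business logic inside event methods, to have our own code drive the parser as it works through the XML? The answer is yes, and that is how StAX works. Unlike SAX, StAX uses a "pull" [3] approach to process XML. Again, an example shows how StAX is used: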

package org.bluedash;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class TryCursorMode {

	private void parseXML() throws IOException, XMLStreamException {

		InputStream in = new FileInputStream("data.xml");
		XMLInputFactory inFactory = XMLInputFactory.newInstance();
		XMLStreamReader r = inFactory.createXMLStreamReader(in);

		try {
			int event = r.getEventType();
			while (true) {
				switch (event) {
				case XMLStreamConstants.START_DOCUMENT:
					System.out.println("Start Document.");
					break;
				case XMLStreamConstants.START_ELEMENT:
					System.out.println("Start Element: " + r.getName());
					for (int i = 0, n = r.getAttributeCount(); i < n; ++i)
						System.out.println("Attribute: "
								+ r.getAttributeName(i) + "="
								+ r.getAttributeValue(i));

					break;
				case XMLStreamConstants.CHARACTERS:
					if (r.isWhiteSpace())
						break;

					System.out.println("Text: " + r.getText());
					break;
				case XMLStreamConstants.END_ELEMENT:
					System.out.println("End Element:" + r.getName());
					break;
				case XMLStreamConstants.END_DOCUMENT:
					System.out.println("End Document.");
					break;
				}

				if (!r.hasNext())
					break;

				event = r.next();
			}
		} finally {
			r.close();
			in.close(); // XMLStreamReader.close() does not close the underlying stream
		}

	}

	public static void main(String args[]) throws Exception {
		TryCursorMode demo = new TryCursorMode();
		demo.parseXML();

	}
}


Running this program gives the following result:


Start Document.
Start Element: greetings
Start Element: greeting
Attribute: id=g1
Text: Hello DOM
End Element:greeting
Start Element: greeting
Attribute: id=g2
Text: Hello SAX
End Element:greeting
Start Element: greeting
Attribute: id=g3
Text: Hello StAX
End Element:greeting
End Element:greetings
End Document.


As you can see, the design of the StAX API is very different from that of SAX: with StAX, the logic that processes the XML moves into our own main code.
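One practical consequence of the pull model is that the calling code decides when to stop asking for events. The sketch below illustrates this under stated assumptions: the class name StaxFindById and the method findGreeting are hypothetical, and the XML is read from a string rather than from data.xml. It returns as soon as the wanted greeting is found and never requests the remaining events.

```java
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, for illustration only.
public class StaxFindById {

    // Returns the text of the first <greeting> whose id matches, or null if none does.
    public static String findGreeting(String xml, String id) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (r.hasNext()) {
                int event = r.next(); // we pull the next event when we want it
                if (event == XMLStreamConstants.START_ELEMENT
                        && "greeting".equals(r.getLocalName())
                        && id.equals(r.getAttributeValue(null, "id"))) {
                    // Found it: read the element text and stop pulling events.
                    return r.getElementText();
                }
            }
            return null;
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<greetings>\n"
                + "\t<greeting id=\"g1\">Hello DOM</greeting>\n"
                + "\t<greeting id=\"g2\">Hello SAX</greeting>\n"
                + "\t<greeting id=\"g3\">Hello StAX</greeting>\n"
                + "</greetings>\n";
        System.out.println(findGreeting(xml, "g2"));
    }
}
```

Stopping early like this is harder with a SAX handler, because there the parser, not our code, controls the loop.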

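These three pieces of code show how the three ways of processing XML differ. Each approach has much more depth to explore; if you want to study further, see the material listed in the references.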

I have put the code used in this article on GitHub; if you are interested, clone it and try it out:

git clone git://github.com/liweinan/try-xml.git


Once the code has been downloaded, compile it first:

mvn install


Then run the DOM, SAX, and StAX examples with the following three commands:

mvn exec:java -Dexec.mainClass="net.bluedash.xml.TryDom"


mvn exec:java -Dexec.mainClass="net.bluedash.xml.TrySax"


mvn exec:java -Dexec.mainClass="net.bluedash.xml.TryCursorMode"


References:

[1] http://onjava.com/pub/a/onjava/2001/02/08/dom.html
[2] http://developerlife.com/tutorials/?p=29
[3] http://www.javacommerce.com/displaypage.jsp?name=saxparser1.sql&id=18232
[4] http://www.ibm.com/developerworks/xml/library/x-stax1.html
[5] http://www.ibm.com/developerworks/xml/library/x-tipstx2/
[6] http://www.xml.com/pub/a/2003/09/17/stax.html

Notes:

[1] Stream

[2] push

[3] pull
