org.dom4j.DocumentException: Error on line 1 of document: 前言中不允许有内容 (Content is not allowed in prolog)

Here is the approximate stack trace:

org.dom4j.DocumentException: Error on line 1 of document  : 前言中不允许有内容。 Nested exception: 前言中不允许有内容。
	at org.dom4j.io.SAXReader.read(SAXReader.java:482)
	at org.dom4j.DocumentHelper.parseText(DocumentHelper.java:278)
	at com.apobates.parser.RssParser.build(RssParser.java:38)
	at com.apobates.machine.reader.Reader.mainParser(Reader.java:57)
	at com.apobates.machine.reader.Reader.load(Reader.java:37)
	at com.apobates.test.ParserEntityTest.main(ParserEntityTest.java:41)
Nested exception: 
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; 前言中不允许有内容。
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.dom4j.io.SAXReader.read(SAXReader.java:465)
	at org.dom4j.DocumentHelper.parseText(DocumentHelper.java:278)
	at com.apobates.parser.RssParser.build(RssParser.java:38)
	at com.apobates.machine.reader.Reader.mainParser(Reader.java:57)
	at com.apobates.machine.reader.Reader.load(Reader.java:37)
	at com.apobates.test.ParserEntityTest.main(ParserEntityTest.java:41)
Nested exception: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; 前言中不允许有内容。
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.dom4j.io.SAXReader.read(SAXReader.java:465)
	at org.dom4j.DocumentHelper.parseText(DocumentHelper.java:278)
	at com.apobates.parser.RssParser.build(RssParser.java:38)
	at com.apobates.machine.reader.Reader.mainParser(Reader.java:57)
	at com.apobates.machine.reader.Reader.load(Reader.java:37)
	at com.apobates.test.ParserEntityTest.main(ParserEntityTest.java:41)

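The exception above was thrown after I used HttpGet to read the response of http://answers.microsoft.com/en-us/feed/f/ie into a string and then ran the following line in the build method of RssParser:

Document doc=DocumentHelper.parseText(responseText);

I printed the string to the console and saved it as an XML file, and the XML did not appear to be malformed. I then searched for "前言中不允许有内容" but could not find any concept called "前言" (foreword) in XML. On Google, however, I found an English error message that matches it closely: Content is not allowed in prolog.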

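In other words, content is not allowed in the prolog (序言), and the "前言" in the localized message is simply the XML prolog. Now I had a lead. Here is what the XML prolog consists of: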

The prolog refers to the information that appears before the start tag of the document or root element. It includes information that applies to the document as a whole, such as character encoding, document structure, and style sheets.

The definition above comes from MSDN: http://msdn.microsoft.com/en-us/library/vstudio/ms256037(v=vs.100).aspx
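To make the definition concrete: the XML declaration, if present, must be the very first thing in the document, so any character in front of it counts as content in the prolog. The sketch below uses the JDK's built-in parser rather than dom4j so that it runs without extra libraries. The class name PrologCheck and the sample documents are made up for illustration. It checks the reported line number instead of the message text, because the message is localized (as the Chinese text in the stack trace shows).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class PrologCheck {
    /** Parses xml from a character stream; returns null on success, else the parse error. */
    static SAXParseException parseError(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e; // carries the line and column of the offending character
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical documents: one clean, one with a stray character before the declaration.
        String clean = "<?xml version=\"1.0\"?><rss version=\"2.0\"/>";
        String dirty = "x" + clean;
        System.out.println("clean: " + parseError(clean));
        System.out.println("dirty: " + parseError(dirty));
    }
}
```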

The prolog of the answers feed XML contains:

[XML markup lost; remaining text: "Internet Explorer Category - All Threads", "en-us"]
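Since the parser complains about lineNumber: 1; columnNumber: 1, it may help to look at the very first characters of the fetched string as code points rather than as rendered text, because some characters (a byte order mark, U+FEFF, for example) are invisible when printed to a console. A minimal sketch, assuming the response is held in a String. The class and method names are hypothetical and not part of dom4j:

```java
final class LeadingCharsDump {
    /** Returns the first `limit` UTF-16 code units of `text` as U+XXXX tokens. */
    static String dump(String text, int limit) {
        StringBuilder out = new StringBuilder();
        int n = Math.min(limit, text.length());
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", (int) text.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical response text with an invisible mark in front of the declaration.
        String responseText = "\uFEFF<?xml version=\"1.0\"?><rss/>";
        System.out.println(dump(responseText, 3)); // prints "U+FEFF U+003C U+003F"
    }
}
```

If the first token is anything other than U+003C (the "<" character), the parser sees content before the declaration.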

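The xml version declaration is certainly fine, and it is followed by the rss element. The problem seemed to have something to do with namespaces: when I used a regex to strip the a10: prefix inside the item elements, the problem went away. Strangest of all, the following code also works without any problem: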

		// Let SAXReader fetch and parse the feed directly from the URL
		SAXReader xmlReader = new SAXReader();
		List rs=new ArrayList();
		try {
			Document doc=xmlReader.read(new URL("http://answers.microsoft.com/en-us/feed/f/ie"));
			
			List list = doc.selectNodes("//item");
			//ETC
		} catch (MalformedURLException e) {
			// The feed URL is malformed
			e.printStackTrace();
		} catch (DocumentException e) {
			// The feed could not be read or parsed
			e.printStackTrace();
		}
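One possible explanation for the difference, offered as a guess rather than a confirmed diagnosis of this feed: when the parser reads the bytes itself, it can recognize and skip a UTF-8 byte order mark, but once the bytes have been decoded into a Java String the mark survives as the character U+FEFF, which then sits in front of the declaration. The sketch below illustrates that behavior with the JDK's built-in parser instead of dom4j. The class name and the sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class BomParseDemo {
    /** Sample bytes: a UTF-8 byte order mark followed by a tiny document. */
    static byte[] sampleWithBom() {
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        byte[] xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><rss version=\"2.0\"/>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] all = new byte[bom.length + xml.length];
        System.arraycopy(bom, 0, all, 0, bom.length);
        System.arraycopy(xml, 0, all, bom.length, xml.length);
        return all;
    }

    /** Returns true if the source parses, false if the parser rejects it. */
    static boolean parses(InputSource source) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(source);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = sampleWithBom();
        // Byte stream: the parser detects the encoding and handles the mark itself.
        System.out.println("bytes:  " + parses(new InputSource(new ByteArrayInputStream(data))));
        // Character stream: new String(...) keeps the mark as U+FEFF at index 0.
        String text = new String(data, StandardCharsets.UTF_8);
        System.out.println("string: " + parses(new InputSource(new StringReader(text))));
    }
}
```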

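dom4j seems to be too strict in DocumentHelper.parseText. I kept wanting to turn off namespace checking before that line of code, but after several days of poring over the API I found no way to do so.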
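If the string really does start with something other than the declaration (an assumption; the post does not record what the first character was), a workaround that needs no parser setting is to drop everything before the first "<" before calling DocumentHelper.parseText. A minimal sketch with a hypothetical helper name:

```java
final class PrologCleaner {
    /** Removes any characters that precede the first '<'; returns the input unchanged otherwise. */
    static String dropBeforeFirstTag(String text) {
        int start = text.indexOf('<');
        return start > 0 ? text.substring(start) : text;
    }

    public static void main(String[] args) {
        // Hypothetical response text with a byte order mark and a blank line in front.
        String responseText = "\uFEFF\n<?xml version=\"1.0\"?><rss/>";
        System.out.println(dropBeforeFirstTag(responseText));
    }
}
```

With dom4j on the classpath, the call in RssParser.build would then read DocumentHelper.parseText(PrologCleaner.dropBeforeFirstTag(responseText)). This hides the stray characters rather than explaining them, so it is worth checking where they come from first.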
