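6 W3C DOM access to the loaded page
6.1 The mozdom4java library
Accessing the W3C DOM tree is preferable to accessing Mozilla's own DOM tree, because the W3C DOM is the standard way to
access the DOM of HTML and XML documents dynamically. To do this, we use a Java bridge from the Mozilla DOM to the W3C DOM,
provided by a project called mozdom4java: http://mozdom4java.mozdev.org/index.html.
After downloading the package, put its jar on the classpath. As an example, we add a button that extracts all the links in the HTML document.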
// When the button is pressed, we obtain the HTML document corresponding to
// the URL loaded in the browser. Next, we extract all its elements with the
// 'a' tag name and print their content.
final ToolItem anchorItem = new ToolItem(toolbar, SWT.PUSH);
anchorItem.setImage(getImage("resources/anchors.png"));
anchorItem.addSelectionListener(new SelectionAdapter() {
    public void widgetSelected(SelectionEvent event) {
        // First, we obtain a Mozilla DOM Document representation
        nsIDOMDocument doc = browser.getDocument();
        // Get all anchors from the loaded HTML document
        nsIDOMNodeList nodeList = doc.getElementsByTagName("a");
        for (int i = 0; i < nodeList.getLength(); i++) {
            // Get the Mozilla DOM node
            nsIDOMNode mozNode = nodeList.item(i);
            // Get the appropriate interface
            nsIDOMHTMLAnchorElement mozAnchor =
                    (nsIDOMHTMLAnchorElement) mozNode.queryInterface(
                            nsIDOMHTMLAnchorElement.NS_IDOMHTMLANCHORELEMENT_IID);
            // Get the corresponding W3C DOM node
            HTMLAnchorElement a = (HTMLAnchorElement)
                    HTMLAnchorElementImpl.getDOMInstance(mozAnchor);
            // Test the HTML element
            System.out.println("Tag Name: " + a.getNodeName()
                    + " -- Text: " + a.getTextContent()
                    + " -- Href: " + a.getHref());
        }
    }
});
...
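6.2 Patching mozdom4java to convert the Mozilla DOM tree into a W3C DOM tree
If we always want to work with the W3C DOM tree, converting nodes one at a time becomes tedious, so we suggest modifying
mozdom4java. In our view, these changes simplify the code because we can forget about Mozilla DOM nodes. Later, when we
discuss XPath, the evaluator will return a list of nodes, and W3C elements are more convenient to work with than Mozilla
nodes. In other words, our goal is to build a usable web browser that can be driven in a standard way, without any
knowledge of the Mozilla implementation.
First, we need the Java Language Binding for the DOM Level 2 specification. The easier approach is to download the jars
of the mozdom4java project, http://www.mozdev.org/source/browse/mozdom4java/src/jars/, because they contain all the
required files, including the hand-written extensions, so there is nothing else to take care of. We also need the Mozilla
interfaces. The required files are:
w3chtml.jar contains the W3C DOM HTML Level 2 interfaces, split into two packages, org.w3c.dom.html and org.w3c.dom.html2
w3cextension.jar contains the KeyEvent class in the org.w3c.dom.events package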
MozillaInterfaces.jar
MozillaGlue.jar
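Once these jars are on the classpath, mozdom4java should compile cleanly (no errors, though possibly some warnings). Next
we modify the mozdom4java source code, explaining the changes file by file. Of course, you can also simply download the
already patched jar.
To patch the library by hand, follow these steps:
We first create a factory for HTML elements: a class that converts a Mozilla DOM element node into the corresponding W3C
DOM element node. The class below does exactly that and is heavily commented. It uses Java reflection for the conversion,
so you do not need to know anything about the Mozilla DOM nodes.
Note: although the code is long, it is very simple. It just uses reflection to call the getDOMInstance method shown earlier.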
package es.ladyr.dom;

import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

import org.mozilla.interfaces.*;
import org.w3c.dom.html.HTMLElement;

public class HTMLElementFactory {

    private static HTMLElementFactory instance;

    // Maps lower-case tag names to the element type used in class names
    private Map<String, String> corresp;

    private HTMLElementFactory() {
        initCorrespondence();
    }

    public static HTMLElementFactory getInstance() {
        if (instance == null) {
            instance = new HTMLElementFactory();
        }
        return instance;
    }

    public static HTMLElement getHTMLElement(nsIDOMNode nsNode) {
        return getInstance().getConcreteNode(nsNode);
    }

    private void initCorrespondence() {
        corresp = new HashMap<String, String>();
        corresp.put("a", "Anchor");
        corresp.put("applet", "Applet");
        corresp.put("area", "Area");
        corresp.put("base", "Base");
        corresp.put("basefont", "BaseFont");
        corresp.put("body", "Body");
        corresp.put("br", "BR");
        corresp.put("button", "Button");
        corresp.put("dir", "Directory");
        corresp.put("div", "Div");
        corresp.put("dl", "DList");
        corresp.put("fieldset", "FieldSet");
        corresp.put("font", "Font");
        corresp.put("form", "Form");
        corresp.put("frame", "Frame");
        corresp.put("frameset", "FrameSet");
        corresp.put("head", "Head");
        corresp.put("h1", "Heading");
        corresp.put("h2", "Heading");
        corresp.put("h3", "Heading");
        corresp.put("h4", "Heading");
        corresp.put("h5", "Heading");
        corresp.put("h6", "Heading");
        corresp.put("hr", "HR");
        corresp.put("html", "Html");
        corresp.put("iframe", "IFrame");
        corresp.put("img", "Image");
        corresp.put("input", "Input");
        corresp.put("isindex", "IsIndex");
        corresp.put("label", "Label");
        corresp.put("legend", "Legend");
        corresp.put("li", "LI");
        corresp.put("link", "Link");
        corresp.put("map", "Map");
        corresp.put("menu", "Menu");
        corresp.put("meta", "Meta");
        corresp.put("ins", "Mod");
        corresp.put("del", "Mod");
        corresp.put("object", "Object");
        corresp.put("ol", "OList");
        corresp.put("optgroup", "OptGroup");
        corresp.put("option", "Option");
        corresp.put("p", "Paragraph");
        corresp.put("param", "Param");
        corresp.put("pre", "Pre");
        corresp.put("q", "Quote");
        corresp.put("script", "Script");
        corresp.put("select", "Select");
        corresp.put("style", "Style");
        corresp.put("caption", "TableCaption");
        corresp.put("td", "TableCell");
        corresp.put("col", "TableCol");
        corresp.put("table", "Table");
        corresp.put("tr", "TableRow");
        corresp.put("thead", "TableSection");
        corresp.put("tfoot", "TableSection");
        corresp.put("tbody", "TableSection");
        corresp.put("textarea", "TextArea");
        corresp.put("title", "Title");
        corresp.put("ul", "UList");
    }

    /**
     * Try to convert a Mozilla DOM node into a W3C DOM element.
     *
     * @param nsNode node to convert into a W3C DOM element.
     * @return W3C HTML element corresponding to the Mozilla DOM node, or
     *         null if the node is not an element or its type is unknown.
     */
    public HTMLElement getConcreteNode(nsIDOMNode nsNode) {
        // Only element nodes are converted. If the Mozilla node
        // isn't a Mozilla DOM element, we cannot convert it into
        // a W3C DOM element
        if (nsNode.getNodeType() == nsIDOMNode.ELEMENT_NODE) {
            // We use a hash map to obtain element names from node names
            String htmlElementType = corresp.get(nsNode.getNodeName()
                    .toLowerCase());
            // If we don't know the element type, we cannot transform
            // that node into a W3C DOM element
            if (htmlElementType == null) {
                return null;
            }
            // Compose the class name for the Mozilla DOM element
            String nsClassName = "org.mozilla.interfaces.nsIDOMHTML"
                    + htmlElementType + "Element";
            // Compose the field name for the element IID
            String nsFieldInterfaceName = "NS_IDOMHTML"
                    + htmlElementType.toUpperCase() + "ELEMENT_IID";
            try {
                // Once we have their names, obtain the class and the field
                Class<?> nsClass = Class.forName(nsClassName);
                Field field = nsClass.getField(nsFieldInterfaceName);
                // Get the field value (it is a static field, so the argument is ignored)
                String iid = (String) field.get(null);
                // Get the appropriate node interface
                Object nsElement = nsNode.queryInterface(iid);
                // Build the W3C DOM element implementation class name
                // (the package org.mozilla.dom.html contains concrete implementations
                // of the W3C HTML element interfaces)
                String w3cClassName = "org.mozilla.dom.html.HTML"
                        + htmlElementType + "ElementImpl";
                // Obtain the class of the corresponding W3C DOM element implementation
                Class<?> w3cClass = Class.forName(w3cClassName);
                // Look up the method that must be invoked to transform the element
                Method creationMethod = w3cClass.getMethod("getDOMInstance", nsClass);
                // Invoke the static getDOMInstance method, which returns an
                // instance of the corresponding W3C HTML element
                HTMLElement node = (HTMLElement) creationMethod.invoke(null, nsElement);
                return node;
            } catch (Exception e) {
                throw new Error(e);
            }
        }
        return null;
    }
}
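The name-composition and reflective-call pattern used by the factory can be tried in isolation, without any Mozilla jars.
The following is only a minimal sketch under stated assumptions: ReflectiveFactorySketch, AnchorElementImpl, DivElementImpl
and wrap are hypothetical stand-ins invented for illustration, and plain strings take the place of DOM nodes. It is not
part of mozdom4java.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ReflectiveFactorySketch {

    // Hypothetical stand-ins for the wrapper classes: each one exposes a
    // static getDOMInstance method, which is the shape the factory relies on.
    public static class AnchorElementImpl {
        public static String getDOMInstance(String raw) {
            return "Anchor wrapper for " + raw;
        }
    }

    public static class DivElementImpl {
        public static String getDOMInstance(String raw) {
            return "Div wrapper for " + raw;
        }
    }

    // Tag name -> element type, a two-entry version of the factory's table.
    private static final Map<String, String> CORRESP = new HashMap<>();
    static {
        CORRESP.put("a", "Anchor");
        CORRESP.put("div", "Div");
    }

    /** Returns the wrapper for a tag, or null when the tag is not in the table. */
    public static String wrap(String tagName, String raw) {
        String type = CORRESP.get(tagName.toLowerCase(Locale.ROOT));
        if (type == null) {
            return null;
        }
        // Compose the class name from the element type. Nested classes are
        // named Outer$Inner at runtime.
        String className = ReflectiveFactorySketch.class.getName()
                + "$" + type + "ElementImpl";
        try {
            Class<?> implClass = Class.forName(className);
            Method creationMethod = implClass.getMethod("getDOMInstance", String.class);
            // The method is static, so the receiver argument is null.
            return (String) creationMethod.invoke(null, raw);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot wrap <" + tagName + ">", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrap("A", "<a href='#'>"));
        System.out.println(wrap("span", "<span>"));
    }
}
```

A tag that is missing from the table yields null here, which mirrors how the factory signals that the caller should fall
back to a generic element.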
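With our HTMLElementFactory class in place, we modify the NodeFactory class. After the change you can call
org.w3c.dom.Node getNodeInstance(nsIDOMNode node), and when the input node is of type nsIDOMNode.ELEMENT_NODE the result
is the corresponding W3C DOM element.
The modified code is as follows: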
...
// Import our factory to create W3C HTML elements from Mozilla DOM elements
import es.ladyr.dom.HTMLElementFactory;
...
    public static Node getNodeInstance(nsIDOMNode node)
    {
        if (node == null) {
            return null;
        }
        switch (node.getNodeType())
        {
            case nsIDOMNode.ELEMENT_NODE:
                // Use our factory to obtain a W3C HTML DOM element
                Node htmlElement = HTMLElementFactory.getHTMLElement(node);
                if (htmlElement != null) {
                    return htmlElement;
                } else {
                    // If the factory cannot convert the concrete node (for instance,
                    // the type is unknown to our factory implementation), then
                    // return a generic W3C DOM element
                    return ElementImpl.getDOMInstance((nsIDOMElement) node
                            .queryInterface(nsIDOMElement.NS_IDOMELEMENT_IID));
                }
...
The complete code of the NodeFactory class is as follows:
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
*
* Software distributed under the License is distributed on an "AS IS" basis,
* WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
* for the specific language governing rights and limitations under the
* License.
*
* The Original Code is mozdom4java
*
* The Initial Developer of the Original Code is
* Peter Szinek, Lixto Software GmbH, http://www.lixto.com.
* Portions created by the Initial Developer are Copyright (C) 2005-2006
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Peter Szinek ([email protected])
* Michal Ceresna ([email protected])
*
* Alternatively, the contents of this file may be used under the terms of
* either the GNU General Public License Version 2 or later (the "GPL"), or
* the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
* in which case the provisions of the GPL or the LGPL are applicable instead
* of those above. If you wish to allow use of your version of this file only
* under the terms of either the GPL or the LGPL, and not to allow others to
* use your version of this file under the terms of the MPL, indicate your
* decision by deleting the provisions above and replace them with the notice
* and other provisions required by the GPL or the LGPL. If you do not delete
* the provisions above, a recipient may use your version of this file under
* the terms of any one of the MPL, the GPL or the LGPL.
*
* ***** END LICENSE BLOCK ***** */
import org.w3c.dom.Node;
import org.mozilla.dom.*;
import org.mozilla.interfaces.*;
// Import our factory to create W3C HTML elements from Mozilla DOM elements
import es.ladyr.dom.HTMLElementFactory;

public class NodeFactory
{
    private NodeFactory()
    {}

    public static Node getNodeInstance(nsIDOMEventTarget eventTarget)
    {
        if (eventTarget == null) {
            return null;
        }
        nsIDOMNode node = (nsIDOMNode) eventTarget.queryInterface(nsIDOMNode.NS_IDOMNODE_IID);
        return getNodeInstance(node);
    }

    public static Node getNodeInstance(nsIDOMNode node)
    {
        if (node == null) {
            return null;
        }
        switch (node.getNodeType())
        {
            case nsIDOMNode.ELEMENT_NODE:
                // Use our factory to obtain a W3C HTML DOM element
                Node htmlElement = HTMLElementFactory.getHTMLElement(node);
                if (htmlElement != null) {
                    return htmlElement;
                } else {
                    // If the factory cannot convert the concrete node (for instance,
                    // the type is unknown to our factory implementation), then
                    // return a generic W3C DOM element
                    return ElementImpl.getDOMInstance((nsIDOMElement) node
                            .queryInterface(nsIDOMElement.NS_IDOMELEMENT_IID));
                }
            case nsIDOMNode.ATTRIBUTE_NODE:
                return AttrImpl.getDOMInstance((nsIDOMAttr)
                        node.queryInterface(nsIDOMAttr.NS_IDOMATTR_IID));
            case nsIDOMNode.TEXT_NODE:
                return TextImpl.getDOMInstance((nsIDOMText)
                        node.queryInterface(nsIDOMText.NS_IDOMTEXT_IID));
            case nsIDOMNode.CDATA_SECTION_NODE:
                return CDATASectionImpl.getDOMInstance((nsIDOMCDATASection)
                        node.queryInterface(nsIDOMCDATASection.NS_IDOMCDATASECTION_IID));
            case nsIDOMNode.ENTITY_REFERENCE_NODE:
                return EntityReferenceImpl.getDOMInstance((nsIDOMEntityReference)
                        node.queryInterface(nsIDOMEntityReference.NS_IDOMENTITYREFERENCE_IID));
            case nsIDOMNode.ENTITY_NODE:
                return EntityImpl.getDOMInstance((nsIDOMEntity)
                        node.queryInterface(nsIDOMEntity.NS_IDOMENTITY_IID));
            case nsIDOMNode.PROCESSING_INSTRUCTION_NODE:
                return ProcessingInstructionImpl.getDOMInstance((nsIDOMProcessingInstruction)
                        node.queryInterface(nsIDOMProcessingInstruction.NS_IDOMPROCESSINGINSTRUCTION_IID));
            case nsIDOMNode.COMMENT_NODE:
                return CommentImpl.getDOMInstance((nsIDOMComment)
                        node.queryInterface(nsIDOMComment.NS_IDOMCOMMENT_IID));
            case nsIDOMNode.DOCUMENT_NODE:
                return DocumentImpl.getDOMInstance((nsIDOMDocument)
                        node.queryInterface(nsIDOMDocument.NS_IDOMDOCUMENT_IID));
            case nsIDOMNode.DOCUMENT_TYPE_NODE:
                return DocumentTypeImpl.getDOMInstance((nsIDOMDocumentType)
                        node.queryInterface(nsIDOMDocumentType.NS_IDOMDOCUMENTTYPE_IID));
            case nsIDOMNode.DOCUMENT_FRAGMENT_NODE:
                return DocumentFragmentImpl.getDOMInstance((nsIDOMDocumentFragment)
                        node.queryInterface(nsIDOMDocumentFragment.NS_IDOMDOCUMENTFRAGMENT_IID));
            case nsIDOMNode.NOTATION_NODE:
                return NotationImpl.getDOMInstance((nsIDOMNotation)
                        node.queryInterface(nsIDOMNotation.NS_IDOMNOTATION_IID));
            default:
                return NodeImpl.getDOMInstance(node);
        }
    }

    public static nsIDOMNode getnsIDOMNode(Node node)
    {
        if (node instanceof NodeImpl) {
            NodeImpl ni = (NodeImpl) node;
            return ni.getInstance();
        }
        else {
            return null;
        }
    }

    private static boolean toLower = true;

    public static boolean getConvertNodeNamesToLowerCase()
    {
        return toLower;
    }

    public static void setConvertNodeNamesToLowerCase(boolean convert)
    {
        toLower = convert;
    }

    private static boolean expandFrames = false;

    public static boolean getExpandFrames()
    {
        return expandFrames;
    }

    public static void setExpandFrames(boolean expand)
    {
        expandFrames = expand;
    }
}
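Finally, we need to modify the ElementImpl class. It has two methods, public String getAttribute(String name) and
public String getTagName(), both of which end by calling toLowerCase to convert their result to lower case. This can
cause problems: for example, an anchor may have an onclick attribute whose value contains JavaScript code, and if we need
to execute that code, the lower-cased copy may no longer work. So we modify the ElementImpl.java file so that both methods
return the result unchanged: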
...
    public String getAttribute(final String name)
    {
        //METHOD-BODY-START - autogenerated code
        Callable<String> c = new Callable<String>() { public String call() {
            String result = getInstanceAsnsIDOMElement().getAttribute(name);
            return result;
        }};
        return ThreadProxy.getSingleton().syncExec(c);
        //METHOD-BODY-END - autogenerated code
    }
...
    public String getTagName()
    {
        //METHOD-BODY-START - autogenerated code
        Callable<String> c = new Callable<String>() { public String call() {
            String result = getInstanceAsnsIDOMElement().getTagName();
            return result;
        }};
        return ThreadProxy.getSingleton().syncExec(c);
        //METHOD-BODY-END - autogenerated code
    }
...
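To see why lower-casing an attribute value is harmful, consider the small, self-contained sketch below. It does not use
mozdom4java at all: AttributeCaseSketch, lowerCased and untouched are hypothetical names, and the onclick value is made up
for illustration. It only contrasts the lower-casing behaviour described above with returning the value as-is.

```java
import java.util.Locale;

public class AttributeCaseSketch {

    /** Mimics the behaviour described above: the value is lower-cased. */
    public static String lowerCased(String attributeValue) {
        return attributeValue.toLowerCase(Locale.ROOT);
    }

    /** Mimics the patched behaviour: the value is returned untouched. */
    public static String untouched(String attributeValue) {
        return attributeValue;
    }

    public static void main(String[] args) {
        // Hypothetical onclick value containing case-sensitive JavaScript
        String onclick = "document.getElementById('menu').focus()";
        System.out.println(lowerCased(onclick));
        System.out.println(untouched(onclick));
    }
}
```

JavaScript identifiers are case-sensitive, so the lower-cased string refers to a getelementbyid function that does not
exist, while the untouched string can still be executed.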
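6.3 Applying our patch to convert the Mozilla DOM tree into a W3C DOM tree
Unpack the source code and download the patch from http://ladyr.es/wiki/attachment/wiki/XPCOMGuide/mozdom4java_patch.diff.
Then cd into the directory that contains src and run the following command:
For Linux users:
patch -p0 < mozdom4java_patch.diff
For Windows users:
Not available yet.
You can then compile the mozdom4java library, which requires the following jars: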
w3chtml.jar
w3cextension.jar
MozillaInterfaces.jar
MozillaGlue.jar
6.4 Testing the patched library