XULRunner with Java: JavaXPCOM Tutorial 3

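6 W3C DOM access to the loaded page

6.1 The mozdom4java library
  Accessing the W3C DOM tree is preferable to accessing Mozilla's own DOM tree, because the W3C DOM is the standard for dynamically accessing the DOM tree of HTML and XML documents. To do this we use a Java bridge from the Mozilla DOM to the W3C DOM, provided by a project called mozdom4java: http://mozdom4java.mozdev.org/index.html.
  After downloading the package, put its jar on the classpath. As an example, we add a button that extracts all the links from the loaded HTML document.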

 

    // When that button is pressed, then we obtain the HTML document corresponding to  
    // the URL loaded in browser. Next, we extract all its child nodes with 'a' tag name  
    // and print its content.  
    final ToolItem anchorItem = new ToolItem(toolbar, SWT.PUSH);  
    anchorItem.setImage(getImage("resources/anchors.png"));  
    anchorItem.addSelectionListener(new SelectionAdapter() {  
            public void widgetSelected(SelectionEvent event) {  
                     
                    // First, we obtain a Mozilla DOM Document representation  
                    nsIDOMDocument doc = browser.getDocument();  
                                     
                    // Get all anchors from the loaded HTML document  
                    nsIDOMNodeList nodeList = doc.getElementsByTagName("a");  
                    for ( int i = 0; i < nodeList.getLength(); i++ ){  
                                             
                            // Get Mozilla DOM node  
                            nsIDOMNode mozNode = nodeList.item(i);  
                                             
                            // Get the appropriate interface  
                            nsIDOMHTMLAnchorElement mozAnchor =  
                                    (nsIDOMHTMLAnchorElement) mozNode.queryInterface(  
                                                    nsIDOMHTMLAnchorElement.NS_IDOMHTMLANCHORELEMENT_IID);  
                                             
                            // Get the corresponding W3C DOM node  
                            HTMLAnchorElement a = (HTMLAnchorElement)  
                                    HTMLAnchorElementImpl.getDOMInstance(mozAnchor);  
                                                                             
                            // Test the HTML element  
                            System.out.println("Tag Name: " + a.getNodeName() + " -- Text: " + a.getTextContent()  
                                            + " -- Href: " + a.getHref());  
                                             
                    }  
                     
            }  
    });  
    ...  
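
For readers who want to see what the standard API looks like on its own, here is a minimal sketch that runs the same "collect every a element" loop against a plain W3C DOM document. It is not the tutorial's code: it assumes a well-formed XHTML string and uses the parser bundled with the JDK instead of Mozilla, and the class name W3CAnchorSketch is made up for illustration. Because a plain Element has no getHref(), the sketch reads the href attribute directly.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of mozdom4java.
class W3CAnchorSketch {

    // Parses a well-formed XHTML string with the JDK parser (assumption: no Mozilla
    // involved) and returns one "text -> href" entry per <a> element.
    static List<String> listAnchors(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

            // Same standard call the tutorial uses on the Mozilla document
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                result.add(a.getTextContent() + " -> " + a.getAttribute("href"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"http://example.org/\">Example</a>"
                + "<p>no link here</p><a href=\"/docs\">Docs</a></body></html>";
        for (String line : listAnchors(page)) {
            System.out.println(line);
        }
    }
}
```

Once a Mozilla node has been wrapped by mozdom4java, code written against org.w3c.dom in this style should not need to know where the node came from.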
 



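6.2 Patching mozdom4java to convert the Mozilla DOM tree into a W3C DOM tree
 
If we always want to work with the W3C DOM tree, converting nodes one at a time becomes tedious. We suggest modifying mozdom4java instead. In our view these changes simplify the code, because we can forget about Mozilla DOM nodes altogether. Later, when we discuss XPath, the evaluator will return a list of nodes, and W3C elements are more convenient to work with than Mozilla nodes. In other words, our goal is a usable web browser that can be driven in a standard way, without any knowledge of the Mozilla implementation.

First, we need the Java Language Binding for the DOM Level 2 specification. The better option is to download the jars from the mozdom4java project, http://www.mozdev.org/source/browse/mozdom4java/src/jars/, because they contain all the required files, including the hand-written extensions, so there is nothing else to take care of. We also need the Mozilla interfaces. The required files are:
  w3chtml.jar contains the W3C DOM HTML Level 2 interfaces, split into two packages: org.w3c.dom.html and org.w3c.dom.html2
  w3cextension.jar contains the KeyEvent class in the org.w3c.dom.events package
  MozillaInterfaces.jar
  MozillaGlue.jar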
 
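  Once these jars are on the classpath, mozdom4java should compile cleanly (no errors, possibly a few warnings). Next we modify the mozdom4java source code, explaining the changes file by file. Of course, you can also download the already-patched jar directly.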
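
Before compiling, it can help to confirm that the jars are actually visible. The following is a small sketch, not part of the tutorial: the class name ClasspathCheckSketch is hypothetical, and the probed class names are the ones mentioned in this text (no class from MozillaGlue.jar is named here, so that jar is not probed). A positive result only shows that some copy of the class is visible; some JDKs ship their own org.w3c.dom.html classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for checking the build classpath.
class ClasspathCheckSketch {

    // Returns true if the named class can be located by this class's loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // jar name -> a class this tutorial says (or shows) it contains
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("w3chtml.jar", "org.w3c.dom.html.HTMLElement");
        probes.put("w3cextension.jar", "org.w3c.dom.events.KeyEvent");
        probes.put("MozillaInterfaces.jar", "org.mozilla.interfaces.nsIDOMNode");

        for (Map.Entry<String, String> probe : probes.entrySet()) {
            String status = isOnClasspath(probe.getValue()) ? "found" : "MISSING";
            System.out.println(probe.getKey() + " (" + probe.getValue() + "): " + status);
        }
    }
}
```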
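  To patch the library by hand, follow the steps below.
  We first create a factory for HTML elements: a class that converts Mozilla DOM element nodes into the corresponding W3C DOM element nodes. The class below does exactly that and is heavily commented. It relies on Java reflection, so callers do not need to know anything about Mozilla DOM nodes.
  Note: the code is long but actually very simple. It just uses reflection to call the getDOMInstance method shown earlier.

    package es.ladyr.dom;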

    import java.lang.reflect.Field;  
    import java.lang.reflect.Method;  
    import java.util.HashMap;  
    import java.util.Map;  
    import org.mozilla.interfaces.*;  
    import org.w3c.dom.html.HTMLElement;  
    public class HTMLElementFactory {  
            private static HTMLElementFactory instance;  
             
            private Map<String, String> corresp;  
                     
            private HTMLElementFactory() {  
                    initCorrespondence();  
            }  
             
            public static HTMLElementFactory getInstance(){  
                    if(instance == null){  
                            instance = new HTMLElementFactory();  
                    }  
                    return instance;  
            }  
             
            public static HTMLElement getHTMLElement(nsIDOMNode nsNode) {  
                    return getInstance().getConcreteNode(nsNode);  
            }  
             
            private void initCorrespondence() {  
                    corresp = new HashMap<String, String>();  
                    corresp.put("a", "Anchor");  
                    corresp.put("applet", "Applet");  
                    corresp.put("area", "Area");  
                    corresp.put("base", "Base");  
                    corresp.put("basefont", "BaseFont");  
                    corresp.put("body", "Body");  
                    corresp.put("br", "BR");  
                    corresp.put("button", "Button");  
                    corresp.put("dir", "Directory");  
                    corresp.put("div", "Div");  
                    corresp.put("dl", "DList");  
                    corresp.put("fieldset", "FieldSet");  
                    corresp.put("font", "Font");  
                    corresp.put("form", "Form");  
                    corresp.put("frame", "Frame");  
                    corresp.put("frameset", "FrameSet");  
                    corresp.put("head", "Head");  
                    corresp.put("h1", "Heading");  
                    corresp.put("h2", "Heading");  
                    corresp.put("h3", "Heading");  
                    corresp.put("h4", "Heading");  
                    corresp.put("h5", "Heading");  
                    corresp.put("h6", "Heading");  
                    corresp.put("hr", "HR");  
                    corresp.put("html", "Html");  
                    corresp.put("iframe", "IFrame");  
                    corresp.put("img", "Image");  
                    corresp.put("input", "Input");  
                    corresp.put("isindex", "IsIndex");  
                    corresp.put("label", "Label");  
                    corresp.put("legend", "Legend");  
                    corresp.put("li", "LI");  
                    corresp.put("link", "Link");  
                    corresp.put("map", "Map");  
                    corresp.put("menu", "Menu");  
                    corresp.put("meta", "Meta");  
                    corresp.put("ins", "Mod");  
                    corresp.put("del", "Mod");  
                    corresp.put("object", "Object");  
                    corresp.put("ol", "OList");  
                    corresp.put("optgroup", "OptGroup");  
                    corresp.put("option", "Option");  
                    corresp.put("p", "Paragraph");  
                    corresp.put("param", "Param");  
                    corresp.put("pre", "Pre");  
                    corresp.put("q", "Quote");  
                    corresp.put("script", "Script");  
                    corresp.put("select", "Select");  
                    corresp.put("style", "Style");  
                    corresp.put("caption", "TableCaption");  
                    corresp.put("td", "TableCell");  
                    corresp.put("col", "TableCol");  
                    corresp.put("table", "Table");  
                    corresp.put("tr", "TableRow");  
                    corresp.put("thead", "TableSection");  
                    corresp.put("tfoot", "TableSection");  
                    corresp.put("tbody", "TableSection");  
                    corresp.put("textarea", "TextArea");  
                    corresp.put("title", "Title");  
                    corresp.put("ul", "UList");  
            }  
            /** 
             * Try to convert a Mozilla DOM node into W3C DOM element. 
             * 
             * @param nsNode        node to convert into W3C DOM element. 
             * @return      W3C HTML element corresponding to a Mozilla DOM node. 
             */  
            public HTMLElement getConcreteNode(nsIDOMNode nsNode) {  
                    // Only converts element nodes. If the mozilla node  
                    // isn't a Mozilla DOM element, we cannot convert into  
                    // an W3C DOM element  
                    if (nsNode.getNodeType() == nsIDOMNode.ELEMENT_NODE) {  
                            // We use a hashmap to obtain element names from node names  
                            String htmlElementType = corresp.get(nsNode.getNodeName()  
                                            .toLowerCase());  
                             
                            // If we don't know the element type, we cannot transform  
                            // that node into W3C DOM element  
                            if(htmlElementType == null){  
                                    return null;  
                            }  
                             
                            // Compose the class name for the Mozilla DOM element.  
                            String nsClassName = "org.mozilla.interfaces.nsIDOMHTML"  
                                            + htmlElementType + "Element";  
                             
                            // Compose the field name for the element IID  
                            String nsFieldInterfaceName = "NS_IDOMHTML"  
                                            + htmlElementType.toUpperCase() + "ELEMENT_IID";  
                            try {  
                                    // Once we have their names, obtain the class and the field  
                                    Class<?> nsClass = Class.forName(nsClassName);  
                                    Field field = nsClass.getField(nsFieldInterfaceName);  
                                    // Get the field value (it is a static field, so the argument is ignored)  
                                    String iid = (String) field.get(null);  
                                     
                                    // Get the appropriate node interface  
                                    Object nsElement = nsNode.queryInterface(iid);  
                                                                     
                                    // Build the W3C DOM Element implementation class name  
                                    // (the package org.mozilla.dom.html contains concrete implementations  
                                    // for the W3C HTML element interfaces)  
                                    String w3cClassName = "org.mozilla.dom.html.HTML"  
                                                    + htmlElementType + "ElementImpl";  
                                     
                                    // Obtain the class for the corresponding W3C DOM Element implementation  
                                    Class w3cClass = Class.forName(w3cClassName);  
                                     
                                    // Extract the method that must be invoked to transform the element  
                                    Method creationMethod = w3cClass.getMethod("getDOMInstance", nsClass);  
                                     
                                    // Invokes getDOMInstance method of corresponding W3C HTML element  
                                    //  which returns an instance of corresponding W3C HTML element  
                                    HTMLElement node = (HTMLElement) creationMethod.invoke(null, nsElement);  
                                    return node;  
                            } catch (Exception e) {  
                                    throw new Error(e);  
                            }  
                    }  
                    return null;  
            }  
    }  
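
The reflection steps the factory relies on can be tried in isolation. The sketch below first composes the three names the factory derives from an element type, then performs the same two reflective operations (read a public static field, call a public static method) on java.lang.Integer as a stand-in, since the Mozilla classes are not assumed to be on the classpath here. The class name ReflectionLookupSketch and its helper methods are hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.Locale;

// Hypothetical demo class; mirrors the factory's steps without needing Mozilla jars.
class ReflectionLookupSketch {

    // Builds the interface name, IID field name and implementation class name
    // for an element type such as "Anchor", following the factory's naming scheme.
    static String[] composeNames(String elementType) {
        return new String[] {
            "org.mozilla.interfaces.nsIDOMHTML" + elementType + "Element",
            "NS_IDOMHTML" + elementType.toUpperCase(Locale.ROOT) + "ELEMENT_IID",
            "org.mozilla.dom.html.HTML" + elementType + "ElementImpl"
        };
    }

    // Reads a public static field by name (the factory does this to get the IID).
    static Object readStaticField(String className, String fieldName) {
        try {
            Class<?> type = Class.forName(className);
            Field field = type.getField(fieldName);
            return field.get(null); // null receiver: the field is static
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Calls a public static one-argument method by name
    // (the factory does this for getDOMInstance).
    static Object callStatic(String className, String methodName,
                             Class<?> paramType, Object arg) {
        try {
            Method method = Class.forName(className).getMethod(methodName, paramType);
            return method.invoke(null, arg); // null receiver: the method is static
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String name : composeNames("Anchor")) {
            System.out.println(name);
        }
        // Stand-ins from the JDK, used only to exercise the two helpers
        System.out.println(readStaticField("java.lang.Integer", "MAX_VALUE"));
        System.out.println(callStatic("java.lang.Integer", "valueOf", String.class, "42"));
    }
}
```

One design consequence is worth keeping in mind: because the names are assembled from strings, a wrong entry in the tag table only shows up at run time, as a ClassNotFoundException or NoSuchFieldException.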
 



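With our HTMLElementFactory class in place, we now modify the NodeFactory class. After the change you can call org.w3c.dom.Node getNodeInstance(nsIDOMNode node), and when the input node is of type nsIDOMNode.ELEMENT_NODE the method returns the corresponding W3C DOM element.
The modified code is as follows: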

    ...  
          // Import our factory to create W3C HTML elements from Mozilla DOM elements  
          import es.ladyr.dom.HTMLElementFactory;  
    ...  
            public static Node getNodeInstance( nsIDOMNode node )  
            {  
            if (node == null) {  
                    return null;  
            }  
             
            switch ( node.getNodeType() )  
            {  
                case nsIDOMNode.ELEMENT_NODE:  
                    // Use our factory to obtain a W3C HTML DOM element  
                    Node htmlElement = HTMLElementFactory.getHTMLElement(node);  
                    if (htmlElement != null) {  
                            return htmlElement;  
                    } else {  
                            // If factory cannot convert the concrete node (for instance,  
                            // the type is unknown for our factory implementation), then  
                            // returns a generic W3C DOM element  
                            return ElementImpl.getDOMInstance((nsIDOMElement) node  
                                            .queryInterface(nsIDOMElement.NS_IDOMELEMENT_IID));  
                    }  
    ...  

 


Below is the complete code of the NodeFactory class:

    /* ***** BEGIN LICENSE BLOCK ***** 
     * Version: MPL 1.1/GPL 2.0/LGPL 2.1 
     * 
     * The contents of this file are subject to the Mozilla Public License Version 
     * 1.1 (the "License"); you may not use this file except in compliance with 
     * the License. You may obtain a copy of the License at 
     * http://www.mozilla.org/MPL/ 
     * 
     * Software distributed under the License is distributed on an "AS IS" basis, 
     * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License 
     * for the specific language governing rights and limitations under the 
     * License. 
     * 
     * The Original Code is mozdom4java 
     * 
     * The Initial Developer of the Original Code is 
     * Peter Szinek, Lixto Software GmbH, http://www.lixto.com. 
     * Portions created by the Initial Developer are Copyright (C) 2005-2006 
     * the Initial Developer. All Rights Reserved. 
     * 
     * Contributor(s): 
     *  Peter Szinek ([email protected]) 
     *  Michal Ceresna ([email protected]) 
     * 
     * Alternatively, the contents of this file may be used under the terms of 
     * either the GNU General Public License Version 2 or later (the "GPL"), or 
     * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"), 
     * in which case the provisions of the GPL or the LGPL are applicable instead 
     * of those above. If you wish to allow use of your version of this file only 
     * under the terms of either the GPL or the LGPL, and not to allow others to 
     * use your version of this file under the terms of the MPL, indicate your 
     * decision by deleting the provisions above and replace them with the notice 
     * and other provisions required by the GPL or the LGPL. If you do not delete 
     * the provisions above, a recipient may use your version of this file under 
     * the terms of any one of the MPL, the GPL or the LGPL. 
     * 
     * ***** END LICENSE BLOCK ***** */  
    import org.w3c.dom.Node;  
    import org.mozilla.dom.*;  
    import org.mozilla.interfaces.*;  
    // Our factory, needed by the ELEMENT_NODE case below  
    import es.ladyr.dom.HTMLElementFactory;  
    public class NodeFactory  
    {  
            private NodeFactory()  
            {}  
             
            public static Node getNodeInstance( nsIDOMEventTarget eventTarget )  
            {  
                    if (eventTarget == null ) {  
                            return null;  
                    }  
            nsIDOMNode node = (nsIDOMNode) eventTarget.queryInterface(nsIDOMNode.NS_IDOMNODE_IID);  
            return getNodeInstance(node);  
        }  
        
        public static Node getNodeInstance( nsIDOMNode node )  
        {  
            if (node == null) {  
                    return null;  
            }  
             
            switch ( node.getNodeType() )  
            {  
                case nsIDOMNode.ELEMENT_NODE:  
                    // Use our factory to obtain a W3C HTML DOM element  
                    Node htmlElement = HTMLElementFactory.getHTMLElement(node);  
                    if (htmlElement != null) {  
                            return htmlElement;  
                    } else {  
                            // If factory cannot convert the concrete node (for instance,  
                            // the type is unknown for our factory implementation), then  
                            // returns a generic W3C DOM element  
                            return ElementImpl.getDOMInstance((nsIDOMElement) node  
                                            .queryInterface(nsIDOMElement.NS_IDOMELEMENT_IID));  
                    }  
                case nsIDOMNode.ATTRIBUTE_NODE:  
                    return AttrImpl.getDOMInstance((nsIDOMAttr) node  
                                    .queryInterface(nsIDOMAttr.NS_IDOMATTR_IID));  
                case nsIDOMNode.TEXT_NODE:  
                    return TextImpl.getDOMInstance((nsIDOMText) node  
                                    .queryInterface(nsIDOMText.NS_IDOMTEXT_IID));  
                case nsIDOMNode.CDATA_SECTION_NODE:  
                    return CDATASectionImpl.getDOMInstance((nsIDOMCDATASection) node  
                                    .queryInterface(nsIDOMCDATASection.NS_IDOMCDATASECTION_IID));  
                case nsIDOMNode.ENTITY_REFERENCE_NODE:  
                    return EntityReferenceImpl.getDOMInstance((nsIDOMEntityReference) node  
                                    .queryInterface(nsIDOMEntityReference.NS_IDOMENTITYREFERENCE_IID));  
                case nsIDOMNode.ENTITY_NODE:  
                    return EntityImpl.getDOMInstance((nsIDOMEntity) node  
                                    .queryInterface(nsIDOMEntity.NS_IDOMENTITY_IID));  
                case nsIDOMNode.PROCESSING_INSTRUCTION_NODE:  
                    return ProcessingInstructionImpl.getDOMInstance((nsIDOMProcessingInstruction) node  
                                    .queryInterface(nsIDOMProcessingInstruction.NS_IDOMPROCESSINGINSTRUCTION_IID));  
                case nsIDOMNode.COMMENT_NODE:  
                    return CommentImpl.getDOMInstance((nsIDOMComment) node  
                                    .queryInterface(nsIDOMComment.NS_IDOMCOMMENT_IID));  
                case nsIDOMNode.DOCUMENT_NODE:  
                    return DocumentImpl.getDOMInstance((nsIDOMDocument) node  
                                    .queryInterface(nsIDOMDocument.NS_IDOMDOCUMENT_IID));  
                case nsIDOMNode.DOCUMENT_TYPE_NODE:  
                    return DocumentTypeImpl.getDOMInstance((nsIDOMDocumentType) node  
                                    .queryInterface(nsIDOMDocumentType.NS_IDOMDOCUMENTTYPE_IID));  
                case nsIDOMNode.DOCUMENT_FRAGMENT_NODE:  
                    return DocumentFragmentImpl.getDOMInstance((nsIDOMDocumentFragment) node  
                                    .queryInterface(nsIDOMDocumentFragment.NS_IDOMDOCUMENTFRAGMENT_IID));  
                case nsIDOMNode.NOTATION_NODE:  
                    return NotationImpl.getDOMInstance((nsIDOMNotation) node  
                                    .queryInterface(nsIDOMNotation.NS_IDOMNOTATION_IID));  
                default: return NodeImpl.getDOMInstance(node);  
            }  
        }  
         
        public static nsIDOMNode getnsIDOMNode( Node node )  
        {  
            if (node instanceof NodeImpl) {  
                NodeImpl ni = (NodeImpl) node;  
                return ni.getInstance();  
            }  
            else {       
                return null;  
            }  
        }  
         
        private static boolean toLower = true;  
        public static boolean getConvertNodeNamesToLowerCase()  
        {  
            return toLower;  
        }  
        public static void setConvertNodeNamesToLowerCase(boolean convert)  
        {  
            toLower = convert;  
        }  
        private static boolean expandFrames = false;  
        public static boolean getExpandFrames()  
        {  
            return expandFrames;  
        }  
        public static void setExpandFrames(boolean expand)  
        {  
            expandFrames = expand;  
        }  
    }  
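
The shape of getNodeInstance (switch on the node type, try a specialised conversion for elements, fall back to a generic one) can be sketched with nothing but the JDK's W3C DOM classes. This is an illustration of the dispatch pattern under that assumption, not mozdom4java code; NodeDispatchSketch, specialisedName and describe are hypothetical names, and the "specialised" table knows only the a tag.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo of the "specialised first, generic fallback" dispatch.
class NodeDispatchSketch {

    // Stand-in for HTMLElementFactory: returns null for tags it does not know.
    static String specialisedName(String tagName) {
        return "a".equalsIgnoreCase(tagName) ? "anchor element" : null;
    }

    // Stand-in for NodeFactory.getNodeInstance: dispatches on the node type.
    static String describe(Node node) {
        if (node == null) {
            return "null";
        }
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                String special = specialisedName(node.getNodeName());
                // Fall back to a generic description when the tag is unknown
                return special != null ? special
                        : "generic element <" + node.getNodeName() + ">";
            case Node.TEXT_NODE:
                return "text";
            case Node.COMMENT_NODE:
                return "comment";
            case Node.DOCUMENT_NODE:
                return "document";
            default:
                return "other node";
        }
    }

    // Parses well-formed markup with the JDK parser (assumption: no Mozilla involved).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample markup", e);
        }
    }

    public static void main(String[] args) {
        Node root = parse("<p><a href=\"#\">x</a>hi<!--c--><b>y</b></p>").getDocumentElement();
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            System.out.println(describe(child));
        }
    }
}
```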

 



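Finally, we need to modify the ElementImpl class. Two of its methods, public String getAttribute(String name) and public String getTagName(), finish by calling toLowerCase to convert their result to lower case. This can cause problems. For example, an anchor may have an onclick attribute whose value contains JavaScript code; JavaScript is case-sensitive, so if we need to execute that code, the lower-cased version may no longer work. We therefore modify ElementImpl.java so that both methods return the value unchanged: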

    ...  
        public String getAttribute(final String name)  
        {  
            //METHOD-BODY-START - autogenerated code  
            Callable<String> c = new Callable<String>() { public String call() {  
                String result = getInstanceAsnsIDOMElement().getAttribute(name);  
                return result;  
            }};  
            return ThreadProxy.getSingleton().syncExec(c);  
            //METHOD-BODY-END - autogenerated code  
        }  
          ...  
        public String getTagName()  
        {  
            //METHOD-BODY-START - autogenerated code  
            Callable<String> c = new Callable<String>() { public String call() {  
                String result = getInstanceAsnsIDOMElement().getTagName();  
                return result;  
            }};  
            return ThreadProxy.getSingleton().syncExec(c);  
            //METHOD-BODY-END - autogenerated code  
        }  
          ...  
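
To see why lower-casing an attribute value is risky, consider this small sketch. The onclick value and the function name openMenu are invented for illustration, and AttributeCaseSketch is a hypothetical class; the point is only that the lower-cased string refers to a different JavaScript identifier and a different string literal.

```java
import java.util.Locale;

// Hypothetical demo of what the unpatched getAttribute would do to a handler.
class AttributeCaseSketch {

    // Mimics the unpatched behaviour: the attribute value is lower-cased.
    static String lowerCased(String attributeValue) {
        return attributeValue.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Invented handler: openMenu is an assumed page-defined function
        String onclick = "openMenu('MainMenu'); return false;";
        String mangled = lowerCased(onclick);

        System.out.println("original:    " + onclick);
        System.out.println("lower-cased: " + mangled);
        // JavaScript identifiers and string literals are case-sensitive,
        // so the two snippets are not interchangeable.
        System.out.println("same code?   " + onclick.equals(mangled));
    }
}
```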

 


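6.3 Installing our patch to convert the Mozilla DOM tree into a W3C DOM tree
Unpack the source code and download the patch from http://ladyr.es/wiki/attachment/wiki/XPCOMGuide/mozdom4java_patch.diff. Then cd into the directory that contains src and run the following command:
   For Linux users:
     patch -p0 < mozdom4java_patch.diff
   For Windows users:
     Not available yet
   You can then compile the mozdom4java library, which requires the following jars: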
     w3chtml.jar
     w3cextension.jar
     MozillaInterfaces.jar
     MozillaGlue.jar
   
6.4 Testing the patched library

    ...  
    import org.mozilla.dom.NodeFactory;  
    import org.mozilla.interfaces.*;  
    import org.w3c.dom.html.HTMLAnchorElement;  
    import org.w3c.dom.html.HTMLElement;  
    ...  
                    final ToolItem anchorItem = new ToolItem(toolbar, SWT.PUSH);  
                    anchorItem.setImage(getImage("resources/anchors.png"));  
                    anchorItem.addSelectionListener(new SelectionAdapter() {  
                            public void widgetSelected(SelectionEvent event) {  
                             
                                    // First, we obtain a Mozilla DOM Document representation  
                                    nsIWebBrowser webBrowser = (nsIWebBrowser)browser.getWebBrowser();  
                                    if (webBrowser == null) {  
                                            System.out.println("Could not get the nsIWebBrowser from the Browser widget");  
                                            // Nothing to inspect without a browser; avoids a NullPointerException below  
                                            return;  
                                    }  
                             
                                    nsIDOMWindow window = webBrowser.getContentDOMWindow();  
                                    nsIDOMDocument doc = window.getDocument();  
                                    System.out.println(doc);  
                             
                            // Get all anchors from the loaded HTML document  
                            nsIDOMNodeList nodeList = doc.getElementsByTagName("a");  
                             
                            analyzeAnchors(nodeList);  
                             
                    }  
                            private void analyzeAnchors(nsIDOMNodeList nodeList) {  
                            for (int i = 0; i < nodeList.getLength(); i++) {  
                                    // Get Mozilla DOM node  
                                    nsIDOMNode mozNode = nodeList.item(i);  
                                    // We assume that the NodeList contains only HTMLElements,  
                                    // because we only call this method on HTML nodes  
                                    // (NodeFactory.getNodeInstance may return other Node  
                                    //  subtypes, depending on the input Mozilla DOM node)  
                                    HTMLElement htmlElement = (HTMLElement) NodeFactory.getNodeInstance(mozNode);  
                                    // We only are interested in anchors  
                                    if (htmlElement instanceof HTMLAnchorElement) {  
                                                     
                                            HTMLAnchorElement a = (HTMLAnchorElement) htmlElement;  
                                                     
                                            // Test the HTML element  
                                            System.out.println("Tag Name: " + a.getNodeName()  
                                                            + " -- Text: " + a.getTextContent()  
                                                            + " -- Href: " + a.getHref());  
                                    }  
                            }  
                    }  
                     
            });  
 
