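There are already plenty of detailed introductions to using HtmlParser. I recently read through the HtmlParser source code, and this post summarizes its design to share with others. The focus here is on design only.
I. Filter Design
NodeFilter is one of HtmlParser's main ways of extracting nodes. Its structure is flexible: by composing interpreters (filters), a client can locate any node on a page.
1. First, a test case: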
/**
 * Test and filtering.
 */
public void testAnd () throws ParserException
{
    String guts;
    String html;
    NodeList list;

    guts = "<body>Now is the <a id=one><b>time</b></a> for all good <a id=two><b>men</b></a>..</body>";
    html = "<html>" + guts + "</html>";
    createParser (html);
    list = parser.extractAllNodesThatMatch (
        new AndFilter (
            new HasChildFilter (
                new TagNameFilter ("b")),
            new HasChildFilter (
                new StringFilter ("men")))
        );
    assertEquals ("only one element", 1, list.size ());
    assertType ("should be LinkTag", LinkTag.class, list.elementAt (0));
    LinkTag link = (LinkTag)list.elementAt (0);
    assertEquals ("attribute value", "two", link.getAttribute ("id"));
}
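2. NodeFilter structure diagram
3. The design pattern used
The main job of the NodeFilter interface is to decide whether a given node is one the client is looking for, returning a boolean. As the diagram above also shows, the interface has only one method:
boolean accept (Node node); // takes a single parameter of type Node
Here the HtmlParser author uses the Interpreter pattern to implement this filtering.
We first review the Interpreter pattern, then read the author's source code against it to appreciate how flexible the design is.
Given a language, the Interpreter pattern defines a representation of its grammar and also provides an interpreter. Clients use the interpreter to interpret sentences in that language.
A few key points about the Interpreter pattern:
1. Deciding when to apply it is the hard part. The pattern fits only when business rules change frequently, similar patterns keep recurring, and the problem is easy to abstract into grammar rules.
2. Representing the grammar rules with the Interpreter pattern lets you use object-oriented techniques to "extend" the grammar easily.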
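To make the pattern concrete before turning to HtmlParser, here is a minimal sketch of an interpreter for boolean expressions. All names in it (`InterpreterSketch`, `Expr`, `Var`, `And`, `Not`) are hypothetical and chosen for illustration; none of them come from HtmlParser.

```java
import java.util.Map;

class InterpreterSketch {
    // Abstract expression: every grammar rule can interpret itself against a context.
    interface Expr {
        boolean interpret(Map<String, Boolean> context);
    }

    // Terminal expression: looks up a named boolean in the context (missing names count as false).
    static final class Var implements Expr {
        private final String name;
        Var(String name) { this.name = name; }
        public boolean interpret(Map<String, Boolean> context) {
            return context.getOrDefault(name, false);
        }
    }

    // Non-terminal expression: true only if both operands are true.
    static final class And implements Expr {
        private final Expr left;
        private final Expr right;
        And(Expr left, Expr right) { this.left = left; this.right = right; }
        public boolean interpret(Map<String, Boolean> context) {
            return left.interpret(context) && right.interpret(context);
        }
    }

    // Non-terminal expression: negates its operand.
    static final class Not implements Expr {
        private final Expr operand;
        Not(Expr operand) { this.operand = operand; }
        public boolean interpret(Map<String, Boolean> context) {
            return !operand.interpret(context);
        }
    }

    public static void main(String[] args) {
        // The "sentence" is an object tree: isTag AND NOT isEndTag.
        Expr rule = new And(new Var("isTag"), new Not(new Var("isEndTag")));
        System.out.println(rule.interpret(Map.of("isTag", true, "isEndTag", false)));
        System.out.println(rule.interpret(Map.of("isTag", true, "isEndTag", true)));
    }
}
```

Adding a new rule to this small grammar means adding one more class that implements `Expr`; existing classes stay untouched, which is the extensibility that point 2 describes.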
4. How HtmlParser's NodeFilter applies the Interpreter pattern
The abstract expression role:
public interface NodeFilter extends Serializable, Cloneable {
    /**
     * Predicate to determine whether or not to keep the given node.
     * The behaviour based on this outcome is determined by the context
     * in which it is called. It may lead to the node being added to a list
     * or printed out. See the calling routine for details.
     * @return <code>true</code> if the node is to be kept, <code>false</code>
     * if it is to be discarded.
     * @param node The node to test.
     */
    boolean accept (Node node);
}
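Next, look at how the logical AND operation is implemented. It combines two filters with a logical AND into a single boolean expression. The code is as follows: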
/**
 * Accepts nodes matching all of its predicate filters (AND operation).
 */
public class AndFilter implements NodeFilter {
    protected NodeFilter[] mPredicates;

    /**
     * Creates an AndFilter that accepts nodes acceptable to both filters.
     *
     * @param left One filter.
     * @param right The other filter.
     */
    public AndFilter(NodeFilter left, NodeFilter right) {
        NodeFilter[] predicates;

        predicates = new NodeFilter[2];
        predicates[0] = left;
        predicates[1] = right;
        setPredicates(predicates);
    }

    public void setPredicates(NodeFilter[] predicates) {
        if (null == predicates)
            predicates = new NodeFilter[0];
        mPredicates = predicates;
    }

    public boolean accept(Node node) {
        boolean ret;

        ret = true;

        for (int i = 0; ret && (i < mPredicates.length); i++)
            if (!mPredicates[i].accept(node)) // delegate to each interpreter this filter was built from
                ret = false;

        return (ret);
    }
}
Now look at another filter used in the test case, HasChildFilter. Its code is as follows:
public class HasChildFilter implements NodeFilter {
    protected NodeFilter mChildFilter;

    protected boolean mRecursive;

    public HasChildFilter(NodeFilter filter) {
        this(filter, false);
    }

    public HasChildFilter(NodeFilter filter, boolean recursive) {
        mChildFilter = filter;
        mRecursive = recursive;
    }

    public boolean accept(Node node) {
        CompositeTag tag; // ?1
        NodeList children;
        boolean ret;

        ret = false;
        if (node instanceof CompositeTag) {
            tag = (CompositeTag) node;
            children = tag.getChildren();
            if (null != children) {
                for (int i = 0; !ret && i < children.size(); i++)
                    if (mChildFilter.accept(children.elementAt(i))) // check whether a direct child matches
                        ret = true;
                // do recursion after all children are checked
                // to get breadth first traversal
                if (!ret && mRecursive) // search the next level down
                    for (int i = 0; !ret && i < children.size(); i++)
                        if (accept(children.elementAt(i)))
                            ret = true;
            }
        }

        return (ret);
    }
}
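The effect of the recursive flag is easiest to see on a small tree. The sketch below is a simplified stand-in, not HtmlParser code: `HasChildSketch`, `TreeNode` and `hasChild` are hypothetical names, and `java.util.function.Predicate` plays the role of the child filter. It mirrors the shape of the markup `<a id=two><b>men</b></a>` from the test case, where the text is a grandchild of the link rather than a direct child.

```java
import java.util.List;
import java.util.function.Predicate;

class HasChildSketch {
    // Hypothetical stand-in for a parsed node: a label plus its child nodes.
    static final class TreeNode {
        final String label;
        final List<TreeNode> children;
        TreeNode(String label, TreeNode... children) {
            this.label = label;
            this.children = List.of(children);
        }
    }

    // True if a direct child passes childTest; with recursive set, deeper descendants count too.
    static boolean hasChild(TreeNode node, Predicate<TreeNode> childTest, boolean recursive) {
        // First pass: direct children only.
        for (TreeNode child : node.children) {
            if (childTest.test(child)) {
                return true;
            }
        }
        // Second pass: descend only after every direct child has been checked.
        if (recursive) {
            for (TreeNode child : node.children) {
                if (hasChild(child, childTest, true)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // a -> b -> "men": the text sits two levels below the link.
        TreeNode link = new TreeNode("a", new TreeNode("b", new TreeNode("men")));
        Predicate<TreeNode> isMen = n -> n.label.equals("men");
        System.out.println(hasChild(link, isMen, false)); // direct children only
        System.out.println(hasChild(link, isMen, true));  // descendants as well
    }
}
```

In this sketch the non-recursive call reports false for the link and the recursive call reports true, which is worth keeping in mind when choosing between the one-argument constructor (which passes false) and the two-argument one.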
The code for TagNameFilter is as follows:
public class TagNameFilter implements NodeFilter {
    protected String mName;

    public TagNameFilter(String name) {
        mName = name.toUpperCase(Locale.ENGLISH);
    }

    public boolean accept(Node node) {
        return ((node instanceof Tag)
            && !((Tag) node).isEndTag()
            && ((Tag) node).getTagName().equals(mName));
    }
}
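The other 13 NodeFilter implementations wrap different business logic in the same way, and it is very easy to add another implementation to express a new "grammar" rule.
Clients can then assemble the interpreters freely and run them, which lets users apply their own logic to find any node in an HTML document.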
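As a rough illustration of adding a new rule and composing it with existing ones, the sketch below uses simplified stand-in types so that it runs on its own. `CustomFilterSketch`, `MiniNode`, `MiniFilter`, `NameIs`, `IdStartsWith` and `Both` are hypothetical names, not HtmlParser classes; with the real library the same idea would presumably be expressed by implementing NodeFilter.

```java
import java.util.ArrayList;
import java.util.List;

class CustomFilterSketch {
    // Hypothetical stand-in for a parsed tag: only a tag name and an id attribute.
    static final class MiniNode {
        final String tagName;
        final String id;
        MiniNode(String tagName, String id) {
            this.tagName = tagName;
            this.id = id;
        }
    }

    // Same single-method shape as the NodeFilter interface quoted earlier.
    interface MiniFilter {
        boolean accept(MiniNode node);
    }

    // Existing rule: match on tag name.
    static final class NameIs implements MiniFilter {
        private final String name;
        NameIs(String name) { this.name = name; }
        public boolean accept(MiniNode node) { return node.tagName.equals(name); }
    }

    // New rule added without touching any other class: id starts with a prefix.
    static final class IdStartsWith implements MiniFilter {
        private final String prefix;
        IdStartsWith(String prefix) { this.prefix = prefix; }
        public boolean accept(MiniNode node) {
            return node.id != null && node.id.startsWith(prefix);
        }
    }

    // Combining rule: both operands must accept the node.
    static final class Both implements MiniFilter {
        private final MiniFilter left;
        private final MiniFilter right;
        Both(MiniFilter left, MiniFilter right) { this.left = left; this.right = right; }
        public boolean accept(MiniNode node) { return left.accept(node) && right.accept(node); }
    }

    // Keeps the nodes accepted by the filter, in their original order.
    static List<MiniNode> select(List<MiniNode> nodes, MiniFilter filter) {
        List<MiniNode> kept = new ArrayList<>();
        for (MiniNode node : nodes) {
            if (filter.accept(node)) {
                kept.add(node);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<MiniNode> nodes = List.of(
            new MiniNode("A", "nav-home"),
            new MiniNode("A", "footer-link"),
            new MiniNode("DIV", "nav-bar"));
        MiniFilter rule = new Both(new NameIs("A"), new IdStartsWith("nav-"));
        for (MiniNode node : select(nodes, rule)) {
            System.out.println(node.tagName + " " + node.id);
        }
    }
}
```

The client code in `main` never changes when a rule is added; it only builds a different object tree.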
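How HtmlParser stores the HTML structure is not explored here. It is enough to know that it provides an iterator that can traverse all the nodes. (In fact, HtmlParser scans the input character by character, maps it to Node objects, and records each position as a column and row.)
5. Client-side invocation in HtmlParser
Now look at extractAllNodesThatMatch() in the Parser class, which the test case calls.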
Parser:
public class Parser implements Serializable {
    ... ....

    /**
     * Extract all nodes matching the given filter.
     */
    public NodeList extractAllNodesThatMatch (NodeFilter filter) throws ParserException {
        NodeIterator e;
        NodeList ret;

        ret = new NodeList ();
        for (e = elements (); e.hasMoreNodes (); ) // elements() returns a simple iterator used to traverse all nodes
            e.nextNode ().collectInto (ret, filter);

        return (ret);
    }
    ... ...
}
AbstractNode:
public abstract class AbstractNode implements Node, Serializable {
    ... ...
    public void collectInto (NodeList list, NodeFilter filter) {
        if (filter.accept (this))
            list.add (this);
    }
    ... ...
}
public class CompositeTag extends TagNode { // TagNode extends AbstractNode, AbstractNode implements Node
    ... ...
    public void collectInto (NodeList list, NodeFilter filter) {
        super.collectInto (list, filter); // AbstractNode.collectInto: test this node itself
        for (SimpleNodeIterator e = children(); e.hasMoreNodes ();) {
            // e.nextNode() returns a Node, so for a composite child this call re-enters
            // collectInto(): the tree is traversed recursively, every node is filtered,
            // and each matching node is added to the result set (NodeList).
            e.nextNode ().collectInto (list, filter);
        }
        if ((null != getEndTag ()) && (this != getEndTag ()))
            getEndTag ().collectInto (list, filter);
    }
    ... ...
}
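The recursion above can be sketched in a self-contained form. This is a simplified illustration under stated assumptions, not the library's implementation: `CollectIntoSketch`, `Leaf`, `Composite` and `namesOf` are hypothetical names, `java.util.function.Predicate` stands in for NodeFilter, a plain `List` stands in for NodeList, and end-tag handling is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class CollectIntoSketch {
    // Plays the role of AbstractNode: a node that can only test itself.
    static class Leaf {
        final String name;
        Leaf(String name) { this.name = name; }

        void collectInto(List<Leaf> out, Predicate<Leaf> filter) {
            if (filter.test(this)) {
                out.add(this);
            }
        }
    }

    // Plays the role of CompositeTag: tests itself, then hands the same filter to each child.
    static class Composite extends Leaf {
        final List<Leaf> children = new ArrayList<>();
        Composite(String name, Leaf... kids) {
            super(name);
            children.addAll(List.of(kids));
        }

        @Override
        void collectInto(List<Leaf> out, Predicate<Leaf> filter) {
            super.collectInto(out, filter);
            for (Leaf child : children) {
                child.collectInto(out, filter); // dynamic dispatch recurses into composite children
            }
        }
    }

    // Collects the names of all matching nodes under root, in visiting order.
    static List<String> namesOf(Leaf root, Predicate<Leaf> filter) {
        List<Leaf> matched = new ArrayList<>();
        root.collectInto(matched, filter);
        List<String> names = new ArrayList<>();
        for (Leaf node : matched) {
            names.add(node.name);
        }
        return names;
    }

    public static void main(String[] args) {
        Leaf tree = new Composite("html",
            new Composite("body",
                new Composite("a1", new Leaf("b1")),
                new Composite("a2", new Leaf("b2"))));
        System.out.println(namesOf(tree, n -> true));
        System.out.println(namesOf(tree, n -> n.name.startsWith("a")));
    }
}
```

Because each composite tests itself before its children, matches come out in document order, and the caller only has to start the process at the top-level nodes.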