


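try {
    Parser parser = new Parser();
    parser.setURL(pageUrl); // pageUrl: placeholder, the page URL is not shown in the original

    NodeVisitor visitor = new NodeVisitor() {
        public void visitTag(Tag tag) {
            System.out.println("testVisitorAll()  Tag name is :"
                    + tag.getTagName() + " \n Class is :"
                    + tag.getClass());
        }
    };

    // Walk every node of the page with the visitor.
    parser.visitAllNodesWith(visitor);
} catch (ParserException e) {
    e.printStackTrace();
}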

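try {
    NodeFilter filter = new NodeClassFilter(LinkTag.class);
    Parser parser = new Parser();
    parser.setURL(pageUrl); // pageUrl: placeholder, the page URL is not shown in the original

    // Collect every LinkTag on the page and print its target.
    NodeList list = parser.extractAllNodesThatMatch(filter);
    for (int i = 0; i < list.size(); i++) {
        LinkTag node = (LinkTag) list.elementAt(i);
        System.out.println("testLinkTag() Link is :" + node.extractLink());
    }
} catch (Exception e) {
    e.printStackTrace();
}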

In addition, htmlparser wraps some commonly used operations in the org.htmlparser.beans package to simplify usage. For example:

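LinkBean linkBean = new LinkBean(); // LinkBean manages its own Parser internally
linkBean.setURL(pageUrl); // pageUrl: placeholder, the page URL is not shown in the original
URL[] urls = linkBean.getLinks();
for (int i = 0; i < urls.length; i++) {
    URL url = urls[i];
    System.out.println("testLinkBean() -url  is :" + url);
}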






    Parser is the core class of htmlparser. It can be created through the following factory method and constructors:

    Parser.createParser (String html, String charset)
    Parser ()
    Parser (Lexer lexer, ParserFeedback fb)
    Parser (URLConnection connection, ParserFeedback fb)
    Parser (String resource, ParserFeedback feedback)
    Parser (String resource)




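    Parser parser = new Parser (""); // pass the URL or file name of the page to parse
    for (NodeIterator i = parser.elements (); i.hasMoreElements (); )
      processMyNodes (i.nextNode ());

    Parser also provides these methods for accessing nodes:

    parse (NodeFilter filter): retrieves nodes through a NodeFilter
    visitAllNodesWith (NodeVisitor visitor): traverses nodes through a NodeVisitor
    extractAllNodesThatMatch (NodeFilter filter): retrieves nodes through a NodeFilter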









    The org.htmlparser.filters package defines the filters provided by htmlparser. They are mainly used with extractAllNodesThatMatch (NodeFilter filter) to select elements of a given kind from an HTML page, and include:

    AndFilter, CssSelectorNodeFilter, HasAttributeFilter, HasChildFilter, HasParentFilter, HasSiblingFilter, IsEqualFilter, LinkRegexFilter, LinkStringFilter, NodeClassFilter, NotFilter, OrFilter, RegexFilter, StringFilter, TagNameFilter, XorFilter


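    The org.htmlparser.visitors package defines the visitors provided by htmlparser. They are mainly used with visitAllNodesWith (NodeVisitor visitor) to traverse the elements of an HTML page, and include:

    HtmlPage, LinkFindingVisitor, NodeVisitor, ObjectFindingVisitor, StringFindingVisitor, TagFindingVisitor, TextExtractingVisitor, UrlModifyingVisitor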









1. Logical relations: AND, OR, NOT

AndFilter()
          Creates a new instance of an AndFilter.
AndFilter(NodeFilter[] predicates)
          Creates an AndFilter that accepts nodes acceptable to all given filters.
AndFilter(NodeFilter left, NodeFilter right)
          Creates an AndFilter that accepts nodes acceptable to both filters.

OrFilter()
          Creates a new instance of an OrFilter.
OrFilter(NodeFilter[] predicates)
          Creates an OrFilter that accepts nodes acceptable to any of the given filters.
OrFilter(NodeFilter left, NodeFilter right)
          Creates an OrFilter that accepts nodes acceptable to either filter.




NotFilter()
          Creates a new instance of a NotFilter.
NotFilter(NodeFilter predicate)
          Creates a NotFilter that accepts nodes not acceptable to the predicate.
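
How the logical filters combine can be pictured with plain java.util.function.Predicate objects. The sketch below is not htmlparser code: the Tag class, the IS_DIV and HAS_MENU_CLASS predicates and the sample data are all made up for illustration. It only assumes that AndFilter, OrFilter and NotFilter behave like logical AND, OR and NOT over the filters they wrap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java stand-in, not htmlparser code: Tag and both predicates are hypothetical.
final class FilterLogicSketch {

    /** Tiny model of a tag: a name plus an optional class attribute. */
    static final class Tag {
        final String name;
        final String cssClass; // null when the tag has no class attribute

        Tag(String name, String cssClass) {
            this.name = name;
            this.cssClass = cssClass;
        }

        String describe() {
            return cssClass == null ? name : name + "." + cssClass;
        }
    }

    // Stand-ins for a tag-name filter on "div" and an attribute filter on class="menu".
    static final Predicate<Tag> IS_DIV = t -> t.name.equals("div");
    static final Predicate<Tag> HAS_MENU_CLASS = t -> "menu".equals(t.cssClass);

    static final List<Tag> SAMPLE = Arrays.asList(
            new Tag("div", "menu"), new Tag("div", null),
            new Tag("a", "menu"), new Tag("img", null));

    /** Returns descriptions of the sample tags accepted by the filter, in order. */
    static List<String> select(Predicate<Tag> filter) {
        List<String> accepted = new ArrayList<>();
        for (Tag tag : SAMPLE) {
            if (filter.test(tag)) {
                accepted.add(tag.describe());
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(select(IS_DIV.and(HAS_MENU_CLASS))); // AND: [div.menu]
        System.out.println(select(IS_DIV.or(HAS_MENU_CLASS)));  // OR:  [div.menu, div, a.menu]
        System.out.println(select(IS_DIV.negate()));            // NOT: [a.menu, img]
    }
}
```

With the real library, the same combinations would be built by passing NodeFilter instances to the constructors listed above, and the combined filter would be handed to extractAllNodesThatMatch.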


2. Content

StringFilter: simple but limited in what it can match; for more complex matching, use RegexFilter (regular expressions).

StringFilter()
          Creates a new instance of StringFilter that accepts all string nodes.
StringFilter(String pattern)
          Creates a StringFilter that accepts text nodes containing a string.
StringFilter(String pattern, boolean sensitive)
          Creates a StringFilter that accepts text nodes containing a string.
StringFilter(String pattern, boolean sensitive, Locale locale)
          Creates a StringFilter that accepts text nodes containing a string.

RegexFilter()
          Creates a new instance of RegexFilter that accepts string nodes matching the regular expression ".*" using the FIND strategy.
RegexFilter(String pattern)
          Creates a new instance of RegexFilter that accepts string nodes matching a regular expression using the FIND strategy.
RegexFilter(String pattern, int strategy)
          Creates a new instance of RegexFilter that accepts string nodes matching a regular expression.
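
The strategy argument decides how much of the text the pattern has to cover. As far as I understand the RegexFilter Javadoc, its MATCH, LOOKINGAT and FIND strategies correspond to Matcher.matches(), Matcher.lookingAt() and Matcher.find() from java.util.regex. The sketch below uses only java.util.regex to show the difference; the Strategy enum and the accepts helper are hypothetical names, not part of htmlparser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain java.util.regex sketch; Strategy and accepts() are hypothetical helpers.
final class RegexStrategySketch {

    enum Strategy { MATCH, LOOKINGAT, FIND }

    /** Applies the regex to the text the way the given strategy would. */
    static boolean accepts(String regex, Strategy strategy, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        switch (strategy) {
            case MATCH:
                return matcher.matches();   // the whole text must match
            case LOOKINGAT:
                return matcher.lookingAt(); // the text must start with a match
            default:
                return matcher.find();      // a match anywhere is enough
        }
    }

    public static void main(String[] args) {
        String text = "Price: 42 USD";
        System.out.println(accepts("\\d+", Strategy.MATCH, text));      // false
        System.out.println(accepts("\\d+", Strategy.LOOKINGAT, text));  // false
        System.out.println(accepts("\\d+", Strategy.FIND, text));       // true
        System.out.println(accepts("Price", Strategy.LOOKINGAT, text)); // true
        System.out.println(accepts("Price: \\d+ USD", Strategy.MATCH, text)); // true
    }
}
```

FIND is the most permissive of the three, which fits its role as the default strategy in the constructors above.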


3. Tags

TagNameFilter(): filters by tag name, e.g. div, img, ...

NodeClassFilter(): filters by tag class, e.g. LinkTag.class ...

HasAttributeFilter(): filters by attribute, e.g. HasAttributeFilter("class", "className")

TagNameFilter()
          Creates a new instance of TagNameFilter.
TagNameFilter(String name)
          Creates a TagNameFilter that accepts tags with the given name.

NodeClassFilter()
          Creates a NodeClassFilter that accepts Html tags.
NodeClassFilter(Class cls)
          Creates a NodeClassFilter that accepts tags of the given class.

HasAttributeFilter()
          Creates a new instance of HasAttributeFilter.
HasAttributeFilter(String attribute)
          Creates a new instance of HasAttributeFilter that accepts tags with the given attribute.
HasAttributeFilter(String attribute, String value)
          Creates a new instance of HasAttributeFilter that accepts tags with the given attribute and value.

LinkRegexFilter(String regexPattern)
          Creates a LinkRegexFilter that accepts LinkTag nodes containing a URL that matches the supplied regex pattern.
LinkRegexFilter(String regexPattern, boolean caseSensitive)
          Creates a LinkRegexFilter that accepts LinkTag nodes containing a URL that matches the supplied regex pattern.

LinkStringFilter(String pattern)
          Creates a LinkStringFilter that accepts LinkTag nodes containing a URL that matches the supplied pattern.
LinkStringFilter(String pattern, boolean caseSensitive)
          Creates a LinkStringFilter that accepts LinkTag nodes containing a URL that matches the supplied pattern.


4. Hierarchy

HasParentFilter()
          Creates a new instance of HasParentFilter.
HasParentFilter(NodeFilter filter)
          Creates a new instance of HasParentFilter that accepts nodes with the direct parent acceptable to the filter.
HasParentFilter(NodeFilter filter, boolean recursive)
          Creates a new instance of HasParentFilter that accepts nodes with a parent acceptable to the filter.

HasChildFilter()
          Creates a new instance of a HasChildFilter.
HasChildFilter(NodeFilter filter)
          Creates a new instance of HasChildFilter that accepts nodes with a direct child acceptable to the filter.
HasChildFilter(NodeFilter filter, boolean recursive)
          Creates a new instance of HasChildFilter that accepts nodes with a child acceptable to the filter.
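
The recursive flag is the main thing to get right with these two filters. The sketch below models it for the parent direction on a tiny hand-built tree. It is not htmlparser code: the Node class and the hasParent helper are hypothetical. It assumes that recursive = false checks only the direct parent, while recursive = true keeps walking up through all enclosing nodes.

```java
import java.util.function.Predicate;

// Plain-Java model of the recursive flag; Node and hasParent() are hypothetical.
final class HierarchySketch {

    /** Minimal tree node that only knows its name and its parent. */
    static final class Node {
        final String name;
        final Node parent; // null for the root

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** True if the direct parent passes the filter or, when recursive, any ancestor does. */
    static boolean hasParent(Node node, Predicate<Node> filter, boolean recursive) {
        Node parent = node.parent;
        if (parent == null) {
            return false;
        }
        if (filter.test(parent)) {
            return true;
        }
        return recursive && hasParent(parent, filter, true);
    }

    public static void main(String[] args) {
        // table > tr > td > a
        Node table = new Node("table", null);
        Node tr = new Node("tr", table);
        Node td = new Node("td", tr);
        Node a = new Node("a", td);

        Predicate<Node> isTable = n -> n.name.equals("table");
        Predicate<Node> isTd = n -> n.name.equals("td");

        System.out.println(hasParent(a, isTd, false));    // true: td is the direct parent
        System.out.println(hasParent(a, isTable, false)); // false: table is not the direct parent
        System.out.println(hasParent(a, isTable, true));  // true: table is an ancestor
    }
}
```

HasChildFilter applies the same idea in the opposite direction, looking at direct children only or at all descendants.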







