浅谈HtmlParser

  使用Heritrix抓取到自己所需的网页后,还需要对网页中的内容进行分类等操作,这个时候就需要用到htmlparser,但是使用htmlparser并不是那么容易!因为相关的文档比较少,很多更能需要开发者自己去摸索,去发掘!

  不过这里给大家提供一个比较好的网站(htmlparser的API):http://tool.oschina.net/apidocs/apidoc?api=HTMLParser,这个API是英文版的,英语不好的这时就要逼迫自己看下去了。

  HTMLParser的核心模块是org.htmlparser.Parser类,这个类实际完成了对于HTML页面的分析工作。这个类有下面几个构造函数:

public Parser ();
public Parser (Lexer lexer, ParserFeedback fb);
public Parser (URLConnection connection, ParserFeedback fb) throws ParserException;
public Parser (String resource, ParserFeedback feedback) throws ParserException;
public Parser (String resource) throws ParserException;
public Parser (Lexer lexer);
public Parser (URLConnection connection) throws ParserException;

和一个静态类

public static Parser createParser (String html, String charset);

  对于大多数使用者来说,使用最多的是通过一个URLConnection或者一个保存有网页内容的字符串来初始化Parser,或者使用静态函数来生成一个Parser对象。ParserFeedback的代码很简单,是针对调试和跟踪分析过程的,一般不需要改变。而使用Lexer则是一个相对比较高级的话题,放到以后再讨论吧。
  这里比较有趣的一点是,如果需要设置页面的编码方式的话,不使用Lexer就只有静态函数一个方法了。对于大多数中文页面来说,好像这是应该用得比较多的一个方法。

下面是初始化Parser的例子(通过打开一个网页的URL,中间的OpenFile方法是在打开一个本地的html文件时使用的)。

【加载的网页文件:index.html】

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
    <head>
        <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
        <title>百度</title>
        <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
    </head>
    <body>
        <div  align = "center" class = "photo" >
            <img src = "../image/baidu.PNG" >
        </div>
        <div align = "center" class = "body">
            <table cellpadding="8">
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
                </td>
                <td>
                    <font color = "black">网页</font>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
                </td>
                <td>
                    <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
                </td>
            </table>
            <input class = "input" >
        </div>
    </body>

</html>
View Code

 

【源码:htmlparser_1.java】

 1 import java.io.BufferedReader;
 2 import java.io.File;
 3 import java.io.FileInputStream;
 4 import java.io.InputStreamReader;
 5 import java.net.HttpURLConnection;
 6 import java.net.URL;
 7 import org.htmlparser.Parser;
 8 import org.htmlparser.visitors.TextExtractingVisitor;
 9 
10 public class Main {
11     private static String ENCODE = "GBK";
12     private static void message(String msg) {
13         // TODO Auto-generated method stub
14         try {
15             System.out.println(new String(msg.getBytes(ENCODE), System
16                     .getProperty("file.encoding")));
17         } catch (Exception e) {
18             // TODO: handle exception
19             e.printStackTrace();
20         }
21     }
22     
23     /*
24      * 打开一个文件
25      */
26     public static String OpenFile(String FileName) {
27         try {
28             File mFile = new File(FileName);
29             FileInputStream mFileInputStream = new FileInputStream(mFile);
30             InputStreamReader mInputStreamReader = new InputStreamReader(
31                     mFileInputStream, ENCODE);
32             BufferedReader mBufferedReader = new BufferedReader(
33                     mInputStreamReader);
34             String mContent = "";
35             String mTemp = "";
36             while ((mTemp = mBufferedReader.readLine()) != null) {
37                 mContent += mTemp + "\n";
38             }
39             mBufferedReader.close();
40         } catch (Exception e) {
41             // TODO: handle exception
42             e.printStackTrace();
43             return "";
44         }
45         return FileName;
46     }
47 
48     /*
49      * main方法
50      */
51     public static void main(String[] args) {
52         // String mContent=OpenFile("");
53         try {
54             Parser mParser = new Parser((HttpURLConnection) (new URL(
55                     "http://127.0.0.1/HtmlParser/index.html")).openConnection());
56             TextExtractingVisitor mExtractingVisitor = new TextExtractingVisitor();
57             mParser.visitAllNodesWith(mExtractingVisitor);
58             String textInPage = mExtractingVisitor.getExtractedText();
59             message(textInPage);
60         } catch (Exception e) {
61             // TODO: handle exception
62             e.printStackTrace();
63         }
64     }
65 
66 }

测试输出结果:

 1     
 2         
 3         百度
 4         
 5     
 6     
 7         
 8             
 9         
10         
11             
12                 
13                     新闻
14                 
15                 
16                     网页
17                 
18                 
19                     贴吧
20                 
21                 
22                     知道
23                 
24                 
25                     音乐
26                 
27                 
28                     图片
29                 
30                 
31                     视频
32                 
33                 
34                     地图
35                 
36             
37             
38         
39     
View Code

 HTMLParser将解析过的信息保存为一个树的结构。Node是信息保存的数据类型基础。

请看Node的定义:
public interface Node extends Cloneable;

Node中包含的方法有几类:

对于树型结构进行遍历的函数,这些函数最容易理解:

Node getParent ():取得父节点
NodeList getChildren ():取得子节点的列表
Node getFirstChild ():取得第一个子节点
Node getLastChild ():取得最后一个子节点
Node getPreviousSibling ():取得前一个兄弟(不好意思,英文是兄弟姐妹,直译太麻烦而且不符合习惯,对不起女同胞了)
Node getNextSibling ():取得下一个兄弟节点

 取得Node内容的函数:

String getText ():取得文本
String toPlainTextString():取得纯文本信息。
String toHtml () :取得HTML信息(原始HTML)
String toHtml (boolean verbatim):取得HTML信息(原始HTML)
String toString ():取得字符串信息(原始HTML)
Page getPage ():取得这个Node对应的Page对象
int getStartPosition ():取得这个Node在HTML页面中的起始位置
int getEndPosition ():取得这个Node在HTML页面中的结束位置

用于Filter过滤的函数:

void collectInto (NodeList list, NodeFilter filter):基于filter的条件对于这个节点进行过滤,符合条件的节点放到list中。

 用于Visitor遍历的函数:

void accept (NodeVisitor visitor):对这个Node应用visitor

用于修改内容的函数,这类用得比较少:

void setPage (Page page):设置这个Node对应的Page对象
void setText (String text):设置文本
void setChildren (NodeList children):设置子节点列表

其他函数:

void doSemanticAction (): 执行这个Node对应的操作(只有少数Tag有对应的操作)
Object clone (): 接口Clone的抽象函数。

 实际我们用HTMLParser最多的是处理HTML页面,Filter或Visitor相关的函数是必须的,然后第一类和第二类函数是用得最多的。第一类函数比较容易理解,下面用例子说明一下第二类函数。

【源码:htmlparser_2.java】

 1 import java.io.BufferedReader;
 2 import java.io.File;
 3 import java.io.FileInputStream;
 4 import java.io.InputStreamReader;
 5 import java.net.HttpURLConnection;
 6 import java.net.URL;
 7 import org.htmlparser.Node;
 8 import org.htmlparser.Parser;
 9 import org.htmlparser.util.NodeIterator;
10 import org.htmlparser.visitors.TextExtractingVisitor;
11 import org.omg.CosNaming.NamingContextPackage.NotEmpty;
12 
13 public class Main {
14     private static String ENCODE = "utf-8";
15     private static void message(String msg) {
16         // TODO Auto-generated method stub
17         try {
18             System.out.println(new String(msg.getBytes(ENCODE), System
19                     .getProperty("file.encoding")));
20         } catch (Exception e) {
21             // TODO: handle exception
22             e.printStackTrace();
23         }
24     }
25     
26     /*
27      * 打开一个文件
28      */
29     public static String OpenFile(String FileName) {
30         try {
31             File mFile = new File(FileName);
32             FileInputStream mFileInputStream = new FileInputStream(mFile);
33             InputStreamReader mInputStreamReader = new InputStreamReader(
34                     mFileInputStream, ENCODE);
35             BufferedReader mBufferedReader = new BufferedReader(
36                     mInputStreamReader);
37             String mContent = "";
38             String mTemp = "";
39             while ((mTemp = mBufferedReader.readLine()) != null) {
40                 mContent += mTemp + "\n";
41             }
42             mBufferedReader.close();
43         } catch (Exception e) {
44             // TODO: handle exception
45             e.printStackTrace();
46             return "";
47         }
48         return FileName;
49     }
50 
51     /*
52      * main方法
53      */
54     public static void main(String[] args) {
55         // String mContent=OpenFile("");
56         try {
57             Parser mParser = new Parser((HttpURLConnection) (new URL(
58                     "http://127.0.0.1/HtmlParser/index.html")).openConnection());
59 //            TextExtractingVisitor mExtractingVisitor = new TextExtractingVisitor();
60 //            mParser.visitAllNodesWith(mExtractingVisitor);
61 //            String textInPage = mExtractingVisitor.getExtractedText();
62 //            message(textInPage);
63             
64             for (NodeIterator i = mParser.elements(); i.hasMoreNodes();) {
65                 Node node = i.nextNode();
66                 message("getText:"+node.getText());
67                 message("getPlainText:"+node.toPlainTextString());
68                 message("toHtml:"+node.toHtml());
69                 message("toHtml(true):"+node.toHtml(true));
70                 message("tohtml(false):"+node.toHtml(false));
71                 message("toString:"+node.toString());
72                 message("==============================");
73             }
74         } catch (Exception e) {
75             // TODO: handle exception
76             e.printStackTrace();
77         }
78     }
79 }

测试输出结果:

  1 getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
  2 getPlainText:
  3 toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  4 toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  5 tohtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  6 toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd; begins at : 0; ends at : 121
  7 ==============================
  8 getText:
  9 
 10 getPlainText:
 11 
 12 toHtml:
 13 
 14 toHtml(true):
 15 
 16 tohtml(false):
 17 
 18 toString:Txt (121[0,121],123[1,0]): \n
 19 ==============================
 20 getText:html
 21 getPlainText:
 22     
 23         
 24         百度
 25         
 26     
 27     
 28         
 29             
 30         
 31         
 32             
 33                 
 34                     新闻
 35                 
 36                 
 37                     网页
 38                 
 39                 
 40                     贴吧
 41                 
 42                 
 43                     知道
 44                 
 45                 
 46                     音乐
 47                 
 48                 
 49                     图片
 50                 
 51                 
 52                     视频
 53                 
 54                 
 55                     地图
 56                 
 57             
 58             
 59         
 60     
 61 
 62 
 63 toHtml:<html>
 64     <head>
 65         <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
 66         <title>百度</title>
 67         <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
 68     </head>
 69     <body>
 70         <div  align = "center" class = "photo" >
 71             <img src = "../image/baidu.PNG" >
 72         </div>
 73         <div align = "center" class = "body">
 74             <table cellpadding="8">
 75                 <td>
 76                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
 77                 </td>
 78                 <td>
 79                     <font color = "black">网页</font>
 80                 </td>
 81                 <td>
 82                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
 83                 </td>
 84                 <td>
 85                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
 86                 </td>
 87                 <td>
 88                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
 89                 </td>
 90                 <td>
 91                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
 92                 </td>
 93                 <td>
 94                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
 95                 </td>
 96                 <td>
 97                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
 98                 </td>
 99             </table>
100             <input class = "input" >
101         </div>
102     </body>
103 
104 </html>
105 toHtml(true):<html>
106     <head>
107         <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
108         <title>百度</title>
109         <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
110     </head>
111     <body>
112         <div  align = "center" class = "photo" >
113             <img src = "../image/baidu.PNG" >
114         </div>
115         <div align = "center" class = "body">
116             <table cellpadding="8">
117                 <td>
118                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
119                 </td>
120                 <td>
121                     <font color = "black">网页</font>
122                 </td>
123                 <td>
124                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
125                 </td>
126                 <td>
127                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
128                 </td>
129                 <td>
130                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
131                 </td>
132                 <td>
133                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
134                 </td>
135                 <td>
136                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
137                 </td>
138                 <td>
139                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
140                 </td>
141             </table>
142             <input class = "input" >
143         </div>
144     </body>
145 
146 </html>
147 tohtml(false):<html>
148     <head>
149         <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
150         <title>百度</title>
151         <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
152     </head>
153     <body>
154         <div  align = "center" class = "photo" >
155             <img src = "../image/baidu.PNG" >
156         </div>
157         <div align = "center" class = "body">
158             <table cellpadding="8">
159                 <td>
160                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
161                 </td>
162                 <td>
163                     <font color = "black">网页</font>
164                 </td>
165                 <td>
166                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
167                 </td>
168                 <td>
169                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
170                 </td>
171                 <td>
172                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
173                 </td>
174                 <td>
175                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
176                 </td>
177                 <td>
178                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
179                 </td>
180                 <td>
181                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
182                 </td>
183             </table>
184             <input class = "input" >
185         </div>
186     </body>
187 
188 </html>
189 toString:Tag (123[1,0],129[1,6]): html
190   Txt (129[1,6],132[2,1]): \n\t
191   Tag (132[2,1],138[2,7]): head
192     Txt (138[2,7],142[3,2]): \n\t\t
193     Tag (142[3,2],216[3,76]): meta http-equiv = "Content-Type" content = "text/ht...
194     Txt (216[3,76],220[4,2]): \n\t\t
195     Tag (220[4,2],227[4,9]): title
196       Txt (227[4,9],229[4,11]): 百度
197       End (229[4,11],237[4,19]): /title
198     Txt (237[4,19],241[5,2]): \n\t\t
199     Tag (241[5,2],302[5,63]): link href = "a_1.css" rel = "stylesheet" type = "te...
200     Txt (302[5,63],305[6,1]): \n\t
201     End (305[6,1],312[6,8]): /head
202   Txt (312[6,8],315[7,1]): \n\t
203   Tag (315[7,1],321[7,7]): body
204     Txt (321[7,7],325[8,2]): \n\t\t
205     Tag (325[8,2],365[8,42]): div  align = "center" class = "photo" 
206       Txt (365[8,42],370[9,3]): \n\t\t\t
207       Tag (370[9,3],403[9,36]): img src = "../image/baidu.PNG" 
208       Txt (403[9,36],407[10,2]): \n\t\t
209       End (407[10,2],413[10,8]): /div
210     Txt (413[10,8],417[11,2]): \n\t\t
211     Tag (417[11,2],454[11,39]): div align = "center" class = "body"
212       Txt (454[11,39],459[12,3]): \n\t\t\t
213       Tag (459[12,3],482[12,26]): table cellpadding="8"
214         Txt (482[12,26],488[13,4]): \n\t\t\t\t
215         Tag (488[13,4],492[13,8]): td
216           Txt (492[13,8],499[14,5]): \n\t\t\t\t\t
217           Tag (499[14,5],552[14,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
218             Txt (552[14,58],554[14,60]): 新闻
219             End (554[14,60],558[14,64]): /a
220           Txt (558[14,64],564[15,4]): \n\t\t\t\t
221           End (564[15,4],569[15,9]): /td
222         Txt (569[15,9],575[16,4]): \n\t\t\t\t
223         Tag (575[16,4],579[16,8]): td
224           Txt (579[16,8],586[17,5]): \n\t\t\t\t\t
225           Tag (586[17,5],608[17,27]): font color = "black"
226           Txt (608[17,27],610[17,29]): 网页
227           End (610[17,29],617[17,36]): /font
228           Txt (617[17,36],623[18,4]): \n\t\t\t\t
229           End (623[18,4],628[18,9]): /td
230         Txt (628[18,9],634[19,4]): \n\t\t\t\t
231         Tag (634[19,4],638[19,8]): td
232           Txt (638[19,8],645[20,5]): \n\t\t\t\t\t
233           Tag (645[20,5],698[20,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
234             Txt (698[20,58],700[20,60]): 贴吧
235             End (700[20,60],704[20,64]): /a
236           Txt (704[20,64],710[21,4]): \n\t\t\t\t
237           End (710[21,4],715[21,9]): /td
238         Txt (715[21,9],721[22,4]): \n\t\t\t\t
239         Tag (721[22,4],725[22,8]): td
240           Txt (725[22,8],732[23,5]): \n\t\t\t\t\t
241           Tag (732[23,5],785[23,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
242             Txt (785[23,58],787[23,60]): 知道
243             End (787[23,60],791[23,64]): /a
244           Txt (791[23,64],797[24,4]): \n\t\t\t\t
245           End (797[24,4],802[24,9]): /td
246         Txt (802[24,9],808[25,4]): \n\t\t\t\t
247         Tag (808[25,4],812[25,8]): td
248           Txt (812[25,8],819[26,5]): \n\t\t\t\t\t
249           Tag (819[26,5],872[26,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
250             Txt (872[26,58],874[26,60]): 音乐
251             End (874[26,60],878[26,64]): /a
252           Txt (878[26,64],884[27,4]): \n\t\t\t\t
253           End (884[27,4],889[27,9]): /td
254         Txt (889[27,9],895[28,4]): \n\t\t\t\t
255         Tag (895[28,4],899[28,8]): td
256           Txt (899[28,8],906[29,5]): \n\t\t\t\t\t
257           Tag (906[29,5],959[29,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
258             Txt (959[29,58],961[29,60]): 图片
259             End (961[29,60],965[29,64]): /a
260           Txt (965[29,64],971[30,4]): \n\t\t\t\t
261           End (971[30,4],976[30,9]): /td
262         Txt (976[30,9],982[31,4]): \n\t\t\t\t
263         Tag (982[31,4],986[31,8]): td
264           Txt (986[31,8],993[32,5]): \n\t\t\t\t\t
265           Tag (993[32,5],1046[32,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
266             Txt (1046[32,58],1048[32,60]): 视频
267             End (1048[32,60],1052[32,64]): /a
268           Txt (1052[32,64],1058[33,4]): \n\t\t\t\t
269           End (1058[33,4],1063[33,9]): /td
270         Txt (1063[33,9],1069[34,4]): \n\t\t\t\t
271         Tag (1069[34,4],1073[34,8]): td
272           Txt (1073[34,8],1080[35,5]): \n\t\t\t\t\t
273           Tag (1080[35,5],1133[35,58]): a href = "#" target = _blank title = "欢迎来到&#10百...
274             Txt (1133[35,58],1135[35,60]): 地图
275             End (1135[35,60],1139[35,64]): /a
276           Txt (1139[35,64],1145[36,4]): \n\t\t\t\t
277           End (1145[36,4],1150[36,9]): /td
278         Txt (1150[36,9],1155[37,3]): \n\t\t\t
279         End (1155[37,3],1163[37,11]): /table
280       Txt (1163[37,11],1168[38,3]): \n\t\t\t
281       Tag (1168[38,3],1192[38,27]): input class = "input" 
282       Txt (1192[38,27],1196[39,2]): \n\t\t
283       End (1196[39,2],1202[39,8]): /div
284     Txt (1202[39,8],1205[40,1]): \n\t
285     End (1205[40,1],1212[40,8]): /body
286   Txt (1212[40,8],1216[42,0]): \n\n
287   End (1216[42,0],1223[42,7]): /html
288 
289 ==============================
View Code

   对于第一个Node的内容,对应的就是第一行<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">,从这个输出结果中,也可以看出内容的树状结构。或者说是树林结构。在Page内容的第一层Tag,如DOCTYPE,head和html,分别形成了一个最高层的Node节点(很多人可能对第二个和第四个Node的内容有点奇怪。实际上这两个Node就是两个换行符号。HTMLParser把HTML页面内容中的所有换行,空格,Tab等都转换成了相应的Tag,所以就出现了这样的Node。虽然内容少但是级别高,呵呵)

  getPlainTextString是把用户可以看到的内容都包含了。有趣的有两点,一是<head>标签中的Title内容是在plainText中的,可能在标题中可见的也算可见吧。另外就是象前面说的,HTML内容中的换行符什么的,也都成了plainText,这个逻辑上好像有点问题。

  另外可能大家发现toHtml,toHtml(true)和toHtml(false)的结果没什么区别。实际也是这样的,如果跟踪HTMLParser的代码就可以发现,Node的子类是AbstractNode,其中实现了toHtml()的代码,直接调用toHtml(false),而AbstractNode的三个子类RemarkNode,TagNode和TextNode中,toHtml(boolean verbatim)的实现中,都没有处理verbatim参数,所以三个函数的结果是一模一样的。如果你不需要实现你自己的什么特殊处理,简单使用toHtml就可以了。

HTML的Node类继承关系如下图(这个是从别的文章Copy的)

浅谈HtmlParser_第1张图片

他被组织成三棵树的森林,其中以<html>标签为根节点的树高度最大,网页的树状结构图如下:

浅谈HtmlParser_第2张图片

  html树中要特别注意的是每一个回车换行,HTMLParser会将他们看做一个节点处理。

  AbstractNodes是Node的直接子类,也是一个抽象类。它的三个直接子类实现是RemarkNode,用于保存注释。在输出结果的toString部分中可以看到有一个"Rem (345[6,2],356[6,13]): 这是注释",就是一个RemarkNode。TextNode也很简单,就是用户可见的文字信息。TagNode是最复杂的,包含了HTML语言中的所有标签,而且可以扩展(扩展 HTMLParser 对自定义标签的处理能力)。TagNode包含两类,一类是简单的Tag,实际就是不能包含其他Tag的标签,只能做叶子节点。另一类是CompositeTag,就是可以包含其他Tag,是分支节点

  HTMLParser遍历了网页的内容以后,以树(森林)结构保存了结果。HTMLParser访问结果内容的方法有两种。使用Filter和使用Visitor。

(一)Filter类
  顾名思义,Filter就是对于结果进行过滤,取得需要的内容。HTMLParser在org.htmlparser.filters包之内一共定义了16个不同的Filter,也可以分为几类。
  判断类Filter:

TagNameFilter
HasAttributeFilter
HasChildFilter
HasParentFilter
HasSiblingFilter
IsEqualFilter

   逻辑运算Filter:

AndFilter
NotFilter
OrFilter
XorFilter

  其他Filter:

NodeClassFilter
StringFilter
LinkStringFilter
LinkRegexFilter
RegexFilter
CssSelectorNodeFilter

 所有的Filter类都实现了org.htmlparser.NodeFilter接口。这个接口只有一个主要函数:boolean accept (Node node);

各个子类分别实现这个函数,用于判断输入的Node是否符合这个Filter的过滤条件,如果符合,返回true,否则返回false。

(二)判断类Filter
  2.1 TagNameFilter

  TabNameFilter是最容易理解的一个Filter,根据Tag的名字进行过滤。

 【源码:htmlparser_3.java】(此处只给出main方法的代码,其余代码同上)

 1     /*
 2      * main方法
 3      */
 4     public static void main(String[] args) {
 5         // String mContent=OpenFile("");
 6         try {
 7             Parser mParser = new Parser((HttpURLConnection) (new URL(
 8                     "http://127.0.0.1/HtmlParser/index.html")).openConnection());
 9             
10 //            TextExtractingVisitor mExtractingVisitor = new TextExtractingVisitor();
11 //            mParser.visitAllNodesWith(mExtractingVisitor);
12 //            String textInPage = mExtractingVisitor.getExtractedText();
13 //            message(textInPage);
14             
15 //            for (NodeIterator i = mParser.elements(); i.hasMoreNodes();) {
16 //                Node node = i.nextNode();
17 //                message("getText:"+node.getText());
18 //                message("getPlainText:"+node.toPlainTextString());
19 //                message("toHtml:"+node.toHtml());
20 //                message("toHtml(true):"+node.toHtml(true));
21 //                message("tohtml(false):"+node.toHtml(false));
22 //                message("toString:"+node.toString());
23 //                message("==============================");
24 //            }
25             
26             NodeFilter mNodeFilter = new TagNameFilter("DIV");
27             NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
28             if (mNodeFilter!=null) {
29                 for (int i = 0; i < mNodeList.size(); i++) {
30                     Node textNode = (Node)mNodeList.elementAt(i);
31                     message("getText:"+textNode.getText());
32                     message("===================================");
33                 }
34             }
35             
36         } catch (Exception e) {
37             // TODO: handle exception
38             e.printStackTrace();
39         }
40     }

 测试输出结果:

1 getText:div  align = "center" class = "photo" 
2 ===================================
3 getText:div align = "center" class = "body"
4 ===================================
View Code

可以看出文件中两个Div节点都被取出了。下面可以针对这两个DIV节点进行操作。

  2.2 HasChildFilter

  下面让我们看看HasChildFilter。刚刚看到这个Filter的时候,我想当然地认为这个Filter返回的是有Child的Tag。直接初始化了一个
  NodeFilter filter = new HasChildFilter();
  结果调用NodeList nodes = parser.extractAllNodesThatMatch(filter);的时候HasChildFilter内部直接发生NullPointerException。读了一下HasChildFilter的代码,才发现,实际HasChildFilter是返回有符合条件的子节点的节点,需要另外一个Filter作为过滤子节点的参数。缺省的构造函数虽然可以初始化,但是由于子节点的Filter是null,所以使用的时候发生了Exception。从这点来看,HTMLParser的代码还有很多可以优化的的地方。呵呵。

 修改代码:

 1     /*
 2      * main方法
 3      */
 4     public static void main(String[] args) {
 5         // String mContent=OpenFile("");
 6         try {
 7             Parser mParser = new Parser((HttpURLConnection) (new URL(
 8                     "http://127.0.0.1/HtmlParser/index.html")).openConnection());            
 9             NodeFilter mInnerFilter = new TagNameFilter("DIV");
10             NodeFilter mNodeFilter = new HasChildFilter(mInnerFilter);
11             NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
12             if (mNodeFilter!=null) {
13             for (int i = 0; i < mNodeList.size(); i++) {
14                 Node textNode = (Node)mNodeList.elementAt(i);
15                 message("getText:"+textNode.getText());
16                 message("===================================");
17             }
18         }
19             
20         } catch (Exception e) {
21             // TODO: handle exception
22             e.printStackTrace();
23         }
24     }

 

 测试输出结果:

1 getText:body
2 ===================================
View Code

 在此处可以看到,输出的是含有DIV子Tag的Tag节点。(body有子节点DIV“<div  align = "center" class = "photo" >”)

注意HasChildFilter还有一个构造函数:public HasChildFilter (NodeFilter filter, boolean recursive)

如果recursive是false,则只对第一级子节点进行过滤。比如前面的例子,body在第一级的子节点里就有DIV节点,所以匹配上了。如果我们用下面的方法调用:

NodeFilter filter = new HasChildFilter( innerFilter, true );

 测试输出结果:

1 getText:html
2 ===================================
3 getText:body
4 ===================================
View Code

 可以看到输出结果中多了一个html ,这个是整个HTML页面的节点(根节点),虽然这个节点下直接没有DIV节点,但是它的子节点body下面有DIV节点,所以它也被匹配上了。

  2.3 HasAttributeFilter

  HasAttributeFilter有3个构造函数:
  public HasAttributeFilter ();
  public HasAttributeFilter (String attribute);
  public HasAttributeFilter (String attribute, String value);
  这个Filter可以匹配出包含制定名字的属性,或者制定属性为指定值的节点。还是用例子说明比较容易。

 调用方法1:

1             NodeFilter mNodeFilter = new HasAttributeFilter();
2             NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

 输出结果:

什么也没有输出

调用方法2:

1             NodeFilter mNodeFilter = new HasAttributeFilter("class");
2             NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

 输出结果:

1 getText:div  align = "center" class = "photo" 
2 ===================================
3 getText:div align = "center" class = "body"
4 ===================================
5 getText:input class = "input" 
6 ===================================
View Code

 调用方法3:

1             NodeFilter mNodeFilter = new HasAttributeFilter("class","photo");
2             NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

 输出结果:

1 getText:div  align = "center" class = "photo" 
2 ===================================
View Code

  2.4 其他判断列Filter
  HasParentFilter和HasSiblingFilter的功能与HasChildFilter类似,大家自己试一下就应该了解了。

  IsEqualFilter的构造函数参数是一个Node:
  public IsEqualFilter (Node node) {
    mNode = node;
  }
  accept函数也很简单:
  public boolean accept (Node node) {
    return (mNode == node);
  }
  不需要过多说明了。

 

(三)逻辑运算Filter

  前面介绍的都是简单的Filter,只能针对某种单一类型的条件进行过滤。HTMLParser支持对于简单类型的Filter进行组合,从而实现复杂的条件。原理和一般编程语言的逻辑运算是一样的。

  3.1 AndFilter

  AndFilter可以把两种Filter进行组合,只有同时满足条件的Node才会被过滤。
  测试代码:

1 NodeFilter mNodeFilterLeft = new HasAttributeFilter("class");
2 NodeFilter mNodeFilterRight = new HasAttributeFilter("align");
3 NodeFilter mNodeFilter = new AndFilter(mNodeFilterLeft, mNodeFilterRight);
4 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

 

测试输出结果:

1 getText:div  align = "center" class = "photo" 
2 ===================================
3 getText:div align = "center" class = "body"
4 ===================================
View Code

 

  3.2 OrFilter
  把前面的AndFilter换成OrFilter

   测试代码:

1 NodeFilter mNodeFilterLeft = new HasAttributeFilter("class");
2 NodeFilter mNodeFilterRight = new HasAttributeFilter("align");
3 NodeFilter mNodeFilter = new OrFilter(mNodeFilterLeft, mNodeFilterRight);
4 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

 

  测试输出结果:

1 getText:div  align = "center" class = "photo" 
2 ===================================
3 getText:div align = "center" class = "body"
4 ===================================
5 getText:input class = "input" 
6 ===================================
View Code

 

  3.3 NotFilter
  把前面的AndFilter换成NotFilter

  测试代码:

1 NodeFilter mNodeFilterLeft = new HasAttributeFilter("class");
2 NodeFilter mNodeFilterRight = new HasAttributeFilter("align");
3 NodeFilter mNodeFilter = new NotFilter(new OrFilter(mNodeFilterLeft,mNodeFilterRight));
4 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

 

  测试输出结果:

  1 getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
  2 ===================================
  3 getText:
  4 
  5 ===================================
  6 getText:html
  7 ===================================
  8 getText:
  9     
 10 ===================================
 11 getText:head
 12 ===================================
 13 getText:
 14         
 15 ===================================
 16 getText:meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/
 17 ===================================
 18 getText:
 19         
 20 ===================================
 21 getText:title
 22 ===================================
 23 getText:百度
 24 ===================================
 25 getText:/title
 26 ===================================
 27 getText:
 28         
 29 ===================================
 30 getText:link href = "a_1.css" rel = "stylesheet" type = "text/css"/
 31 ===================================
 32 getText:
 33     
 34 ===================================
 35 getText:/head
 36 ===================================
 37 getText:
 38     
 39 ===================================
 40 getText:body
 41 ===================================
 42 getText:
 43         
 44 ===================================
 45 getText:
 46             
 47 ===================================
 48 getText:img src = "../image/baidu.PNG" 
 49 ===================================
 50 getText:
 51         
 52 ===================================
 53 getText:/div
 54 ===================================
 55 getText:
 56         
 57 ===================================
 58 getText:
 59             
 60 ===================================
 61 getText:table cellpadding="8"
 62 ===================================
 63 getText:
 64                 
 65 ===================================
 66 getText:td
 67 ===================================
 68 getText:
 69                     
 70 ===================================
 71 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
 72 ===================================
 73 getText:新闻
 74 ===================================
 75 getText:/a
 76 ===================================
 77 getText:
 78                 
 79 ===================================
 80 getText:/td
 81 ===================================
 82 getText:
 83                 
 84 ===================================
 85 getText:td
 86 ===================================
 87 getText:
 88                     
 89 ===================================
 90 getText:font color = "black"
 91 ===================================
 92 getText:网页
 93 ===================================
 94 getText:/font
 95 ===================================
 96 getText:
 97                 
 98 ===================================
 99 getText:/td
100 ===================================
101 getText:
102                 
103 ===================================
104 getText:td
105 ===================================
106 getText:
107                     
108 ===================================
109 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
110 ===================================
111 getText:贴吧
112 ===================================
113 getText:/a
114 ===================================
115 getText:
116                 
117 ===================================
118 getText:/td
119 ===================================
120 getText:
121                 
122 ===================================
123 getText:td
124 ===================================
125 getText:
126                     
127 ===================================
128 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
129 ===================================
130 getText:知道
131 ===================================
132 getText:/a
133 ===================================
134 getText:
135                 
136 ===================================
137 getText:/td
138 ===================================
139 getText:
140                 
141 ===================================
142 getText:td
143 ===================================
144 getText:
145                     
146 ===================================
147 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
148 ===================================
149 getText:音乐
150 ===================================
151 getText:/a
152 ===================================
153 getText:
154                 
155 ===================================
156 getText:/td
157 ===================================
158 getText:
159                 
160 ===================================
161 getText:td
162 ===================================
163 getText:
164                     
165 ===================================
166 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
167 ===================================
168 getText:图片
169 ===================================
170 getText:/a
171 ===================================
172 getText:
173                 
174 ===================================
175 getText:/td
176 ===================================
177 getText:
178                 
179 ===================================
180 getText:td
181 ===================================
182 getText:
183                     
184 ===================================
185 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
186 ===================================
187 getText:视频
188 ===================================
189 getText:/a
190 ===================================
191 getText:
192                 
193 ===================================
194 getText:/td
195 ===================================
196 getText:
197                 
198 ===================================
199 getText:td
200 ===================================
201 getText:
202                     
203 ===================================
204 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
205 ===================================
206 getText:地图
207 ===================================
208 getText:/a
209 ===================================
210 getText:
211                 
212 ===================================
213 getText:/td
214 ===================================
215 getText:
216             
217 ===================================
218 getText:/table
219 ===================================
220 getText:
221             
222 ===================================
223 getText:
224         
225 ===================================
226 getText:/div
227 ===================================
228 getText:
229     
230 ===================================
231 getText:/body
232 ===================================
233 getText:
234 
235 
236 ===================================
237 getText:/html
238 ===================================
View Code

 

  3.4 XorFilter(暂未实现)
  把前面的AndFilter换成NotFilter

  测试代码:……

  测试输出结果:……

(四)其他Filter
  4.1 NodeClassFilter

  这个Filter用于判断节点类型是否是某个特定的Node类型。在上面中我们已经了解了Node的不同类型,这个Filter就可以针对类型进行过滤。

  测试代码:

  测试输出结果:

  4.2 StringFilter

  这个Filter用于过滤显示字符串中包含制定内容的Tag。注意是可显示的字符串,不可显示的字符串中的内容(例如注释,链接等等)不会被显示。

  测试代码:

1 NodeFilter mNodeFilter = new StringFilter("贴吧");
2 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

 

  测试输出结果:

1 getText:贴吧
2 ===================================
View Code

 

  4.3 LinkStringFilter

  这个Filter用于判断链接中是否包含某个特定的字符串,可以用来过滤出指向某个特定网站的链接。

  测试代码:

1 NodeFilter mNodeFilter = new LinkStringFilter("http://tieba.baidu.com/");
2 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

 

  测试输出结果:(此处需要修改html例子的代码,修改后为:【<a href = "http://tieba.baidu.com/" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>】)

1 getText:a href = "http://tieba.baidu.com/" target = _blank title = "欢迎来到&#10百度网站"
2 ===================================
View Code

 

  4.4 其他几个Filter

  其他几个Filter也是根据字符串对不同的域进行判断,与前面这些的区别主要就是支持正则表达式。这个不在本文的讨论范围以内,大家可以自己实验一下。

  HTMLParser遍历了网页的内容以后,以树(森林)结构保存了结果。HTMLParser访问结果内容的方法有两种。使用Filter和使用Visitor。
  下面介绍使用Visitor访问内容的方法。

  5.1 NodeVisitor

  从简单方面的理解,Filter是根据某种条件过滤取出需要的Node再进行处理。Visitor则是遍历内容树的每一个节点,对于符合条件的节点进行处理。实际的结果异曲同工,两种不同的方法可以达到相同的结果。
  下面是一个最常见的NodeVisitro的例子。

  测试代码:

 1     public static void main(String[] args) {
 2         // TODO Auto-generated method stub
 3         try {
 4             
 5             Parser mParser = new Parser(
 6                     (HttpURLConnection) (new URL(
 7                             "http://127.0.0.1/HtmlParser/index.html"))
 8                             .openConnection());
 9             NodeVisitor mNodeVisitor = new NodeVisitor(false, false) {
10                 @Override
11                 public void visitTag(Tag tag) {
12                     // TODO Auto-generated method stub
13                     message("This is Tag:" + tag.getText());
14                 }
15 
16                 @Override
17                 public void visitStringNode(Text string) {
18                     // TODO Auto-generated method stub
19                     message("This is Text:" + string);
20                 }
21 
22                 @Override
23                 public void visitRemarkNode(Remark remark) {
24                     // TODO Auto-generated method stub
25                     message("This is Remark:" + remark.getText());
26                 }
27 
28                 @Override
29                 public void beginParsing() {
30                     // TODO Auto-generated method stub
31                     message("begin Parsing");
32                 }
33 
34                 @Override
35                 public void visitEndTag(Tag tag) {
36                     // TODO Auto-generated method stub
37                     message("visitEndTag:" + tag.getText());
38                 }
39 
40                 @Override
41                 public void finishedParsing() {
42                     // TODO Auto-generated method stub
43                     message("finishedParsing!");
44                 }
45             };
46             mParser.visitAllNodesWith(mNodeVisitor);
47         } catch (Exception e) {
48             // TODO: handle exception
49         }
50     }

  测试输出结果:

1 begin Parsing
2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
3 This is Text:Txt (121[0,121],123[1,0]): \n
4 finishedParsing!
View Code

  可以看到,开始遍历所以的节点以前,beginParsing先被调用,然后处理的是中间的Node,最后在结束遍历以前,finishParsing被调用。因为我设置的 recurseChildren和recurseSelf都是false,所以Visitor没有访问子节点也没有访问根节点的内容。中间输出的两个\n就是我们在前面初始化Parser 中讨论过的最高层的那两个换行。

我们先把recurseSelf设置成true,看看会发生什么。

1 NodeVisitor visitor = new NodeVisitor( false, true) 

   输出结果 :

1 begin Parsing
2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
3 This is Text:Txt (121[0,121],123[1,0]): \n
4 This is Tag:html
5 finishedParsing!
View Code

  可以看到,HTML页面的第一层节点都被调用了。

  我们再用下面的方法调用看看:

1 NodeVisitor mNodeVisitor = new NodeVisitor(true, false)

 

  输出结果:

 1 begin Parsing
 2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
 3 This is Text:Txt (121[0,121],123[1,0]): \n
 4 This is Text:Txt (129[1,6],132[2,1]): \n\t
 5 This is Text:Txt (138[2,7],142[3,2]): \n\t\t
 6 This is Tag:meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/
 7 This is Text:Txt (216[3,76],220[4,2]): \n\t\t
 8 This is Remark:<title>百度</title>
 9 This is Text:Txt (244[4,26],248[5,2]): \n\t\t
10 This is Tag:link href = "a_1.css" rel = "stylesheet" type = "text/css"/
11 This is Text:Txt (309[5,63],312[6,1]): \n\t
12 visitEndTag:/head
13 This is Text:Txt (319[6,8],322[7,1]): \n\t
14 This is Text:Txt (328[7,7],332[8,2]): \n\t\t
15 This is Text:Txt (372[8,42],377[9,3]): \n\t\t\t
16 This is Tag:img src = "../image/baidu.PNG" 
17 This is Text:Txt (410[9,36],414[10,2]): \n\t\t
18 visitEndTag:/div
19 This is Text:Txt (420[10,8],424[11,2]): \n\t\t
20 This is Text:Txt (461[11,39],466[12,3]): \n\t\t\t
21 This is Text:Txt (489[12,26],495[13,4]): \n\t\t\t\t
22 This is Text:Txt (499[13,8],506[14,5]): \n\t\t\t\t\t
23 This is Text:Txt (559[14,58],561[14,60]): 新闻
24 visitEndTag:/a
25 This is Text:Txt (565[14,64],571[15,4]): \n\t\t\t\t
26 visitEndTag:/td
27 This is Text:Txt (576[15,9],582[16,4]): \n\t\t\t\t
28 This is Text:Txt (586[16,8],593[17,5]): \n\t\t\t\t\t
29 This is Tag:font color = "black"
30 This is Text:Txt (615[17,27],617[17,29]): 网页
31 visitEndTag:/font
32 This is Text:Txt (624[17,36],630[18,4]): \n\t\t\t\t
33 visitEndTag:/td
34 This is Text:Txt (635[18,9],641[19,4]): \n\t\t\t\t
35 This is Text:Txt (645[19,8],652[20,5]): \n\t\t\t\t\t
36 This is Text:Txt (727[20,80],729[20,82]): 贴吧
37 visitEndTag:/a
38 This is Text:Txt (733[20,86],739[21,4]): \n\t\t\t\t
39 visitEndTag:/td
40 This is Text:Txt (744[21,9],750[22,4]): \n\t\t\t\t
41 This is Text:Txt (754[22,8],761[23,5]): \n\t\t\t\t\t
42 This is Text:Txt (814[23,58],816[23,60]): 知道
43 visitEndTag:/a
44 This is Text:Txt (820[23,64],826[24,4]): \n\t\t\t\t
45 visitEndTag:/td
46 This is Text:Txt (831[24,9],837[25,4]): \n\t\t\t\t
47 This is Text:Txt (841[25,8],848[26,5]): \n\t\t\t\t\t
48 This is Text:Txt (901[26,58],903[26,60]): 音乐
49 visitEndTag:/a
50 This is Text:Txt (907[26,64],913[27,4]): \n\t\t\t\t
51 visitEndTag:/td
52 This is Text:Txt (918[27,9],924[28,4]): \n\t\t\t\t
53 This is Text:Txt (928[28,8],935[29,5]): \n\t\t\t\t\t
54 This is Text:Txt (988[29,58],990[29,60]): 图片
55 visitEndTag:/a
56 This is Text:Txt (994[29,64],1000[30,4]): \n\t\t\t\t
57 visitEndTag:/td
58 This is Text:Txt (1005[30,9],1011[31,4]): \n\t\t\t\t
59 This is Text:Txt (1015[31,8],1022[32,5]): \n\t\t\t\t\t
60 This is Text:Txt (1075[32,58],1077[32,60]): 视频
61 visitEndTag:/a
62 This is Text:Txt (1081[32,64],1087[33,4]): \n\t\t\t\t
63 visitEndTag:/td
64 This is Text:Txt (1092[33,9],1098[34,4]): \n\t\t\t\t
65 This is Text:Txt (1102[34,8],1109[35,5]): \n\t\t\t\t\t
66 This is Text:Txt (1162[35,58],1164[35,60]): 地图
67 visitEndTag:/a
68 This is Text:Txt (1168[35,64],1174[36,4]): \n\t\t\t\t
69 visitEndTag:/td
70 This is Text:Txt (1179[36,9],1184[37,3]): \n\t\t\t
71 visitEndTag:/table
72 This is Text:Txt (1192[37,11],1197[38,3]): \n\t\t\t
73 This is Tag:input class = "input" 
74 This is Text:Txt (1221[38,27],1225[39,2]): \n\t\t
75 visitEndTag:/div
76 This is Text:Txt (1231[39,8],1234[40,1]): \n\t
77 visitEndTag:/body
78 This is Text:Txt (1241[40,8],1245[42,0]): \n\n
79 visitEndTag:/html
80 finishedParsing!
View Code

 

  可以看到,所有的子节点都出现了,除了刚刚例子里面的两个最上层节点This is Tag:head和This is Tag:html xmlns="http://www.w3.org/1999/xhtml"。

  想让它们都出来,只需要

1 NodeVisitor mNodeVisitor = new NodeVisitor(true, true)

   输出结果:

  1 begin Parsing
  2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
  3 This is Text:Txt (121[0,121],123[1,0]): \n
  4 This is Tag:html
  5 This is Text:Txt (129[1,6],132[2,1]): \n\t
  6 This is Tag:head
  7 This is Text:Txt (138[2,7],142[3,2]): \n\t\t
  8 This is Tag:meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/
  9 This is Text:Txt (216[3,76],220[4,2]): \n\t\t
 10 This is Remark:<title>百度</title>
 11 This is Text:Txt (244[4,26],248[5,2]): \n\t\t
 12 This is Tag:link href = "a_1.css" rel = "stylesheet" type = "text/css"/
 13 This is Text:Txt (309[5,63],312[6,1]): \n\t
 14 visitEndTag:/head
 15 This is Text:Txt (319[6,8],322[7,1]): \n\t
 16 This is Tag:body
 17 This is Text:Txt (328[7,7],332[8,2]): \n\t\t
 18 This is Tag:div  align = "center" class = "photo" 
 19 This is Text:Txt (372[8,42],377[9,3]): \n\t\t\t
 20 This is Tag:img src = "../image/baidu.PNG" 
 21 This is Text:Txt (410[9,36],414[10,2]): \n\t\t
 22 visitEndTag:/div
 23 This is Text:Txt (420[10,8],424[11,2]): \n\t\t
 24 This is Tag:div align = "center" class = "body"
 25 This is Text:Txt (461[11,39],466[12,3]): \n\t\t\t
 26 This is Tag:table cellpadding="8"
 27 This is Text:Txt (489[12,26],495[13,4]): \n\t\t\t\t
 28 This is Tag:td
 29 This is Text:Txt (499[13,8],506[14,5]): \n\t\t\t\t\t
 30 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
 31 This is Text:Txt (559[14,58],561[14,60]): 新闻
 32 visitEndTag:/a
 33 This is Text:Txt (565[14,64],571[15,4]): \n\t\t\t\t
 34 visitEndTag:/td
 35 This is Text:Txt (576[15,9],582[16,4]): \n\t\t\t\t
 36 This is Tag:td
 37 This is Text:Txt (586[16,8],593[17,5]): \n\t\t\t\t\t
 38 This is Tag:font color = "black"
 39 This is Text:Txt (615[17,27],617[17,29]): 网页
 40 visitEndTag:/font
 41 This is Text:Txt (624[17,36],630[18,4]): \n\t\t\t\t
 42 visitEndTag:/td
 43 This is Text:Txt (635[18,9],641[19,4]): \n\t\t\t\t
 44 This is Tag:td
 45 This is Text:Txt (645[19,8],652[20,5]): \n\t\t\t\t\t
 46 This is Tag:a href = "http://tieba.baidu.com/" target = _blank title = "欢迎来到&#10百度网站"
 47 This is Text:Txt (727[20,80],729[20,82]): 贴吧
 48 visitEndTag:/a
 49 This is Text:Txt (733[20,86],739[21,4]): \n\t\t\t\t
 50 visitEndTag:/td
 51 This is Text:Txt (744[21,9],750[22,4]): \n\t\t\t\t
 52 This is Tag:td
 53 This is Text:Txt (754[22,8],761[23,5]): \n\t\t\t\t\t
 54 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
 55 This is Text:Txt (814[23,58],816[23,60]): 知道
 56 visitEndTag:/a
 57 This is Text:Txt (820[23,64],826[24,4]): \n\t\t\t\t
 58 visitEndTag:/td
 59 This is Text:Txt (831[24,9],837[25,4]): \n\t\t\t\t
 60 This is Tag:td
 61 This is Text:Txt (841[25,8],848[26,5]): \n\t\t\t\t\t
 62 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
 63 This is Text:Txt (901[26,58],903[26,60]): 音乐
 64 visitEndTag:/a
 65 This is Text:Txt (907[26,64],913[27,4]): \n\t\t\t\t
 66 visitEndTag:/td
 67 This is Text:Txt (918[27,9],924[28,4]): \n\t\t\t\t
 68 This is Tag:td
 69 This is Text:Txt (928[28,8],935[29,5]): \n\t\t\t\t\t
 70 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
 71 This is Text:Txt (988[29,58],990[29,60]): 图片
 72 visitEndTag:/a
 73 This is Text:Txt (994[29,64],1000[30,4]): \n\t\t\t\t
 74 visitEndTag:/td
 75 This is Text:Txt (1005[30,9],1011[31,4]): \n\t\t\t\t
 76 This is Tag:td
 77 This is Text:Txt (1015[31,8],1022[32,5]): \n\t\t\t\t\t
 78 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
 79 This is Text:Txt (1075[32,58],1077[32,60]): 视频
 80 visitEndTag:/a
 81 This is Text:Txt (1081[32,64],1087[33,4]): \n\t\t\t\t
 82 visitEndTag:/td
 83 This is Text:Txt (1092[33,9],1098[34,4]): \n\t\t\t\t
 84 This is Tag:td
 85 This is Text:Txt (1102[34,8],1109[35,5]): \n\t\t\t\t\t
 86 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
 87 This is Text:Txt (1162[35,58],1164[35,60]): 地图
 88 visitEndTag:/a
 89 This is Text:Txt (1168[35,64],1174[36,4]): \n\t\t\t\t
 90 visitEndTag:/td
 91 This is Text:Txt (1179[36,9],1184[37,3]): \n\t\t\t
 92 visitEndTag:/table
 93 This is Text:Txt (1192[37,11],1197[38,3]): \n\t\t\t
 94 This is Tag:input class = "input" 
 95 This is Text:Txt (1221[38,27],1225[39,2]): \n\t\t
 96 visitEndTag:/div
 97 This is Text:Txt (1231[39,8],1234[40,1]): \n\t
 98 visitEndTag:/body
 99 This is Text:Txt (1241[40,8],1245[42,0]): \n\n
100 visitEndTag:/html
101 finishedParsing!
View Code

  哈哈,这下调用清楚了,大家在需要处理的地方增加自己的代码好了。

 

  5.2 其他Visitor

……

到此,个人感觉与htmlparser的缘分已尽!下一步,进军JSoup!!!

 ===========================参考网址===========================

http://www.blogjava.net/amigoxie/archive/2008/01/18/176200.html

http://www.cnblogs.com/loveyakamoz/archive/2011/07/27/2118937.html

http://blog.csdn.net/witsmakemen/article/details/8778979

 ===========================参考网址===========================

你可能感兴趣的:(HtmlParser)