使用Heritrix抓取到自己所需的网页后,还需要对网页中的内容进行分类等操作,这个时候就需要用到htmlparser,但是使用htmlparser并不是那么容易!因为相关的文档比较少,很多更能需要开发者自己去摸索,去发掘!
不过这里给大家提供一个比较好的网站(htmlparser的API):http://tool.oschina.net/apidocs/apidoc?api=HTMLParser,这个API是英文版的,英语不好的这时就要逼迫自己看下去了。
HTMLParser的核心模块是org.htmlparser.Parser类,这个类实际完成了对于HTML页面的分析工作。这个类有下面几个构造函数:
public Parser (); public Parser (Lexer lexer, ParserFeedback fb); public Parser (URLConnection connection, ParserFeedback fb) throws ParserException; public Parser (String resource, ParserFeedback feedback) throws ParserException; public Parser (String resource) throws ParserException; public Parser (Lexer lexer); public Parser (URLConnection connection) throws ParserException;
和一个静态类
public static Parser createParser (String html, String charset);
对于大多数使用者来说,使用最多的是通过一个URLConnection或者一个保存有网页内容的字符串来初始化Parser,或者使用静态函数来生成一个Parser对象。ParserFeedback的代码很简单,是针对调试和跟踪分析过程的,一般不需要改变。而使用Lexer则是一个相对比较高级的话题,放到以后再讨论吧。
这里比较有趣的一点是,如果需要设置页面的编码方式的话,不使用Lexer就只有静态函数一个方法了。对于大多数中文页面来说,好像这是应该用得比较多的一个方法。
下面是初始化Parser的例子(通过打开一个网页的URL,中间的OpenFile方法是在打开一个本地的html文件时使用的)。
【加载的网页文件:index.html】
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html> <head> <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/> <title>百度</title> <link href = "a_1.css" rel = "stylesheet" type = "text/css"/> </head> <body> <div align = "center" class = "photo" > <img src = "../image/baidu.PNG" > </div> <div align = "center" class = "body"> <table cellpadding="8"> <td> <a href = "#" target = _blank title = "欢迎来到
百度网站">新闻</a> </td> <td> <font color = "black">网页</font> </td> <td> <a href = "#" target = _blank title = "欢迎来到
百度网站">贴吧</a> </td> <td> <a href = "#" target = _blank title = "欢迎来到
百度网站">知道</a> </td> <td> <a href = "#" target = _blank title = "欢迎来到
百度网站">音乐</a> </td> <td> <a href = "#" target = _blank title = "欢迎来到
百度网站">图片</a> </td> <td> <a href = "#" target = _blank title = "欢迎来到
百度网站">视频</a> </td> <td> <a href = "#" target = _blank title = "欢迎来到
百度网站">地图</a> </td> </table> <input class = "input" > </div> </body> </html>
【源码:htmlparser_1.java】
1 import java.io.BufferedReader; 2 import java.io.File; 3 import java.io.FileInputStream; 4 import java.io.InputStreamReader; 5 import java.net.HttpURLConnection; 6 import java.net.URL; 7 import org.htmlparser.Parser; 8 import org.htmlparser.visitors.TextExtractingVisitor; 9 10 public class Main { 11 private static String ENCODE = "GBK"; 12 private static void message(String msg) { 13 // TODO Auto-generated method stub 14 try { 15 System.out.println(new String(msg.getBytes(ENCODE), System 16 .getProperty("file.encoding"))); 17 } catch (Exception e) { 18 // TODO: handle exception 19 e.printStackTrace(); 20 } 21 } 22 23 /* 24 * 打开一个文件 25 */ 26 public static String OpenFile(String FileName) { 27 try { 28 File mFile = new File(FileName); 29 FileInputStream mFileInputStream = new FileInputStream(mFile); 30 InputStreamReader mInputStreamReader = new InputStreamReader( 31 mFileInputStream, ENCODE); 32 BufferedReader mBufferedReader = new BufferedReader( 33 mInputStreamReader); 34 String mContent = ""; 35 String mTemp = ""; 36 while ((mTemp = mBufferedReader.readLine()) != null) { 37 mContent += mTemp + "\n"; 38 } 39 mBufferedReader.close(); 40 } catch (Exception e) { 41 // TODO: handle exception 42 e.printStackTrace(); 43 return ""; 44 } 45 return FileName; 46 } 47 48 /* 49 * main方法 50 */ 51 public static void main(String[] args) { 52 // String mContent=OpenFile(""); 53 try { 54 Parser mParser = new Parser((HttpURLConnection) (new URL( 55 "http://127.0.0.1/HtmlParser/index.html")).openConnection()); 56 TextExtractingVisitor mExtractingVisitor = new TextExtractingVisitor(); 57 mParser.visitAllNodesWith(mExtractingVisitor); 58 String textInPage = mExtractingVisitor.getExtractedText(); 59 message(textInPage); 60 } catch (Exception e) { 61 // TODO: handle exception 62 e.printStackTrace(); 63 } 64 } 65 66 }
测试输出结果:
1 2 3 百度 4 5 6 7 8 9 10 11 12 13 新闻 14 15 16 网页 17 18 19 贴吧 20 21 22 知道 23 24 25 音乐 26 27 28 图片 29 30 31 视频 32 33 34 地图 35 36 37 38 39
HTMLParser将解析过的信息保存为一个树的结构。Node是信息保存的数据类型基础。
请看Node的定义:
public interface Node extends Cloneable;
Node中包含的方法有几类:
对于树型结构进行遍历的函数,这些函数最容易理解:
Node getParent ():取得父节点
NodeList getChildren ():取得子节点的列表
Node getFirstChild ():取得第一个子节点
Node getLastChild ():取得最后一个子节点
Node getPreviousSibling ():取得前一个兄弟(不好意思,英文是兄弟姐妹,直译太麻烦而且不符合习惯,对不起女同胞了)
Node getNextSibling ():取得下一个兄弟节点
取得Node内容的函数:
String getText ():取得文本 String toPlainTextString():取得纯文本信息。 String toHtml () :取得HTML信息(原始HTML) String toHtml (boolean verbatim):取得HTML信息(原始HTML) String toString ():取得字符串信息(原始HTML) Page getPage ():取得这个Node对应的Page对象 int getStartPosition ():取得这个Node在HTML页面中的起始位置 int getEndPosition ():取得这个Node在HTML页面中的结束位置
用于Filter过滤的函数:
void collectInto (NodeList list, NodeFilter filter):基于filter的条件对于这个节点进行过滤,符合条件的节点放到list中。
用于Visitor遍历的函数:
void accept (NodeVisitor visitor):对这个Node应用visitor
用于修改内容的函数,这类用得比较少:
void setPage (Page page):设置这个Node对应的Page对象 void setText (String text):设置文本 void setChildren (NodeList children):设置子节点列表
其他函数:
void doSemanticAction (): 执行这个Node对应的操作(只有少数Tag有对应的操作) Object clone (): 接口Clone的抽象函数。
实际我们用HTMLParser最多的是处理HTML页面,Filter或Visitor相关的函数是必须的,然后第一类和第二类函数是用得最多的。第一类函数比较容易理解,下面用例子说明一下第二类函数。
【源码:htmlparser_2.java】
1 import java.io.BufferedReader; 2 import java.io.File; 3 import java.io.FileInputStream; 4 import java.io.InputStreamReader; 5 import java.net.HttpURLConnection; 6 import java.net.URL; 7 import org.htmlparser.Node; 8 import org.htmlparser.Parser; 9 import org.htmlparser.util.NodeIterator; 10 import org.htmlparser.visitors.TextExtractingVisitor; 11 import org.omg.CosNaming.NamingContextPackage.NotEmpty; 12 13 public class Main { 14 private static String ENCODE = "utf-8"; 15 private static void message(String msg) { 16 // TODO Auto-generated method stub 17 try { 18 System.out.println(new String(msg.getBytes(ENCODE), System 19 .getProperty("file.encoding"))); 20 } catch (Exception e) { 21 // TODO: handle exception 22 e.printStackTrace(); 23 } 24 } 25 26 /* 27 * 打开一个文件 28 */ 29 public static String OpenFile(String FileName) { 30 try { 31 File mFile = new File(FileName); 32 FileInputStream mFileInputStream = new FileInputStream(mFile); 33 InputStreamReader mInputStreamReader = new InputStreamReader( 34 mFileInputStream, ENCODE); 35 BufferedReader mBufferedReader = new BufferedReader( 36 mInputStreamReader); 37 String mContent = ""; 38 String mTemp = ""; 39 while ((mTemp = mBufferedReader.readLine()) != null) { 40 mContent += mTemp + "\n"; 41 } 42 mBufferedReader.close(); 43 } catch (Exception e) { 44 // TODO: handle exception 45 e.printStackTrace(); 46 return ""; 47 } 48 return FileName; 49 } 50 51 /* 52 * main方法 53 */ 54 public static void main(String[] args) { 55 // String mContent=OpenFile(""); 56 try { 57 Parser mParser = new Parser((HttpURLConnection) (new URL( 58 "http://127.0.0.1/HtmlParser/index.html")).openConnection()); 59 // TextExtractingVisitor mExtractingVisitor = new TextExtractingVisitor(); 60 // mParser.visitAllNodesWith(mExtractingVisitor); 61 // String textInPage = mExtractingVisitor.getExtractedText(); 62 // message(textInPage); 63 64 for (NodeIterator i = mParser.elements(); i.hasMoreNodes();) { 65 Node node = i.nextNode(); 66 message("getText:"+node.getText()); 67 message("getPlainText:"+node.toPlainTextString()); 68 message("toHtml:"+node.toHtml()); 69 message("toHtml(true):"+node.toHtml(true)); 70 message("tohtml(false):"+node.toHtml(false)); 71 message("toString:"+node.toString()); 72 message("=============================="); 73 } 74 } catch (Exception e) { 75 // TODO: handle exception 76 e.printStackTrace(); 77 } 78 } 79 }
测试输出结果:
1 getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" 2 getPlainText: 3 toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 4 toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 5 tohtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 6 toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd; begins at : 0; ends at : 121 7 ============================== 8 getText: 9 10 getPlainText: 11 12 toHtml: 13 14 toHtml(true): 15 16 tohtml(false): 17 18 toString:Txt (121[0,121],123[1,0]): \n 19 ============================== 20 getText:html 21 getPlainText: 22 23 24 百度 25 26 27 28 29 30 31 32 33 34 新闻 35 36 37 网页 38 39 40 贴吧 41 42 43 知道 44 45 46 音乐 47 48 49 图片 50 51 52 视频 53 54 55 地图 56 57 58 59 60 61 62 63 toHtml:<html> 64 <head> 65 <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/> 66 <title>百度</title> 67 <link href = "a_1.css" rel = "stylesheet" type = "text/css"/> 68 </head> 69 <body> 70 <div align = "center" class = "photo" > 71 <img src = "../image/baidu.PNG" > 72 </div> 73 <div align = "center" class = "body"> 74 <table cellpadding="8"> 75 <td> 76 <a href = "#" target = _blank title = "欢迎来到
百度网站">新闻</a> 77 </td> 78 <td> 79 <font color = "black">网页</font> 80 </td> 81 <td> 82 <a href = "#" target = _blank title = "欢迎来到
百度网站">贴吧</a> 83 </td> 84 <td> 85 <a href = "#" target = _blank title = "欢迎来到
百度网站">知道</a> 86 </td> 87 <td> 88 <a href = "#" target = _blank title = "欢迎来到
百度网站">音乐</a> 89 </td> 90 <td> 91 <a href = "#" target = _blank title = "欢迎来到
百度网站">图片</a> 92 </td> 93 <td> 94 <a href = "#" target = _blank title = "欢迎来到
百度网站">视频</a> 95 </td> 96 <td> 97 <a href = "#" target = _blank title = "欢迎来到
百度网站">地图</a> 98 </td> 99 </table> 100 <input class = "input" > 101 </div> 102 </body> 103 104 </html> 105 toHtml(true):<html> 106 <head> 107 <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/> 108 <title>百度</title> 109 <link href = "a_1.css" rel = "stylesheet" type = "text/css"/> 110 </head> 111 <body> 112 <div align = "center" class = "photo" > 113 <img src = "../image/baidu.PNG" > 114 </div> 115 <div align = "center" class = "body"> 116 <table cellpadding="8"> 117 <td> 118 <a href = "#" target = _blank title = "欢迎来到
百度网站">新闻</a> 119 </td> 120 <td> 121 <font color = "black">网页</font> 122 </td> 123 <td> 124 <a href = "#" target = _blank title = "欢迎来到
百度网站">贴吧</a> 125 </td> 126 <td> 127 <a href = "#" target = _blank title = "欢迎来到
百度网站">知道</a> 128 </td> 129 <td> 130 <a href = "#" target = _blank title = "欢迎来到
百度网站">音乐</a> 131 </td> 132 <td> 133 <a href = "#" target = _blank title = "欢迎来到
百度网站">图片</a> 134 </td> 135 <td> 136 <a href = "#" target = _blank title = "欢迎来到
百度网站">视频</a> 137 </td> 138 <td> 139 <a href = "#" target = _blank title = "欢迎来到
百度网站">地图</a> 140 </td> 141 </table> 142 <input class = "input" > 143 </div> 144 </body> 145 146 </html> 147 tohtml(false):<html> 148 <head> 149 <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/> 150 <title>百度</title> 151 <link href = "a_1.css" rel = "stylesheet" type = "text/css"/> 152 </head> 153 <body> 154 <div align = "center" class = "photo" > 155 <img src = "../image/baidu.PNG" > 156 </div> 157 <div align = "center" class = "body"> 158 <table cellpadding="8"> 159 <td> 160 <a href = "#" target = _blank title = "欢迎来到
百度网站">新闻</a> 161 </td> 162 <td> 163 <font color = "black">网页</font> 164 </td> 165 <td> 166 <a href = "#" target = _blank title = "欢迎来到
百度网站">贴吧</a> 167 </td> 168 <td> 169 <a href = "#" target = _blank title = "欢迎来到
百度网站">知道</a> 170 </td> 171 <td> 172 <a href = "#" target = _blank title = "欢迎来到
百度网站">音乐</a> 173 </td> 174 <td> 175 <a href = "#" target = _blank title = "欢迎来到
百度网站">图片</a> 176 </td> 177 <td> 178 <a href = "#" target = _blank title = "欢迎来到
百度网站">视频</a> 179 </td> 180 <td> 181 <a href = "#" target = _blank title = "欢迎来到
百度网站">地图</a> 182 </td> 183 </table> 184 <input class = "input" > 185 </div> 186 </body> 187 188 </html> 189 toString:Tag (123[1,0],129[1,6]): html 190 Txt (129[1,6],132[2,1]): \n\t 191 Tag (132[2,1],138[2,7]): head 192 Txt (138[2,7],142[3,2]): \n\t\t 193 Tag (142[3,2],216[3,76]): meta http-equiv = "Content-Type" content = "text/ht... 194 Txt (216[3,76],220[4,2]): \n\t\t 195 Tag (220[4,2],227[4,9]): title 196 Txt (227[4,9],229[4,11]): 百度 197 End (229[4,11],237[4,19]): /title 198 Txt (237[4,19],241[5,2]): \n\t\t 199 Tag (241[5,2],302[5,63]): link href = "a_1.css" rel = "stylesheet" type = "te... 200 Txt (302[5,63],305[6,1]): \n\t 201 End (305[6,1],312[6,8]): /head 202 Txt (312[6,8],315[7,1]): \n\t 203 Tag (315[7,1],321[7,7]): body 204 Txt (321[7,7],325[8,2]): \n\t\t 205 Tag (325[8,2],365[8,42]): div align = "center" class = "photo" 206 Txt (365[8,42],370[9,3]): \n\t\t\t 207 Tag (370[9,3],403[9,36]): img src = "../image/baidu.PNG" 208 Txt (403[9,36],407[10,2]): \n\t\t 209 End (407[10,2],413[10,8]): /div 210 Txt (413[10,8],417[11,2]): \n\t\t 211 Tag (417[11,2],454[11,39]): div align = "center" class = "body" 212 Txt (454[11,39],459[12,3]): \n\t\t\t 213 Tag (459[12,3],482[12,26]): table cellpadding="8" 214 Txt (482[12,26],488[13,4]): \n\t\t\t\t 215 Tag (488[13,4],492[13,8]): td 216 Txt (492[13,8],499[14,5]): \n\t\t\t\t\t 217 Tag (499[14,5],552[14,58]): a href = "#" target = _blank title = "欢迎来到
百度网站" 218 Txt (552[14,58],554[14,60]): 新闻 219 End (554[14,60],558[14,64]): /a 220 Txt (558[14,64],564[15,4]): \n\t\t\t\t 221 End (564[15,4],569[15,9]): /td 222 Txt (569[15,9],575[16,4]): \n\t\t\t\t 223 Tag (575[16,4],579[16,8]): td 224 Txt (579[16,8],586[17,5]): \n\t\t\t\t\t 225 Tag (586[17,5],608[17,27]): font color = "black" 226 Txt (608[17,27],610[17,29]): 网页 227 End (610[17,29],617[17,36]): /font 228 Txt (617[17,36],623[18,4]): \n\t\t\t\t 229 End (623[18,4],628[18,9]): /td 230 Txt (628[18,9],634[19,4]): \n\t\t\t\t 231 Tag (634[19,4],638[19,8]): td 232 Txt (638[19,8],645[20,5]): \n\t\t\t\t\t 233 Tag (645[20,5],698[20,58]): a href = "#" target = _blank title = "欢迎来到
百度网站" 234 Txt (698[20,58],700[20,60]): 贴吧 235 End (700[20,60],704[20,64]): /a 236 Txt (704[20,64],710[21,4]): \n\t\t\t\t 237 End (710[21,4],715[21,9]): /td 238 Txt (715[21,9],721[22,4]): \n\t\t\t\t 239 Tag (721[22,4],725[22,8]): td 240 Txt (725[22,8],732[23,5]): \n\t\t\t\t\t 241 Tag (732[23,5],785[23,58]): a href = "#" target = _blank title = "欢迎来到
百度网站" 242 Txt (785[23,58],787[23,60]): 知道 243 End (787[23,60],791[23,64]): /a 244 Txt (791[23,64],797[24,4]): \n\t\t\t\t 245 End (797[24,4],802[24,9]): /td 246 Txt (802[24,9],808[25,4]): \n\t\t\t\t 247 Tag (808[25,4],812[25,8]): td 248 Txt (812[25,8],819[26,5]): \n\t\t\t\t\t 249 Tag (819[26,5],872[26,58]): a href = "#" target = _blank title = "欢迎来到
百度网站" 250 Txt (872[26,58],874[26,60]): 音乐 251 End (874[26,60],878[26,64]): /a 252 Txt (878[26,64],884[27,4]): \n\t\t\t\t 253 End (884[27,4],889[27,9]): /td 254 Txt (889[27,9],895[28,4]): \n\t\t\t\t 255 Tag (895[28,4],899[28,8]): td 256 Txt (899[28,8],906[29,5]): \n\t\t\t\t\t 257 Tag (906[29,5],959[29,58]): a href = "#" target = _blank title = "欢迎来到
百度网站" 258 Txt (959[29,58],961[29,60]): 图片 259 End (961[29,60],965[29,64]): /a 260 Txt (965[29,64],971[30,4]): \n\t\t\t\t 261 End (971[30,4],976[30,9]): /td 262 Txt (976[30,9],982[31,4]): \n\t\t\t\t 263 Tag (982[31,4],986[31,8]): td 264 Txt (986[31,8],993[32,5]): \n\t\t\t\t\t 265 Tag (993[32,5],1046[32,58]): a href = "#" target = _blank title = "欢迎来到
百度网站" 266 Txt (1046[32,58],1048[32,60]): 视频 267 End (1048[32,60],1052[32,64]): /a 268 Txt (1052[32,64],1058[33,4]): \n\t\t\t\t 269 End (1058[33,4],1063[33,9]): /td 270 Txt (1063[33,9],1069[34,4]): \n\t\t\t\t 271 Tag (1069[34,4],1073[34,8]): td 272 Txt (1073[34,8],1080[35,5]): \n\t\t\t\t\t 273 Tag (1080[35,5],1133[35,58]): a href = "#" target = _blank title = "欢迎来到
百... 274 Txt (1133[35,58],1135[35,60]): 地图 275 End (1135[35,60],1139[35,64]): /a 276 Txt (1139[35,64],1145[36,4]): \n\t\t\t\t 277 End (1145[36,4],1150[36,9]): /td 278 Txt (1150[36,9],1155[37,3]): \n\t\t\t 279 End (1155[37,3],1163[37,11]): /table 280 Txt (1163[37,11],1168[38,3]): \n\t\t\t 281 Tag (1168[38,3],1192[38,27]): input class = "input" 282 Txt (1192[38,27],1196[39,2]): \n\t\t 283 End (1196[39,2],1202[39,8]): /div 284 Txt (1202[39,8],1205[40,1]): \n\t 285 End (1205[40,1],1212[40,8]): /body 286 Txt (1212[40,8],1216[42,0]): \n\n 287 End (1216[42,0],1223[42,7]): /html 288 289 ==============================
对于第一个Node的内容,对应的就是第一行<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">,从这个输出结果中,也可以看出内容的树状结构。或者说是树林结构。在Page内容的第一层Tag,如DOCTYPE,head和html,分别形成了一个最高层的Node节点(很多人可能对第二个和第四个Node的内容有点奇怪。实际上这两个Node就是两个换行符号。HTMLParser把HTML页面内容中的所有换行,空格,Tab等都转换成了相应的Tag,所以就出现了这样的Node。虽然内容少但是级别高,呵呵)
getPlainTextString是把用户可以看到的内容都包含了。有趣的有两点,一是<head>标签中的Title内容是在plainText中的,可能在标题中可见的也算可见吧。另外就是象前面说的,HTML内容中的换行符什么的,也都成了plainText,这个逻辑上好像有点问题。
另外可能大家发现toHtml,toHtml(true)和toHtml(false)的结果没什么区别。实际也是这样的,如果跟踪HTMLParser的代码就可以发现,Node的子类是AbstractNode,其中实现了toHtml()的代码,直接调用toHtml(false),而AbstractNode的三个子类RemarkNode,TagNode和TextNode中,toHtml(boolean verbatim)的实现中,都没有处理verbatim参数,所以三个函数的结果是一模一样的。如果你不需要实现你自己的什么特殊处理,简单使用toHtml就可以了。
HTML的Node类继承关系如下图(这个是从别的文章Copy的)
他被组织成三棵树的森林,其中以<html>标签为根节点的树高度最大,网页的树状结构图如下:
html树中要特别注意的是每一个回车换行,HTMLParser会将他们看做一个节点处理。
AbstractNodes是Node的直接子类,也是一个抽象类。它的三个直接子类实现是RemarkNode,用于保存注释。在输出结果的toString部分中可以看到有一个"Rem (345[6,2],356[6,13]): 这是注释",就是一个RemarkNode。TextNode也很简单,就是用户可见的文字信息。TagNode是最复杂的,包含了HTML语言中的所有标签,而且可以扩展(扩展 HTMLParser 对自定义标签的处理能力)。TagNode包含两类,一类是简单的Tag,实际就是不能包含其他Tag的标签,只能做叶子节点。另一类是CompositeTag,就是可以包含其他Tag,是分支节点
HTMLParser遍历了网页的内容以后,以树(森林)结构保存了结果。HTMLParser访问结果内容的方法有两种。使用Filter和使用Visitor。
(一)Filter类
顾名思义,Filter就是对于结果进行过滤,取得需要的内容。HTMLParser在org.htmlparser.filters包之内一共定义了16个不同的Filter,也可以分为几类。
判断类Filter:
TagNameFilter
HasAttributeFilter
HasChildFilter
HasParentFilter
HasSiblingFilter
IsEqualFilter
逻辑运算Filter:
AndFilter
NotFilter
OrFilter
XorFilter
其他Filter:
NodeClassFilter
StringFilter
LinkStringFilter
LinkRegexFilter
RegexFilter
CssSelectorNodeFilter
所有的Filter类都实现了org.htmlparser.NodeFilter接口。这个接口只有一个主要函数:boolean accept (Node node);
各个子类分别实现这个函数,用于判断输入的Node是否符合这个Filter的过滤条件,如果符合,返回true,否则返回false。
(二)判断类Filter
2.1 TagNameFilter
TabNameFilter是最容易理解的一个Filter,根据Tag的名字进行过滤。
【源码:htmlparser_3.java】(此处只给出main方法的代码,其余代码同上)
1 /* 2 * main方法 3 */ 4 public static void main(String[] args) { 5 // String mContent=OpenFile(""); 6 try { 7 Parser mParser = new Parser((HttpURLConnection) (new URL( 8 "http://127.0.0.1/HtmlParser/index.html")).openConnection()); 9 10 // TextExtractingVisitor mExtractingVisitor = new TextExtractingVisitor(); 11 // mParser.visitAllNodesWith(mExtractingVisitor); 12 // String textInPage = mExtractingVisitor.getExtractedText(); 13 // message(textInPage); 14 15 // for (NodeIterator i = mParser.elements(); i.hasMoreNodes();) { 16 // Node node = i.nextNode(); 17 // message("getText:"+node.getText()); 18 // message("getPlainText:"+node.toPlainTextString()); 19 // message("toHtml:"+node.toHtml()); 20 // message("toHtml(true):"+node.toHtml(true)); 21 // message("tohtml(false):"+node.toHtml(false)); 22 // message("toString:"+node.toString()); 23 // message("=============================="); 24 // } 25 26 NodeFilter mNodeFilter = new TagNameFilter("DIV"); 27 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter); 28 if (mNodeFilter!=null) { 29 for (int i = 0; i < mNodeList.size(); i++) { 30 Node textNode = (Node)mNodeList.elementAt(i); 31 message("getText:"+textNode.getText()); 32 message("==================================="); 33 } 34 } 35 36 } catch (Exception e) { 37 // TODO: handle exception 38 e.printStackTrace(); 39 } 40 }
测试输出结果:
1 getText:div align = "center" class = "photo" 2 =================================== 3 getText:div align = "center" class = "body" 4 ===================================
可以看出文件中两个Div节点都被取出了。下面可以针对这两个DIV节点进行操作。
2.2 HasChildFilter
下面让我们看看HasChildFilter。刚刚看到这个Filter的时候,我想当然地认为这个Filter返回的是有Child的Tag。直接初始化了一个
NodeFilter filter = new HasChildFilter();
结果调用NodeList nodes = parser.extractAllNodesThatMatch(filter);的时候HasChildFilter内部直接发生NullPointerException。读了一下HasChildFilter的代码,才发现,实际HasChildFilter是返回有符合条件的子节点的节点,需要另外一个Filter作为过滤子节点的参数。缺省的构造函数虽然可以初始化,但是由于子节点的Filter是null,所以使用的时候发生了Exception。从这点来看,HTMLParser的代码还有很多可以优化的的地方。呵呵。
修改代码:
1 /* 2 * main方法 3 */ 4 public static void main(String[] args) { 5 // String mContent=OpenFile(""); 6 try { 7 Parser mParser = new Parser((HttpURLConnection) (new URL( 8 "http://127.0.0.1/HtmlParser/index.html")).openConnection()); 9 NodeFilter mInnerFilter = new TagNameFilter("DIV"); 10 NodeFilter mNodeFilter = new HasChildFilter(mInnerFilter); 11 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter); 12 if (mNodeFilter!=null) { 13 for (int i = 0; i < mNodeList.size(); i++) { 14 Node textNode = (Node)mNodeList.elementAt(i); 15 message("getText:"+textNode.getText()); 16 message("==================================="); 17 } 18 } 19 20 } catch (Exception e) { 21 // TODO: handle exception 22 e.printStackTrace(); 23 } 24 }
测试输出结果:
1 getText:body 2 ===================================
在此处可以看到,输出的是含有DIV子Tag的Tag节点。(body有子节点DIV“<div align = "center" class = "photo" >”)
注意HasChildFilter还有一个构造函数:public HasChildFilter (NodeFilter filter, boolean recursive)
如果recursive是false,则只对第一级子节点进行过滤。比如前面的例子,body在第一级的子节点里就有DIV节点,所以匹配上了。如果我们用下面的方法调用:
NodeFilter filter = new HasChildFilter( innerFilter, true );
测试输出结果:
1 getText:html 2 =================================== 3 getText:body 4 ===================================
可以看到输出结果中多了一个html ,这个是整个HTML页面的节点(根节点),虽然这个节点下直接没有DIV节点,但是它的子节点body下面有DIV节点,所以它也被匹配上了。
2.3 HasAttributeFilter
HasAttributeFilter有3个构造函数:
public HasAttributeFilter ();
public HasAttributeFilter (String attribute);
public HasAttributeFilter (String attribute, String value);
这个Filter可以匹配出包含制定名字的属性,或者制定属性为指定值的节点。还是用例子说明比较容易。
调用方法1:
1 NodeFilter mNodeFilter = new HasAttributeFilter(); 2 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
输出结果:
什么也没有输出
调用方法2:
1 NodeFilter mNodeFilter = new HasAttributeFilter("class"); 2 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
输出结果:
1 getText:div align = "center" class = "photo" 2 =================================== 3 getText:div align = "center" class = "body" 4 =================================== 5 getText:input class = "input" 6 ===================================
调用方法3:
1 NodeFilter mNodeFilter = new HasAttributeFilter("class","photo"); 2 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
输出结果:
1 getText:div align = "center" class = "photo" 2 ===================================
2.4 其他判断列Filter
HasParentFilter和HasSiblingFilter的功能与HasChildFilter类似,大家自己试一下就应该了解了。
IsEqualFilter的构造函数参数是一个Node:
public IsEqualFilter (Node node) {
mNode = node;
}
accept函数也很简单:
public boolean accept (Node node) {
return (mNode == node);
}
不需要过多说明了。
(三)逻辑运算Filter
前面介绍的都是简单的Filter,只能针对某种单一类型的条件进行过滤。HTMLParser支持对于简单类型的Filter进行组合,从而实现复杂的条件。原理和一般编程语言的逻辑运算是一样的。
3.1 AndFilter
AndFilter可以把两种Filter进行组合,只有同时满足条件的Node才会被过滤。
测试代码:
1 NodeFilter mNodeFilterLeft = new HasAttributeFilter("class"); 2 NodeFilter mNodeFilterRight = new HasAttributeFilter("align"); 3 NodeFilter mNodeFilter = new AndFilter(mNodeFilterLeft, mNodeFilterRight); 4 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
测试输出结果:
1 getText:div align = "center" class = "photo" 2 =================================== 3 getText:div align = "center" class = "body" 4 ===================================
3.2 OrFilter
把前面的AndFilter换成OrFilter
测试代码:
1 NodeFilter mNodeFilterLeft = new HasAttributeFilter("class"); 2 NodeFilter mNodeFilterRight = new HasAttributeFilter("align"); 3 NodeFilter mNodeFilter = new OrFilter(mNodeFilterLeft, mNodeFilterRight); 4 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
测试输出结果:
1 getText:div align = "center" class = "photo" 2 =================================== 3 getText:div align = "center" class = "body" 4 =================================== 5 getText:input class = "input" 6 ===================================
3.3 NotFilter
把前面的AndFilter换成NotFilter
测试代码:
1 NodeFilter mNodeFilterLeft = new HasAttributeFilter("class"); 2 NodeFilter mNodeFilterRight = new HasAttributeFilter("align"); 3 NodeFilter mNodeFilter = new NotFilter(new OrFilter(mNodeFilterLeft,mNodeFilterRight)); 4 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
测试输出结果:
1 getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" 2 =================================== 3 getText: 4 5 =================================== 6 getText:html 7 =================================== 8 getText: 9 10 =================================== 11 getText:head 12 =================================== 13 getText: 14 15 =================================== 16 getText:meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/ 17 =================================== 18 getText: 19 20 =================================== 21 getText:title 22 =================================== 23 getText:百度 24 =================================== 25 getText:/title 26 =================================== 27 getText: 28 29 =================================== 30 getText:link href = "a_1.css" rel = "stylesheet" type = "text/css"/ 31 =================================== 32 getText: 33 34 =================================== 35 getText:/head 36 =================================== 37 getText: 38 39 =================================== 40 getText:body 41 =================================== 42 getText: 43 44 =================================== 45 getText: 46 47 =================================== 48 getText:img src = "../image/baidu.PNG" 49 =================================== 50 getText: 51 52 =================================== 53 getText:/div 54 =================================== 55 getText: 56 57 =================================== 58 getText: 59 60 =================================== 61 getText:table cellpadding="8" 62 =================================== 63 getText: 64 65 =================================== 66 getText:td 67 =================================== 68 getText: 69 70 =================================== 71 getText:a href = "#" target = _blank title = "欢迎来到
百度网站" 72 =================================== 73 getText:新闻 74 =================================== 75 getText:/a 76 =================================== 77 getText: 78 79 =================================== 80 getText:/td 81 =================================== 82 getText: 83 84 =================================== 85 getText:td 86 =================================== 87 getText: 88 89 =================================== 90 getText:font color = "black" 91 =================================== 92 getText:网页 93 =================================== 94 getText:/font 95 =================================== 96 getText: 97 98 =================================== 99 getText:/td 100 =================================== 101 getText: 102 103 =================================== 104 getText:td 105 =================================== 106 getText: 107 108 =================================== 109 getText:a href = "#" target = _blank title = "欢迎来到
百度网站" 110 =================================== 111 getText:贴吧 112 =================================== 113 getText:/a 114 =================================== 115 getText: 116 117 =================================== 118 getText:/td 119 =================================== 120 getText: 121 122 =================================== 123 getText:td 124 =================================== 125 getText: 126 127 =================================== 128 getText:a href = "#" target = _blank title = "欢迎来到
百度网站" 129 =================================== 130 getText:知道 131 =================================== 132 getText:/a 133 =================================== 134 getText: 135 136 =================================== 137 getText:/td 138 =================================== 139 getText: 140 141 =================================== 142 getText:td 143 =================================== 144 getText: 145 146 =================================== 147 getText:a href = "#" target = _blank title = "欢迎来到
百度网站" 148 =================================== 149 getText:音乐 150 =================================== 151 getText:/a 152 =================================== 153 getText: 154 155 =================================== 156 getText:/td 157 =================================== 158 getText: 159 160 =================================== 161 getText:td 162 =================================== 163 getText: 164 165 =================================== 166 getText:a href = "#" target = _blank title = "欢迎来到
百度网站" 167 =================================== 168 getText:图片 169 =================================== 170 getText:/a 171 =================================== 172 getText: 173 174 =================================== 175 getText:/td 176 =================================== 177 getText: 178 179 =================================== 180 getText:td 181 =================================== 182 getText: 183 184 =================================== 185 getText:a href = "#" target = _blank title = "欢迎来到
百度网站" 186 =================================== 187 getText:视频 188 =================================== 189 getText:/a 190 =================================== 191 getText: 192 193 =================================== 194 getText:/td 195 =================================== 196 getText: 197 198 =================================== 199 getText:td 200 =================================== 201 getText: 202 203 =================================== 204 getText:a href = "#" target = _blank title = "欢迎来到
百度网站" 205 =================================== 206 getText:地图 207 =================================== 208 getText:/a 209 =================================== 210 getText: 211 212 =================================== 213 getText:/td 214 =================================== 215 getText: 216 217 =================================== 218 getText:/table 219 =================================== 220 getText: 221 222 =================================== 223 getText: 224 225 =================================== 226 getText:/div 227 =================================== 228 getText: 229 230 =================================== 231 getText:/body 232 =================================== 233 getText: 234 235 236 =================================== 237 getText:/html 238 ===================================
3.4 XorFilter(暂未实现)
把前面的AndFilter换成NotFilter
测试代码:……
测试输出结果:……
(四)其他Filter:
4.1 NodeClassFilter
这个Filter用于判断节点类型是否是某个特定的Node类型。在上面中我们已经了解了Node的不同类型,这个Filter就可以针对类型进行过滤。
测试代码:
测试输出结果:
4.2 StringFilter
这个Filter用于过滤显示字符串中包含制定内容的Tag。注意是可显示的字符串,不可显示的字符串中的内容(例如注释,链接等等)不会被显示。
测试代码:
1 NodeFilter mNodeFilter = new StringFilter("贴吧"); 2 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
测试输出结果:
1 getText:贴吧 2 ===================================
4.3 LinkStringFilter
这个Filter用于判断链接中是否包含某个特定的字符串,可以用来过滤出指向某个特定网站的链接。
测试代码:
1 NodeFilter mNodeFilter = new LinkStringFilter("http://tieba.baidu.com/"); 2 NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
测试输出结果:(此处需要修改html例子的代码,修改后为:【<a href = "http://tieba.baidu.com/" target = _blank title = "欢迎来到
百度网站">贴吧</a>】)
1 getText:a href = "http://tieba.baidu.com/" target = _blank title = "欢迎来到
百度网站" 2 ===================================
4.4 其他几个Filter
其他几个Filter也是根据字符串对不同的域进行判断,与前面这些的区别主要就是支持正则表达式。这个不在本文的讨论范围以内,大家可以自己实验一下。
HTMLParser遍历了网页的内容以后,以树(森林)结构保存了结果。HTMLParser访问结果内容的方法有两种。使用Filter和使用Visitor。
下面介绍使用Visitor访问内容的方法。
5.1 NodeVisitor
从简单方面的理解,Filter是根据某种条件过滤取出需要的Node再进行处理。Visitor则是遍历内容树的每一个节点,对于符合条件的节点进行处理。实际的结果异曲同工,两种不同的方法可以达到相同的结果。
下面是一个最常见的NodeVisitro的例子。
测试代码:
1 public static void main(String[] args) { 2 // TODO Auto-generated method stub 3 try { 4 5 Parser mParser = new Parser( 6 (HttpURLConnection) (new URL( 7 "http://127.0.0.1/HtmlParser/index.html")) 8 .openConnection()); 9 NodeVisitor mNodeVisitor = new NodeVisitor(false, false) { 10 @Override 11 public void visitTag(Tag tag) { 12 // TODO Auto-generated method stub 13 message("This is Tag:" + tag.getText()); 14 } 15 16 @Override 17 public void visitStringNode(Text string) { 18 // TODO Auto-generated method stub 19 message("This is Text:" + string); 20 } 21 22 @Override 23 public void visitRemarkNode(Remark remark) { 24 // TODO Auto-generated method stub 25 message("This is Remark:" + remark.getText()); 26 } 27 28 @Override 29 public void beginParsing() { 30 // TODO Auto-generated method stub 31 message("begin Parsing"); 32 } 33 34 @Override 35 public void visitEndTag(Tag tag) { 36 // TODO Auto-generated method stub 37 message("visitEndTag:" + tag.getText()); 38 } 39 40 @Override 41 public void finishedParsing() { 42 // TODO Auto-generated method stub 43 message("finishedParsing!"); 44 } 45 }; 46 mParser.visitAllNodesWith(mNodeVisitor); 47 } catch (Exception e) { 48 // TODO: handle exception 49 } 50 }
测试输出结果:
1 begin Parsing 2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" 3 This is Text:Txt (121[0,121],123[1,0]): \n 4 finishedParsing!
可以看到,开始遍历所以的节点以前,beginParsing先被调用,然后处理的是中间的Node,最后在结束遍历以前,finishParsing被调用。因为我设置的 recurseChildren和recurseSelf都是false,所以Visitor没有访问子节点也没有访问根节点的内容。中间输出的两个\n就是我们在前面初始化Parser 中讨论过的最高层的那两个换行。
我们先把recurseSelf设置成true,看看会发生什么。
1 NodeVisitor visitor = new NodeVisitor( false, true)
输出结果 :
1 begin Parsing 2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" 3 This is Text:Txt (121[0,121],123[1,0]): \n 4 This is Tag:html 5 finishedParsing!
可以看到,HTML页面的第一层节点都被调用了。
我们再用下面的方法调用看看:
1 NodeVisitor mNodeVisitor = new NodeVisitor(true, false)
输出结果:
1 begin Parsing 2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" 3 This is Text:Txt (121[0,121],123[1,0]): \n 4 This is Text:Txt (129[1,6],132[2,1]): \n\t 5 This is Text:Txt (138[2,7],142[3,2]): \n\t\t 6 This is Tag:meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/ 7 This is Text:Txt (216[3,76],220[4,2]): \n\t\t 8 This is Remark:<title>百度</title> 9 This is Text:Txt (244[4,26],248[5,2]): \n\t\t 10 This is Tag:link href = "a_1.css" rel = "stylesheet" type = "text/css"/ 11 This is Text:Txt (309[5,63],312[6,1]): \n\t 12 visitEndTag:/head 13 This is Text:Txt (319[6,8],322[7,1]): \n\t 14 This is Text:Txt (328[7,7],332[8,2]): \n\t\t 15 This is Text:Txt (372[8,42],377[9,3]): \n\t\t\t 16 This is Tag:img src = "../image/baidu.PNG" 17 This is Text:Txt (410[9,36],414[10,2]): \n\t\t 18 visitEndTag:/div 19 This is Text:Txt (420[10,8],424[11,2]): \n\t\t 20 This is Text:Txt (461[11,39],466[12,3]): \n\t\t\t 21 This is Text:Txt (489[12,26],495[13,4]): \n\t\t\t\t 22 This is Text:Txt (499[13,8],506[14,5]): \n\t\t\t\t\t 23 This is Text:Txt (559[14,58],561[14,60]): 新闻 24 visitEndTag:/a 25 This is Text:Txt (565[14,64],571[15,4]): \n\t\t\t\t 26 visitEndTag:/td 27 This is Text:Txt (576[15,9],582[16,4]): \n\t\t\t\t 28 This is Text:Txt (586[16,8],593[17,5]): \n\t\t\t\t\t 29 This is Tag:font color = "black" 30 This is Text:Txt (615[17,27],617[17,29]): 网页 31 visitEndTag:/font 32 This is Text:Txt (624[17,36],630[18,4]): \n\t\t\t\t 33 visitEndTag:/td 34 This is Text:Txt (635[18,9],641[19,4]): \n\t\t\t\t 35 This is Text:Txt (645[19,8],652[20,5]): \n\t\t\t\t\t 36 This is Text:Txt (727[20,80],729[20,82]): 贴吧 37 visitEndTag:/a 38 This is Text:Txt (733[20,86],739[21,4]): \n\t\t\t\t 39 visitEndTag:/td 40 This is Text:Txt (744[21,9],750[22,4]): \n\t\t\t\t 41 This is Text:Txt (754[22,8],761[23,5]): \n\t\t\t\t\t 42 This is Text:Txt (814[23,58],816[23,60]): 知道 43 visitEndTag:/a 44 This is Text:Txt (820[23,64],826[24,4]): \n\t\t\t\t 45 visitEndTag:/td 46 This is Text:Txt (831[24,9],837[25,4]): \n\t\t\t\t 47 This is Text:Txt (841[25,8],848[26,5]): \n\t\t\t\t\t 48 This is Text:Txt (901[26,58],903[26,60]): 音乐 49 visitEndTag:/a 50 This is Text:Txt (907[26,64],913[27,4]): \n\t\t\t\t 51 visitEndTag:/td 52 This is Text:Txt (918[27,9],924[28,4]): \n\t\t\t\t 53 This is Text:Txt (928[28,8],935[29,5]): \n\t\t\t\t\t 54 This is Text:Txt (988[29,58],990[29,60]): 图片 55 visitEndTag:/a 56 This is Text:Txt (994[29,64],1000[30,4]): \n\t\t\t\t 57 visitEndTag:/td 58 This is Text:Txt (1005[30,9],1011[31,4]): \n\t\t\t\t 59 This is Text:Txt (1015[31,8],1022[32,5]): \n\t\t\t\t\t 60 This is Text:Txt (1075[32,58],1077[32,60]): 视频 61 visitEndTag:/a 62 This is Text:Txt (1081[32,64],1087[33,4]): \n\t\t\t\t 63 visitEndTag:/td 64 This is Text:Txt (1092[33,9],1098[34,4]): \n\t\t\t\t 65 This is Text:Txt (1102[34,8],1109[35,5]): \n\t\t\t\t\t 66 This is Text:Txt (1162[35,58],1164[35,60]): 地图 67 visitEndTag:/a 68 This is Text:Txt (1168[35,64],1174[36,4]): \n\t\t\t\t 69 visitEndTag:/td 70 This is Text:Txt (1179[36,9],1184[37,3]): \n\t\t\t 71 visitEndTag:/table 72 This is Text:Txt (1192[37,11],1197[38,3]): \n\t\t\t 73 This is Tag:input class = "input" 74 This is Text:Txt (1221[38,27],1225[39,2]): \n\t\t 75 visitEndTag:/div 76 This is Text:Txt (1231[39,8],1234[40,1]): \n\t 77 visitEndTag:/body 78 This is Text:Txt (1241[40,8],1245[42,0]): \n\n 79 visitEndTag:/html 80 finishedParsing!
可以看到,所有的子节点都出现了,除了刚刚例子里面的两个最上层节点This is Tag:head和This is Tag:html xmlns="http://www.w3.org/1999/xhtml"。
想让它们都出来,只需要
1 NodeVisitor mNodeVisitor = new NodeVisitor(true, true)
输出结果:
1 begin Parsing 2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" 3 This is Text:Txt (121[0,121],123[1,0]): \n 4 This is Tag:html 5 This is Text:Txt (129[1,6],132[2,1]): \n\t 6 This is Tag:head 7 This is Text:Txt (138[2,7],142[3,2]): \n\t\t 8 This is Tag:meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/ 9 This is Text:Txt (216[3,76],220[4,2]): \n\t\t 10 This is Remark:<title>百度</title> 11 This is Text:Txt (244[4,26],248[5,2]): \n\t\t 12 This is Tag:link href = "a_1.css" rel = "stylesheet" type = "text/css"/ 13 This is Text:Txt (309[5,63],312[6,1]): \n\t 14 visitEndTag:/head 15 This is Text:Txt (319[6,8],322[7,1]): \n\t 16 This is Tag:body 17 This is Text:Txt (328[7,7],332[8,2]): \n\t\t 18 This is Tag:div align = "center" class = "photo" 19 This is Text:Txt (372[8,42],377[9,3]): \n\t\t\t 20 This is Tag:img src = "../image/baidu.PNG" 21 This is Text:Txt (410[9,36],414[10,2]): \n\t\t 22 visitEndTag:/div 23 This is Text:Txt (420[10,8],424[11,2]): \n\t\t 24 This is Tag:div align = "center" class = "body" 25 This is Text:Txt (461[11,39],466[12,3]): \n\t\t\t 26 This is Tag:table cellpadding="8" 27 This is Text:Txt (489[12,26],495[13,4]): \n\t\t\t\t 28 This is Tag:td 29 This is Text:Txt (499[13,8],506[14,5]): \n\t\t\t\t\t 30 This is Tag:a href = "#" target = _blank title = "欢迎来到
百度网站" 31 This is Text:Txt (559[14,58],561[14,60]): 新闻 32 visitEndTag:/a 33 This is Text:Txt (565[14,64],571[15,4]): \n\t\t\t\t 34 visitEndTag:/td 35 This is Text:Txt (576[15,9],582[16,4]): \n\t\t\t\t 36 This is Tag:td 37 This is Text:Txt (586[16,8],593[17,5]): \n\t\t\t\t\t 38 This is Tag:font color = "black" 39 This is Text:Txt (615[17,27],617[17,29]): 网页 40 visitEndTag:/font 41 This is Text:Txt (624[17,36],630[18,4]): \n\t\t\t\t 42 visitEndTag:/td 43 This is Text:Txt (635[18,9],641[19,4]): \n\t\t\t\t 44 This is Tag:td 45 This is Text:Txt (645[19,8],652[20,5]): \n\t\t\t\t\t 46 This is Tag:a href = "http://tieba.baidu.com/" target = _blank title = "欢迎来到
百度网站" 47 This is Text:Txt (727[20,80],729[20,82]): 贴吧 48 visitEndTag:/a 49 This is Text:Txt (733[20,86],739[21,4]): \n\t\t\t\t 50 visitEndTag:/td 51 This is Text:Txt (744[21,9],750[22,4]): \n\t\t\t\t 52 This is Tag:td 53 This is Text:Txt (754[22,8],761[23,5]): \n\t\t\t\t\t 54 This is Tag:a href = "#" target = _blank title = "欢迎来到
百度网站" 55 This is Text:Txt (814[23,58],816[23,60]): 知道 56 visitEndTag:/a 57 This is Text:Txt (820[23,64],826[24,4]): \n\t\t\t\t 58 visitEndTag:/td 59 This is Text:Txt (831[24,9],837[25,4]): \n\t\t\t\t 60 This is Tag:td 61 This is Text:Txt (841[25,8],848[26,5]): \n\t\t\t\t\t 62 This is Tag:a href = "#" target = _blank title = "欢迎来到
百度网站" 63 This is Text:Txt (901[26,58],903[26,60]): 音乐 64 visitEndTag:/a 65 This is Text:Txt (907[26,64],913[27,4]): \n\t\t\t\t 66 visitEndTag:/td 67 This is Text:Txt (918[27,9],924[28,4]): \n\t\t\t\t 68 This is Tag:td 69 This is Text:Txt (928[28,8],935[29,5]): \n\t\t\t\t\t 70 This is Tag:a href = "#" target = _blank title = "欢迎来到
百度网站" 71 This is Text:Txt (988[29,58],990[29,60]): 图片 72 visitEndTag:/a 73 This is Text:Txt (994[29,64],1000[30,4]): \n\t\t\t\t 74 visitEndTag:/td 75 This is Text:Txt (1005[30,9],1011[31,4]): \n\t\t\t\t 76 This is Tag:td 77 This is Text:Txt (1015[31,8],1022[32,5]): \n\t\t\t\t\t 78 This is Tag:a href = "#" target = _blank title = "欢迎来到
百度网站" 79 This is Text:Txt (1075[32,58],1077[32,60]): 视频 80 visitEndTag:/a 81 This is Text:Txt (1081[32,64],1087[33,4]): \n\t\t\t\t 82 visitEndTag:/td 83 This is Text:Txt (1092[33,9],1098[34,4]): \n\t\t\t\t 84 This is Tag:td 85 This is Text:Txt (1102[34,8],1109[35,5]): \n\t\t\t\t\t 86 This is Tag:a href = "#" target = _blank title = "欢迎来到
百度网站" 87 This is Text:Txt (1162[35,58],1164[35,60]): 地图 88 visitEndTag:/a 89 This is Text:Txt (1168[35,64],1174[36,4]): \n\t\t\t\t 90 visitEndTag:/td 91 This is Text:Txt (1179[36,9],1184[37,3]): \n\t\t\t 92 visitEndTag:/table 93 This is Text:Txt (1192[37,11],1197[38,3]): \n\t\t\t 94 This is Tag:input class = "input" 95 This is Text:Txt (1221[38,27],1225[39,2]): \n\t\t 96 visitEndTag:/div 97 This is Text:Txt (1231[39,8],1234[40,1]): \n\t 98 visitEndTag:/body 99 This is Text:Txt (1241[40,8],1245[42,0]): \n\n 100 visitEndTag:/html 101 finishedParsing!
哈哈,这下调用清楚了,大家在需要处理的地方增加自己的代码好了。
5.2 其他Visitor
……
到此,个人感觉与htmlparser的缘分已尽!下一步,进军JSoup!!!
===========================参考网址===========================
http://www.blogjava.net/amigoxie/archive/2008/01/18/176200.html
http://www.cnblogs.com/loveyakamoz/archive/2011/07/27/2118937.html
http://blog.csdn.net/witsmakemen/article/details/8778979
===========================参考网址===========================