Java Lucene (9): HTMLParser and HTML Page Parsing

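HTMLParser is an open-source Java library that provides interfaces for parsing both linear and nested HTML text. In a real project you only need to add htmlparser.jar to the classpath to use the HTMLParser API.
HTMLParser represents HTML with three types of nodes: RemarkNode (a comment in the HTML), TagNode (a tag), and TextNode (text). It reads the binary input stream, performs encoding conversion, lexical analysis, and related steps, and produces a tree-structured collection of Node objects. The following program shows the result of parsing a sample HTML page with HTMLParser.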
Listing 9-1:

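import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;

// Parse the whole page; a null filter keeps every node.
Parser parser = new Parser("E:/t.html");
parser.setEncoding("UTF-8");
NodeList list = parser.parse(null);

// Print the resulting node tree.
String str = list.toString();
System.out.println(str);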


The source of t.html is as follows:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title> 北京龙卷风科技 </title>
</head>
<body>
<p>
龙卷风科技 _ 优秀的信息检索平台
网址 http://www.tornado.cn
</p>
</body>
</html>
 
The printed result is as follows:
Txt (0[0,0],1[0,1]): ?Tag (1[0,1],7[0,7]): html
  Txt (7[0,7],9[1,0]): \n
  Tag (9[1,0],15[1,6]): head
    Txt (15[1,6],17[2,0]): \n
    Tag (17[2,0],86[2,69]): meta http-equiv="Content-Type" content="text/html; ch...
    Txt (86[2,69],88[3,0]): \n
    Tag (88[3,0],95[3,7]): title
      Txt (95[3,7],102[3,14]): 北京龙卷风科技
      End (102[3,14],110[3,22]): /title
    Txt (110[3,22],112[4,0]): \n
    End (112[4,0],119[4,7]): /head
  Txt (119[4,7],121[5,0]): \n
  Tag (121[5,0],127[5,6]): body
    Txt (127[5,6],129[6,0]): \n
    Tag (129[6,0],132[6,3]): p
    Txt (132[6,3],177[9,0]): \n 龙卷风科技 _ 优秀的信息检索平台 \n 网址: http://www.tornado.cn\n
    End (177[9,0],181[9,4]): /p
    Txt (181[9,4],183[10,0]): \n
    End (183[10,0],190[10,7]): /body
  Txt (190[10,7],192[11,0]): \n
  End (192[11,0],199[11,7]): /html
Txt (199[11,7],201[12,0]): \n
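
Reading the output above, each entry appears to follow the pattern kind (start[line,column],end[line,column]): text, where start and end are character offsets into the page and the bracketed pair is a zero-based line and column. This reading is inferred from the printed result itself, not taken from the HTMLParser documentation. The sketch below uses only java.util.regex to pull the offsets out of one printed line; the class name NodeLine and its members are hypothetical, introduced purely for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NodeLine {
    // Assumed entry layout: kind (start[line,col],end[line,col]): text
    private static final Pattern ENTRY = Pattern.compile(
            "\\s*(\\w+) \\((\\d+)\\[(\\d+),(\\d+)\\],(\\d+)\\[(\\d+),(\\d+)\\]\\): (.*)");

    final String kind;  // e.g. "Txt", "Tag" or "End"
    final int start;    // offset of the first character of the node
    final int end;      // offset just past the last character of the node
    final String text;  // whatever follows the colon

    private NodeLine(String kind, int start, int end, String text) {
        this.kind = kind;
        this.start = start;
        this.end = end;
        this.text = text;
    }

    // Parses one line of the printed node tree; leading indentation is ignored.
    static NodeLine parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a node entry: " + line);
        }
        return new NodeLine(m.group(1),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(5)),
                m.group(8));
    }

    // Number of source characters covered by the node.
    int length() {
        return end - start;
    }

    public static void main(String[] args) {
        NodeLine n = parse("    Tag (88[3,0],95[3,7]): title");
        // Prints: Tag spans 7 characters: title
        System.out.println(n.kind + " spans " + n.length() + " characters: " + n.text);
    }
}
```

The 7 characters reported for the title entry match the length of the opening tag <title> in t.html, which is consistent with the offsets counting characters of the source page.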
 
Next, we create a test class that extracts the text content from an HTML page.
Listing 9-2:



import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.nodes.TextNode;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.HtmlPage;

public class SimpleHtmlparser {

    public static void main(String[] args) throws ParserException {
        // Parse the page given as the first argument and collect its <body>.
        Parser parser = new Parser(args[0]);
        parser.setEncoding("UTF-8");
        HtmlPage htmlpage = new HtmlPage(parser);
        parser.visitAllNodesWith(htmlpage);
        String body = htmlpage.getBody().toHtml();

        // Re-parse the body and keep only the text nodes.
        Parser nodesParser = Parser.createParser(body, "UTF-8");
        NodeFilter textFilter = new NodeClassFilter(TextNode.class);
        NodeList nodeList = null;
        try {
            nodeList = nodesParser.parse(textFilter);
        } catch (ParserException e) {
            e.printStackTrace();
        }

        // Parsing failed: stop here instead of dereferencing null below.
        if (nodeList == null) {
            System.out.println(" ");
            return;
        }

        Node[] nodes = nodeList.toNodeArray();
        StringBuilder result = new StringBuilder();
        for (Node nextNode : nodes) {
            String content = "";
            if (nextNode instanceof TextNode) {
                content = ((TextNode) nextNode).getText();
            }
            // Accumulate the full text and print each fragment as it is found.
            result.append(content).append(" ");
            System.out.println(content);
        }
    }
}
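
The loop above prints every text node, including those that contain nothing but line breaks (the parse result printed earlier shows several such Txt entries). If a single clean string is wanted, the fragments can be trimmed and joined. The following sketch is plain Java with no HTMLParser dependency; the class TextJoiner and its join method are hypothetical names, and the input list merely stands in for the values returned by textnode.getText():

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TextJoiner {

    // Joins text fragments into one line, dropping fragments that are only whitespace.
    static String join(List<String> fragments) {
        List<String> kept = new ArrayList<>();
        for (String fragment : fragments) {
            // Collapse runs of whitespace (including line breaks) into a single space.
            String cleaned = fragment.replaceAll("\\s+", " ").trim();
            if (!cleaned.isEmpty()) {
                kept.add(cleaned);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList(
                "\n",
                "\n  first line \n second   line\n",
                "\n");
        // Prints: first line second line
        System.out.println(join(fragments));
    }
}
```

Note that \s in java.util.regex does not match full-width or non-breaking spaces by default, so pages containing those characters may need an extra replacement step.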



Testing shows that although HTMLParser extracts the text of an HTML page fairly well, it does not handle javascript tags well, and it also fails to cleanly remove style sheets (<style>).
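
One possible workaround, which is not part of HTMLParser itself, is to remove <script> and <style> blocks from the raw page before extracting the text. The sketch below does this with java.util.regex; the class name ScriptStyleStripper is hypothetical, and a regular expression is only a rough approximation that unusual markup can defeat (for example a literal "</script>" inside a JavaScript string):

```java
import java.util.regex.Pattern;

public class ScriptStyleStripper {

    // (?is): match case-insensitively and let '.' span line breaks.
    // \1 requires the closing tag to carry the same name as the opening tag.
    private static final Pattern BLOCKS = Pattern.compile(
            "(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Returns the page with every <script>...</script> and <style>...</style> block removed.
    static String strip(String html) {
        return BLOCKS.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>a</p>"
                + "<style>p { color: red; }</style>"
                + "<script type=\"text/javascript\">var x = 1;\n</script>"
                + "<p>b</p>";
        // Prints: <p>a</p><p>b</p>
        System.out.println(strip(html));
    }
}
```

The cleaned string could then be handed to Parser.createParser(html, "UTF-8") in the same way the body is re-parsed in the test class.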
Author's MSN: [email protected]   QQ: 569634476
http://www.tornado.cn
