htmlparser: a lightweight tool for fetching and parsing web pages

htmlparser is small but capable. When scraping ordinary HTML pages, it can look up nodes by CSS selector, for example:

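import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.logging.Logger;

import org.htmlparser.Parser;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// The class name is a placeholder; the original post shows only the main method.
public class HtmlParserDemo {
    // Assumption: the original does not show its logger, so java.util.logging is used here.
    private static final Logger logger = Logger.getLogger(HtmlParserDemo.class.getName());

    public static void main(String[] args) throws IOException, ParserException {
        String site = "http://tech.qq.com/a/20131112/011680.htm";
        String site2 = "http://www.chinanews.com/gn/2013/11-12/5492942.shtml";
        URL url = new URL(site2);
        URLConnection urlConnection = url.openConnection();
        Parser parser = new Parser(urlConnection);
        parser.setEncoding("GBK"); // decode the page as GBK

        // Alternative 1: extract all text in the page (needs org.htmlparser.visitors.TextExtractingVisitor).
        // TextExtractingVisitor visitor = new TextExtractingVisitor();
        // parser.visitAllNodesWith(visitor);
        // logger.info("text:[" + visitor.getExtractedText() + "]");

        // Alternative 2, for `site`: match <div id="Cnt-Main-Article-QQ"> by tag name and attribute
        // (needs AndFilter, TagNameFilter and HasAttributeFilter from org.htmlparser.filters).
        // AndFilter andFilter = new AndFilter(new TagNameFilter("div"), new HasAttributeFilter("id", "Cnt-Main-Article-QQ"));
        // NodeList nodes = parser.parse(andFilter);

        // Select by CSS: "#Cnt-Main-Article-QQ" is the id selector for `site`, ".left_zw" the class selector for `site2`.
        // CssSelectorNodeFilter cssSelectorNodeFilter = new CssSelectorNodeFilter("#Cnt-Main-Article-QQ");
        CssSelectorNodeFilter cssSelectorNodeFilter = new CssSelectorNodeFilter(".left_zw");
        NodeList nodes2 = parser.parse(cssSelectorNodeFilter);
        System.out.println("html:[" + nodes2.toHtml() + "]");

        logger.info("ok");
    }
}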
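
The example above sets the encoding to GBK explicitly. As a rough illustration of why the charset matters, the sketch below uses only the JDK (no htmlparser calls) to decode the same GBK bytes with the right and the wrong charset. The class name EncodingDemo and the helper decode are made up for this illustration and are not part of htmlparser.

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    // Hypothetical helper: decode raw bytes using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the source file encoding does not matter.
        String original = "\u4e2d\u6587";
        // The bytes as a GBK-encoded page would deliver them.
        byte[] raw = original.getBytes(Charset.forName("GBK"));

        // Decoding with the matching charset restores the text; decoding with UTF-8 does not.
        System.out.println("GBK matches original:   " + decode(raw, "GBK").equals(original));
        System.out.println("UTF-8 matches original: " + decode(raw, "UTF-8").equals(original));
    }
}
```

If the extracted text looks garbled, checking the charset the page actually uses is a reasonable first step.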



The corresponding Maven dependency in pom.xml:
<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>2.1</version>
</dependency>
