A Java web crawler tool

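Today I stumbled upon a web crawling tool, HtmlUnit. It can process dynamic pages, because it executes the page's JavaScript, whereas HttpClient only retrieves the static markup returned by the server.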
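To make the "static only" limitation concrete, here is a minimal sketch of what a plain HTTP client sees. It relies on several assumptions that are not from the original post: the JDK's built-in `java.net.http.HttpClient` (Java 11+) stands in for Apache HttpClient, and the page, the local throwaway server, and the class name `StaticFetchDemo` are all hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class StaticFetchDemo {

    // Hypothetical page: the visible text only exists after the script has run.
    static final String PAGE =
        "<html><body><div id=\"out\"></div>"
        + "<script>document.getElementById('out').textContent = 'rendered-' + (1 + 1);</script>"
        + "</body></html>";

    // Serves PAGE from a throwaway local server and returns the body
    // exactly as a plain HTTP client receives it.
    static String fetchRaw() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            byte[] body = PAGE.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (var out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .build();
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        String html = fetchRaw();
        // The script source is present in the response...
        System.out.println(html.contains("<script>"));
        // ...but it was never executed, so the text it would produce is absent.
        System.out.println(html.contains("rendered-2"));
    }
}
```

The response contains the script source but not the text the script would write into the page. A crawler that needs that text has to run the JavaScript, which is the gap HtmlUnit fills.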


Maven:

    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.22</version>
    </dependency>


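API:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Assert;
import org.junit.Test;

@Test
public void homePage() throws Exception {
    // WebClient is AutoCloseable, so try-with-resources releases it when the test ends.
    try (final WebClient webClient = new WebClient()) {
        // Load the page and check its <title>.
        final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
        Assert.assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());

        // The page serialized as XML, including markup.
        final String pageAsXml = page.asXml();
        Assert.assertTrue(pageAsXml.contains("<body class=\"composite\">"));

        // The page as plain text, roughly what a user would see.
        final String pageAsText = page.asText();
        Assert.assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
    }
}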


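Emulating a specific browser (BrowserVersion is in the com.gargoylesoftware.htmlunit package; the available constants depend on the HtmlUnit release):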

@Test
public void homePage_Firefox() throws Exception {
    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38)) {
        final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
        Assert.assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());
    }
}

