Fetching JS-rendered pages with HtmlUnit

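Requirement:

Some sites render their pages with JavaScript, so the crawler has to fetch each page as it looks after its scripts have run, not just the raw HTML.

Implementation:

Based on HtmlUnit: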

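import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static void getAjaxPage() throws Exception {
	WebClient webClient = new WebClient();
	try {
		webClient.setJavaScriptEnabled(true);  // run the page's scripts
		webClient.setCssEnabled(false);        // CSS is not needed for content extraction
		// run AJAX calls issued from the main thread synchronously
		webClient.setAjaxController(new NicelyResynchronizingAjaxController());
		webClient.setTimeout(Integer.MAX_VALUE);  // effectively no timeout; use a finite value in a real crawl
		webClient.setThrowExceptionOnScriptError(false);  // tolerate script errors on the page
		HtmlPage rootPage = webClient.getPage("http://tt.mop.com/read_14304066_1_0.html");

		// print the DOM as it stands after script execution
		System.out.println(rootPage.asXml());
	} finally {
		webClient.closeAllWindows();  // release windows and background JS threads
	}
}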
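NicelyResynchronizingAjaxController only resynchronizes AJAX calls made from the main thread, so content loaded by timers or other background scripts may still be missing when getPage returns. One way to handle that is to re-read the page source until an expected marker shows up. The sketch below is a hypothetical helper, not part of HtmlUnit; it uses only the JDK, and the class name, method name, and the `id="post"` marker are all assumptions made for illustration. With HtmlUnit, the supplier would be something like `() -> rootPage.asXml()`.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper (not an HtmlUnit class): polls a page snapshot until it looks complete.
class RenderedPagePoller {

    /**
     * Calls snapshot.get() repeatedly until isReady accepts the result or
     * timeoutMillis has elapsed. Returns the accepted snapshot, or null on
     * timeout or if the thread is interrupted while waiting.
     */
    static String pollUntil(Supplier<String> snapshot, Predicate<String> isReady,
                            long timeoutMillis, long intervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            String current = snapshot.get();
            if (isReady.test(current)) {
                return current;
            }
            if (System.currentTimeMillis() >= deadline) {
                return null;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return null;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated page source: the marker only appears on the third read.
        int[] reads = {0};
        Supplier<String> fakePage = () -> ++reads[0] < 3
                ? "<html><body/></html>"
                : "<html><body><div id=\"post\"/></body></html>";

        String xml = pollUntil(fakePage, s -> s.contains("id=\"post\""), 2000, 10);
        System.out.println(xml != null ? "rendered after " + reads[0] + " reads" : "timed out");
    }
}
```

Polling on a marker that the crawler actually needs keeps the wait as short as the page allows, and the finite timeout prevents the crawl from stalling on pages where the marker never appears.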

Maven dependencies:


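<dependency>
	<groupId>net.sourceforge.htmlunit</groupId>
	<artifactId>htmlunit-core-js</artifactId>
	<version>2.9</version>
	<scope>compile</scope>
</dependency>
<dependency>
	<groupId>net.sourceforge.htmlunit</groupId>
	<artifactId>htmlunit</artifactId>
	<version>2.9</version>
	<scope>compile</scope>
</dependency>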

Note:

Nutch plugin: nutch-htmlunit replaces Nutch's built-in HTTP fetch component.

 
