Java Crawlers: Using HtmlUnit and HttpClient

Blog link: Cs XJH’s Blog

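I recently took over a project that needs to scrape data from web pages, so I spent some time learning about crawlers.
Everyone says Python is the go-to language for crawling, but the project is written in Java, so I picked two tools from the Maven repository instead: HtmlUnit and HttpClient. Once I got familiar with them, they turned out to be powerful and pleasant to use as well.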

First, the environment:

<parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.2.0.RELEASE</version>
        <relativePath/>
</parent>
<dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
</dependency>
<dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
</dependency>

The htmlunit and httpclient versions are not declared here; they are inherited from the defaults managed by spring-boot-starter-parent.

Htmlunit

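Compared with HttpClient, HtmlUnit is the less familiar of the two. HtmlUnit simulates a browser without a GUI and can execute a page's JavaScript in the background. It also offers three ways to pick elements out of the parsed HTML: DOM traversal, CSS selectors, and XPath.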
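Of the three, XPath is probably the least familiar, so here is a rough sketch of what an XPath query looks like. Note that it uses the JDK's own javax.xml.xpath on a small well-formed snippet rather than HtmlUnit itself (HtmlUnit exposes the same idea through getByXPath on a page or element). The class name, method name, and sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of HtmlUnit.
class XPathSketch {

    // Returns the trimmed text of every node matched by the XPath expression.
    // The input must be well-formed XML/XHTML; real-world HTML often is not,
    // which is one reason to let HtmlUnit do the parsing in practice.
    static List<String> selectText(String xhtml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or malformed input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<table id=\"Table\">"
                + "<tr><td>Math</td><td>Mon</td></tr>"
                + "<tr><td>Physics</td><td>Tue</td></tr>"
                + "</table>";
        // Select the first cell of every row in the table with id "Table".
        System.out.println(selectText(xhtml, "//table[@id='Table']/tr/td[1]"));
    }
}
```

The expression reads left to right: find the table by id, step into its rows, then take the first cell of each row.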

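HtmlUnit's main advantage is that simulating a login is very convenient: instead of constructing the form data yourself, you fill in the form fields directly. For sites where the server renders the HTML (front end and back end combined), it also makes parsing that HTML very easy.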

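Another well-known Java crawling tool is Jsoup. It parses HTML just as capably as HtmlUnit, but for a simulated login it requires you to construct the form data yourself. Login pages often carry security protections such as XSS defenses, and some form fields are quite tedious to construct. For this project, then, HtmlUnit is the better fit.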
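To give a feel for what "constructing the form data yourself" can involve, here is a minimal sketch that digs a hidden field out of the login page's HTML before a form could be submitted. It is an illustration only: the field name `_token` is hypothetical, and a regular expression like this is brittle (it assumes double-quoted attributes with name appearing before value).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing the manual work that HtmlUnit avoids.
class HiddenFieldSketch {

    // Returns the value attribute of the <input> whose name attribute equals
    // fieldName, or null when no such input is found.
    // Assumes double-quoted attributes and that name comes before value.
    static String hiddenValue(String html, String fieldName) {
        Pattern pattern = Pattern.compile(
                "<input[^>]*name=\"" + Pattern.quote(fieldName) + "\"[^>]*value=\"([^\"]*)\"");
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<form>"
                + "<input type=\"hidden\" name=\"_token\" value=\"abc123\"/>"
                + "<input type=\"text\" name=\"user\" value=\"\"/>"
                + "</form>";
        // The extracted value would have to be sent back with the login request.
        System.out.println(hiddenValue(html, "_token"));
    }
}
```

With HtmlUnit none of this is needed, because the hidden fields already sit in the form that gets submitted when the button is clicked.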

Configuration

// Register the WebClient as a bean in the IoC container
@Bean
public WebClient getWebClient() {
	WebClient webClient = new WebClient(BrowserVersion.CHROME);
	// Follow redirects
	webClient.getOptions().setRedirectEnabled(true);
	// Enable the JavaScript engine
	webClient.getOptions().setJavaScriptEnabled(true);
	// Disable CSS support
	webClient.getOptions().setCssEnabled(false);
	// Do not use native ActiveX objects (used for animation, video and the like)
	webClient.getOptions().setActiveXNative(false);
	// Do not throw an exception when a script fails
	webClient.getOptions().setThrowExceptionOnScriptError(false);
	// Do not throw an exception on a failing HTTP status code
	webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
	// Install the Ajax controller, which runs Ajax calls triggered by user actions synchronously
	webClient.setAjaxController(new NicelyResynchronizingAjaxController());
	// Abort any script that runs longer than 10 seconds
	webClient.setJavaScriptTimeout(10 * 1000);
	return webClient;
}

Usage

// Simulated login
public void login(WebClient webClient) throws BaseException {
	webClient.getCookieManager().clearCookies();  // start with no cookies

	String homeUrl = "";
	try {
		HtmlPage loginPage = webClient.getPage(loginUrl);
		// Give background JavaScript up to 1 second to finish
		webClient.waitForBackgroundJavaScript(1000);

		// Look up the form controls by element id
		HtmlTextInput nameInput = loginPage.getHtmlElementById(userNameElement);
		HtmlTextInput pwdInput = loginPage.getHtmlElementById(userPwdElement);
		HtmlButton submit = loginPage.getHtmlElementById(submitElement);

		// Fill in the fields directly instead of building form data
		nameInput.setText(userName());
		pwdInput.setText(userPwd());

		// Clicking the button returns the page the browser lands on
		HtmlPage nextPage = submit.click();
		homeUrl = nextPage.getBaseURL().toString();
	}
	catch (IOException e) {
		e.printStackTrace();
	}

	// Landing anywhere other than the home page is treated as a failed login
	if (!homeUrl.equals(homeUrl())) {
		throw new BaseException(1002, "Incorrect username or password");
	}
}


// Parse an HTML table
HtmlPage coursePage = webClient.getPage(url);
HtmlTable courseTable = coursePage.getHtmlElementById("Table");

for (int i = 0, rLen = courseTable.getRowCount(); i < rLen; i++) {
	HtmlTableRow row = courseTable.getRow(i);

	for (int j = 0, cLen = row.getCells().size(); j < cLen; j++) {
		// Handle each cell here, e.g. read its text with row.getCell(j).asText()
	}
}

HttpClient

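HttpClient is a tool for sending HTTP requests programmatically, and it is commonly used to consume the responses of RESTful interfaces. It is not suited to parsing HTML, so for crawling it is really only a fit for sites whose front end and back end are separated.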

Configuration

// The object HttpClient uses to store cookies
@Bean
public CookieStore getCookieStore() {
	return new BasicCookieStore();
}

@Bean(name = "httpClient")
public CloseableHttpClient getHttpClient(CookieStore cookieStore) {
	return HttpClients.custom()
	                .setDefaultCookieStore(cookieStore)
	                .build();
}

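// Build an HttpClient that can send https requests.
// Note: this trust manager accepts every certificate and skips hostname checks.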
@Bean(name = "httpsClient")
public HttpClient getHttpsClient() {
	SSLConnectionSocketFactory sslsf = null;
	try {
		SSLContext sslContext = SSLContext.getInstance("TLS");
		sslContext.init(null, new TrustManager[] {
			new X509TrustManager() {
				@Override
				public void checkClientTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
				}

				@Override
				public void checkServerTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
				}

				@Override
				public X509Certificate[] getAcceptedIssuers() {
					return new X509Certificate[0]; // the contract asks for a non-null array
				}
			}
		}, null);

		sslsf = new SSLConnectionSocketFactory(sslContext, NoopHostnameVerifier.INSTANCE);
	}catch (NoSuchAlgorithmException | KeyManagementException e) {
		e.printStackTrace();
	}

	return HttpClients.custom().setSSLSocketFactory(sslsf)
	                .setMaxConnTotal(50)
	                .setMaxConnPerRoute(50)
	                .setDefaultRequestConfig(RequestConfig.custom()
	                        .setConnectionRequestTimeout(60000)
	                        .setConnectTimeout(60000)
	                        .setSocketTimeout(60000)
	                        .build())
	                .build();
}

Usage

// Copy an existing session cookie into HttpClient's cookie store
BasicClientCookie cookie = new BasicClientCookie(sessionName, session.getValue());

cookie.setDomain(session.getDomain());
cookie.setExpiryDate(session.getExpires());
cookie.setPath(session.getPath());

cookieStore.addCookie(cookie);

CloseableHttpResponse response = null;
try {
	// The default Content-Type is application/x-www-form-urlencoded
	UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "UTF-8");
	httpPost.setEntity(formEntity);

	// Send the POST request
	response = httpClient.execute(httpPost);
	String json = EntityUtils.toString(response.getEntity(), "UTF-8");

	// Make sure the entity is fully consumed
	EntityUtils.consume(response.getEntity());
} catch (IOException e) {
	e.printStackTrace();
} finally {
	// response is still null if execute() threw, so guard the close
	if (response != null) {
		try {
			response.close(); // close the response
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}
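
For readers wondering what an application/x-www-form-urlencoded body actually looks like, the following sketch builds one with nothing but the JDK. It is not how UrlEncodedFormEntity is implemented, and the escaping of a few special characters may differ slightly from HttpClient's; the class name and sample parameters are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of HttpClient.
class FormBodySketch {

    // Joins the parameters as key=value pairs separated by '&',
    // percent-encoding keys and values as UTF-8.
    static String encode(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            body.add(escape(entry.getKey()) + "=" + escape(entry.getValue()));
        }
        return body.toString();
    }

    private static String escape(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("username", "alice");
        params.put("q", "a b&c=d");
        // Spaces become '+', while '&' and '=' inside values are percent-encoded
        System.out.println(encode(params));
    }
}
```

Because '&' and '=' act as separators, they have to be escaped inside keys and values, which is why the entity is built from name/value pairs rather than from a hand-written string.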
