Five Ways to Implement a Web Crawler (Part 3: A Crawler Based on HttpClient)

(Yes, this update is overdue~)

As everyone knows, HttpClient is the workhorse of Java crawling.

In my own projects I generally use HttpClient for the fetching side (page downloads, logins, proxies) and Jsoup, XPath, or regular expressions for parsing.
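On the parsing side, here is a tiny, hedged sketch of the regex route: extracting a page's `<title>` from an HTML string. The HTML below is a stand-in for real fetched content, and `TitleExtractor` is a hypothetical helper; for anything beyond quick scripts, a real parser like Jsoup handles malformed markup far better.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Pulls the <title> text out of an HTML string with a regex.
    // Fine for quick one-off scripts; a proper parser such as Jsoup
    // is the more robust choice for real pages.
    public static String title(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }
}
```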

Enough talk; straight to the code.

public static String getPageContent(String url) {
	// Create a client, similar to opening a browser
	DefaultHttpClient httpClient = new DefaultHttpClient();
	// Create a GET request, similar to typing an address into the address bar
	HttpGet httpGet = new HttpGet(url);
	String content = "";
	try {
		// Execute the request, similar to pressing Enter in the address bar
		HttpResponse response = httpClient.execute(httpGet);
		// Inspect the response body
		HttpEntity entity = response.getEntity();
		if (entity != null) {
			content += EntityUtils.toString(entity, "utf-8");
			EntityUtils.consume(entity); // close the content stream
		}
	} catch (Exception e) {
		logger.error("Failed to fetch page content: " + e);
	}
	httpClient.getConnectionManager().shutdown();
	return content;
}

That is a bare-bones HttpClient fetch. It uses DefaultHttpClient (deprecated since HttpClient 4.3), whose connection manager must be shut down manually; otherwise later requests can conflict.

Alternatively, CloseableHttpClient httpClient = HttpClients.createDefault(); is more convenient.
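Because CloseableHttpClient implements java.io.Closeable, it fits the try-with-resources pattern, which closes the client automatically even when an exception is thrown. A minimal sketch of that pattern follows; FakeClient is a hypothetical stand-in for the real client so the example runs without a network:

```java
// FakeClient is a hypothetical stand-in for CloseableHttpClient:
// it "executes" a request and records whether close() was called.
class FakeClient implements AutoCloseable {
    boolean closed = false;

    String execute(String url) {
        return "<html>stub page for " + url + "</html>";
    }

    @Override
    public void close() {
        closed = true;
    }
}

public class TryWithResourcesDemo {
    // close() runs automatically when the try block exits,
    // even if execute() throws, so no manual shutdown is needed.
    public static String fetch(FakeClient client, String url) {
        try (client) {
            return client.execute(url);
        }
    }
}
```

With the real library, the same shape is `try (CloseableHttpClient client = HttpClients.createDefault()) { ... }`.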

Is there anything wrong with the code above? Functionally, no.

But in another sense, yes: it ignores the request headers, and many sites will block a request that looks this bare.

So what can we do?

Rewrite it like this:

public static String getPageContent_addHeader(String url) {
	CloseableHttpClient httpclient = HttpClients.createDefault();

	try {
		HttpGet httpget = new HttpGet(url);
		// Make the request look like it comes from a real browser
		httpget.addHeader("Accept", Accept);
		httpget.addHeader("Accept-Charset", Accept_Charset);
		httpget.addHeader("Accept-Encoding", Accept_EnCoding);
		httpget.addHeader("Accept-Language", Accept_Language);
		httpget.addHeader("User-Agent", User_Agent);
		// Let HttpClient handle status checking and entity consumption
		ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

			public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
				int status = response.getStatusLine().getStatusCode();
				if (status >= 200 && status < 300) {
					HttpEntity entity = response.getEntity();
					return entity != null ? EntityUtils.toString(entity) : null;
				} else {
					// Do not call System.exit() here: one failed URL
					// should not kill the whole crawler
					throw new ClientProtocolException("Unexpected response status: " + status);
				}
			}
		};
		String responseBody = httpclient.execute(httpget, responseHandler);
		return responseBody;
	} catch (Exception e) {
		logger.error(e);
	} finally {
		try {
			httpclient.close();
		} catch (IOException e) {
			logger.error("httpclient did not close cleanly");
		}
	}
	return null;
}

This version adds a few request headers, defined as follows:

private static String User_Agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22";
	private static String Accept = "text/html";
	private static String Accept_Charset = "utf-8";
	private static String Accept_EnCoding = "gzip";
	private static String Accept_Language = "en-US,en";
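As a cross-check on these header values, the same browser-like headers can be attached with the JDK's built-in java.net.http client (Java 11+). This is an alternative to Apache HttpClient, not the article's code; the URL is a placeholder, and actually sending the request would use HttpClient.send(...):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class HeaderDemo {
    // Builds a GET request carrying the same browser-like headers
    // as the Apache HttpClient version above.
    public static HttpRequest build(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "text/html")
                .header("Accept-Charset", "utf-8")
                .header("Accept-Encoding", "gzip")
                .header("Accept-Language", "en-US,en")
                .header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22")
                .GET()
                .build();
    }
}
```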

