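I used HttpClient to scrape the content of specific pages from a website. The code that fetches the response content from a given URL is shown below. This is the basic pattern HttpClient recommends for requesting a page, but on the first run the server rejected the request outright with 403 Forbidden.
// Imports for Apache HttpClient 4.x, needed by every version of getByString in this post
import java.io.IOException;
import java.util.Date;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public final static String getByString(String url) throws Exception {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    try {
        HttpGet httpget = new HttpGet(url);
        // The handler turns the raw response into a String, or fails on a non-2xx status
        ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
            @Override
            public String handleResponse(
                    final HttpResponse response) throws ClientProtocolException, IOException {
                int status = response.getStatusLine().getStatusCode();
                if (status >= 200 && status < 300) {
                    HttpEntity entity = response.getEntity();
                    System.out.println(status);
                    return entity != null ? EntityUtils.toString(entity) : null;
                } else {
                    // Log the status and the time of the failure, then stop the program
                    System.out.println(status);
                    Date date = new Date();
                    System.out.println(date);
                    System.exit(0);
                    // Not reached at runtime; it lets this branch compile without a return value
                    throw new ClientProtocolException("Unexpected response status: " + status);
                }
            }
        };
        String responseBody = httpclient.execute(httpget, responseHandler);
        return responseBody;
    } finally {
        // Always release the client's resources
        httpclient.close();
    }
}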
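Since the site was reachable from a browser but not through the method above, I tried adding header fields to httpget so that, from the server's point of view, the request looks more like a visit from a real user.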
public final static String getByString(String url) throws Exception {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    try {
        HttpGet httpget = new HttpGet(url);
        // Browser-like request headers
        httpget.addHeader("Accept", "text/html");
        httpget.addHeader("Accept-Charset", "utf-8");
        httpget.addHeader("Accept-Encoding", "gzip");
        httpget.addHeader("Accept-Language", "en-US,en");
        httpget.addHeader("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
        ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
            @Override
            public String handleResponse(
                    final HttpResponse response) throws ClientProtocolException, IOException {
                int status = response.getStatusLine().getStatusCode();
                if (status >= 200 && status < 300) {
                    HttpEntity entity = response.getEntity();
                    System.out.println(status);
                    return entity != null ? EntityUtils.toString(entity) : null;
                } else {
                    // Log the status and the time of the failure, then stop the program
                    System.out.println(status);
                    Date date = new Date();
                    System.out.println(date);
                    System.exit(0);
                    // Not reached at runtime; it lets this branch compile without a return value
                    throw new ClientProtocolException("Unexpected response status: " + status);
                }
            }
        };
        String responseBody = httpclient.execute(httpget, responseHandler);
        return responseBody;
    } finally {
        httpclient.close();
    }
}
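The same five headers are repeated in every version of getByString, so one option is to keep them in a single place. The following is only a minimal sketch using the standard library: the class name `BrowserHeaders` and the method `defaults()` are hypothetical names introduced here for illustration, not part of HttpClient. The header values are the ones used above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that keeps the browser-like headers from this post in one place. */
final class BrowserHeaders {
    private BrowserHeaders() {}

    /** Returns the headers in insertion order as a read-only map. */
    static Map<String, String> defaults() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "text/html");
        headers.put("Accept-Charset", "utf-8");
        headers.put("Accept-Encoding", "gzip");
        headers.put("Accept-Language", "en-US,en");
        headers.put("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) "
                + "Chrome/25.0.1364.160 Safari/537.22");
        return Collections.unmodifiableMap(headers);
    }

    public static void main(String[] args) {
        // With the HttpGet object from the post, the map could be applied as:
        //   BrowserHeaders.defaults().forEach(httpget::addHeader);
        // Here the headers are only printed, so the sketch runs without HttpClient.
        defaults().forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

Keeping the headers in a map means a change to the User-Agent string only has to be made once.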
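On the next run the server no longer rejected the requests. After a large number of pages had been requested within a short time, however, the server started refusing access again, still with a 403 error. At that point, visiting the site from a browser showed the same 403 error. Once this happens, the only way out is to reconnect the broadband line or switch proxies through a VPN, in order to get a different IP address. Since the server had most likely blocked the local IP address, I tried lowering the request rate by adding a sleep() call to the code, so that the program waits for a while after every request. And because a rejection by the server cannot be fixed from within the program, when an abnormal response status is received I added
System.exit(0);
so that the program exits immediately.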
public final static String getByString(String url) throws Exception {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    try {
        HttpGet httpget = new HttpGet(url);
        // Browser-like request headers
        httpget.addHeader("Accept", "text/html");
        httpget.addHeader("Accept-Charset", "utf-8");
        httpget.addHeader("Accept-Encoding", "gzip");
        httpget.addHeader("Accept-Language", "en-US,en");
        httpget.addHeader("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
        ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
            @Override
            public String handleResponse(
                    final HttpResponse response) throws ClientProtocolException, IOException {
                int status = response.getStatusLine().getStatusCode();
                if (status >= 200 && status < 300) {
                    HttpEntity entity = response.getEntity();
                    System.out.println(status);
                    return entity != null ? EntityUtils.toString(entity) : null;
                } else {
                    // The server has refused the request: log the status and time, then exit
                    System.out.println(status);
                    Date date = new Date();
                    System.out.println(date);
                    System.exit(0);
                    // Not reached at runtime; it lets this branch compile without a return value
                    throw new ClientProtocolException("Unexpected response status: " + status);
                }
            }
        };
        String responseBody = httpclient.execute(httpget, responseHandler);
        // Wait 200 ms after each request to keep the request rate down
        Thread.sleep(200);
        return responseBody;
    } finally {
        httpclient.close();
    }
}
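The fixed sleep(200) above always waits the full 200 ms, even when the request itself already took longer than that. A possible variation, shown here only as a sketch, is to enforce a minimum gap between the start of consecutive requests and sleep only for the remaining time. The class `RequestThrottle` and its methods are hypothetical names introduced for illustration; the 200 ms value is taken from the code above, and whether any particular interval avoids a block depends on the server.

```java
/** Hypothetical helper that enforces a minimum gap between consecutive requests. */
final class RequestThrottle {
    private final long minIntervalMillis;
    private long lastRequestMillis;
    private boolean hasPrevious;

    RequestThrottle(long minIntervalMillis) {
        if (minIntervalMillis < 0) {
            throw new IllegalArgumentException("minIntervalMillis must not be negative");
        }
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Pure calculation: milliseconds still to wait if a request were issued at nowMillis. */
    long remainingMillis(long nowMillis) {
        if (!hasPrevious) {
            return 0L; // the first request never waits
        }
        long elapsed = nowMillis - lastRequestMillis;
        return Math.max(0L, minIntervalMillis - elapsed);
    }

    /** Records that a request was issued at nowMillis. */
    void markRequest(long nowMillis) {
        lastRequestMillis = nowMillis;
        hasPrevious = true;
    }

    /** Blocks until the minimum gap has elapsed, then records the request time. */
    void acquire() throws InterruptedException {
        long wait = remainingMillis(monotonicMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(monotonicMillis());
    }

    /** Monotonic clock in milliseconds; unaffected by changes to the system time. */
    private static long monotonicMillis() {
        return System.nanoTime() / 1_000_000L;
    }

    public static void main(String[] args) throws InterruptedException {
        RequestThrottle throttle = new RequestThrottle(200);
        for (int i = 0; i < 3; i++) {
            throttle.acquire();
            // In the scraper, getByString(url) would be called at this point.
            System.out.println("request " + i + " may be sent now");
        }
    }
}
```

In the scraper, `throttle.acquire()` would be called before each getByString(url) call in place of the sleep inside the method. This sketch is meant for a single thread; sharing one throttle across threads would need synchronization.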