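I started playing with web crawlers last December, a full 7 months ago, and for all that time I never really understood cookies or the HTTP protocol. After all these months I have finally figured it out, and finally managed to crawl my way in!!!
I only started learning HttpClient yesterday, so today I'll practice with it by crawling my school's information portal: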
http://myportal.sxu.edu.cn/login.portal
The following information was captured with the Chrome browser's developer tools (shortcut F12):
1. http://myportal.sxu.edu.cn/
Request:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:JSESSIONID=0000MS7su8CHOtDnUq6dxd7YGdB:1b4e17ihg
Host:myportal.sxu.edu.cn
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
Response:
Cache-Control:no-cache="set-cookie, set-cookie2"
Content-Language:zh-CN
Content-Length:8252
Content-Type:text/html;charset=utf-8
Date:Sun, 09 Jul 2017 09:04:57 GMT
Expires:Thu, 01 Dec 1994 16:00:00 GMT
Server:IBM_HTTP_Server
Set-Cookie:iPlanetDirectoryPro=""; Expires=Thu, 01 Dec 1994 16:00:00 GMT; Path=/; Domain=.sxu.edu.cn
Set-Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/
2. http://myportal.sxu.edu.cn/captchaGenerate.portal?s=0.5123204417293254
Request:
Accept:image/webp,image/*,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
Host:myportal.sxu.edu.cn
Referer:http://myportal.sxu.edu.cn/
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
3. http://myportal.sxu.edu.cn/captchaValidate.portal?captcha=mb75&what=captcha&value=mb75&_=
Request:
Accept:text/javascript, text/html, application/xml, text/xml, */*
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
Host:myportal.sxu.edu.cn
Referer:http://myportal.sxu.edu.cn/
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
X-Prototype-Version:1.5.0
X-Requested-With:XMLHttpRequest
4. http://myportal.sxu.edu.cn/userPasswordValidate.portal
POST request:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:173
Content-Type:application/x-www-form-urlencoded
Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
Host:myportal.sxu.edu.cn
Origin:http://myportal.sxu.edu.cn
Referer:http://myportal.sxu.edu.cn/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
Parameters:
// account
Login.Token1:2014241032
// password
Login.Token2:**********
goto:http://myportal.sxu.edu.cn/loginSuccess.portal
gotoOnFail:http://myportal.sxu.edu.cn/loginFailure.portal
Response:
Cache-Control:no-cache
Content-Language:zh-CN
Content-Length:83
Content-Type:text/html; charset=UTF-8
Date:Sun, 09 Jul 2017 09:12:08 GMT
Expires:Thu, 01 Dec 1994 16:00:00 GMT
Server:IBM_HTTP_Server
Set-Cookie:iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn
5. http://myportal.sxu.edu.cn/index.portal
Request:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23
Host:myportal.sxu.edu.cn
Referer:http://myportal.sxu.edu.cn/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
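From the captures above, the key to crawling the information portal is obtaining the following two cookies:
JSESSIONID
iPlanetDirectoryPro
JSESSIONID is issued the first time the login page is requested, while iPlanetDirectoryPro is issued after userPasswordValidate.portal is requested.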
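To make the two Set-Cookie headers a bit more concrete, here is a small sketch that parses the header lines from the capture with the JDK's own java.net.HttpCookie and prints the scope of each cookie. It is only an illustration: the crawler further down relies on HttpClient's cookie store instead, and the class name SetCookieDemo and its parse helper are made up for this example.

```java
import java.net.HttpCookie;
import java.util.List;

public class SetCookieDemo {
    /** Parses one Set-Cookie header line and returns the first cookie found in it. */
    static HttpCookie parse(String headerLine) {
        List<HttpCookie> cookies = HttpCookie.parse(headerLine);
        return cookies.get(0);
    }

    public static void main(String[] args) {
        // Header lines copied from captures 1 and 4
        HttpCookie session = parse(
            "Set-Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/");
        HttpCookie token = parse(
            "Set-Cookie:iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn");

        // JSESSIONID carries no Domain attribute, iPlanetDirectoryPro is scoped to .sxu.edu.cn
        System.out.println(session.getName() + " path=" + session.getPath()
            + " domain=" + session.getDomain());
        System.out.println(token.getName() + " path=" + token.getPath()
            + " domain=" + token.getDomain());
    }
}
```

The Domain=.sxu.edu.cn attribute on iPlanetDirectoryPro suggests it is meant to be shared across the school's subdomains, whereas JSESSIONID has no Domain attribute and so belongs to the portal host only.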
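The request to userPasswordValidate.portal must carry the JSESSIONID cookie, plus four parameters, among them:
// account
Login.Token1:2014241032
// password
Login.Token2:**********
The other two parameters (goto and gotoOnFail) can be copied as-is from the capture.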
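As a rough sketch of what those four parameters look like on the wire, the snippet below encodes them as an application/x-www-form-urlencoded body (the Content-Type seen in capture 4) using only the JDK. The class LoginFormDemo and its two helper methods are hypothetical names for this illustration, and "secret" is a stand-in password.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class LoginFormDemo {
    /** Builds the four login fields in the order they appear in the capture. */
    static Map<String, String> loginFields(String account, String password) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Login.Token1", account);
        fields.put("Login.Token2", password);
        fields.put("goto", "http://myportal.sxu.edu.cn/loginSuccess.portal");
        fields.put("gotoOnFail", "http://myportal.sxu.edu.cn/loginFailure.portal");
        return fields;
    }

    /** Encodes the fields as name=value pairs joined by '&', keeping insertion order. */
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeForm(loginFields("2014241032", "secret")));
    }
}
```

With HttpClient itself, the same fields would more commonly be wrapped in a UrlEncodedFormEntity and set on the HttpPost, which should produce an equivalent body.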
From the analysis above, our crawler needs to request the following pages:
1. Request login.portal to obtain JSESSIONID
2. Request userPasswordValidate.portal to obtain iPlanetDirectoryPro
3. Crawl the data
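The three steps above boil down to "collect cookies in steps 1 and 2, send them back in step 3". The sketch below replays that with the JDK's java.net.CookieManager and no network access: it feeds the two Set-Cookie values from the captures into a cookie jar and asks which Cookie header the jar would send to index.portal. CookieFlowDemo and its helpers are invented for this illustration; the real crawler below lets HttpClient's BasicCookieStore do the same bookkeeping.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CookieFlowDemo {
    /** Stores one Set-Cookie value in the jar as if it had arrived in a response from url. */
    static void receive(CookieManager jar, String url, String setCookieValue) {
        Map<String, List<String>> headers = new HashMap<>();
        headers.put("Set-Cookie", Collections.singletonList(setCookieValue));
        try {
            jar.put(URI.create(url), headers);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the cookies the jar would attach to a request for url, joined with "; ". */
    static String cookieHeaderFor(CookieManager jar, String url) {
        try {
            Map<String, List<String>> out =
                jar.get(URI.create(url), new HashMap<String, List<String>>());
            List<String> values = out.get("Cookie");
            return values == null ? "" : String.join("; ", values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieManager jar = new CookieManager();
        // Step 1: the login page hands out JSESSIONID
        receive(jar, "http://myportal.sxu.edu.cn/login.portal",
            "JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/");
        // Step 2: userPasswordValidate.portal hands out iPlanetDirectoryPro
        receive(jar, "http://myportal.sxu.edu.cn/userPasswordValidate.portal",
            "iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn");
        // Step 3: both cookies go along with the request for the page to crawl
        System.out.println(cookieHeaderFor(jar, "http://myportal.sxu.edu.cn/index.portal"));
    }
}
```

The printed header should contain both cookies, which is what the browser sent in capture 5.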
package info_system;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.http.Header;
import org.apache.http.HeaderElement;
import org.apache.http.HeaderElementIterator;
import org.apache.http.HeaderIterator;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.CookieStore;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.apache.http.message.BasicHeaderElementIterator;
import org.apache.http.protocol.HTTP;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;
import utils.ImageUtils;
public class Test {
public static final String host = "myportal.sxu.edu.cn";
public static final String url1 = "/login.portal";
public static final String url2 = "/captchaGenerate.portal";
public static final String url3 = "/captchaValidate.portal";
public static final String url4 = "/userPasswordValidate.portal";
public static final String url5 = "/index.portal";
public static void main(String[] args) throws URISyntaxException, ClientProtocolException, IOException {
ConnectionKeepAliveStrategy myStrategy = new ConnectionKeepAliveStrategy(){
@Override
public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
// Honor 'keep-alive' header
HeaderElementIterator it = new BasicHeaderElementIterator(response.headerIterator(HTTP.CONN_KEEP_ALIVE));
while (it.hasNext()) {
HeaderElement he = it.nextElement();
String param = he.getName();
String value = he.getValue();
if (value != null && param.equalsIgnoreCase("timeout")) {
try {
return Long.parseLong(value) * 1000;
} catch(NumberFormatException ignore) {
}
}
}
return 10*1000;
}
};
// HttpClient saves every cookie it receives via Set-Cookie in this store and
// sends it back on later requests, so no cookie needs to be built by hand.
CookieStore cookieStore = new BasicCookieStore();
CloseableHttpClient httpclient = HttpClients.custom()
.setDefaultCookieStore(cookieStore)
.setKeepAliveStrategy(myStrategy)
.build();
//1. Request the login page to obtain its cookie (JSESSIONID)
URI uri1 = new URIBuilder()
.setScheme("http")
.setHost(host)
.setPath(url1)
.build();
HttpGet httpGet = new HttpGet(uri1);
// Prints every response header; the cookies themselves end up in cookieStore
ResponseHandler<Void> responseHandler = new ResponseHandler<Void>() {
@Override
public Void handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
HeaderIterator hi = response.headerIterator();
while(hi.hasNext()){
Header h = (Header) hi.next();
System.out.println(h.getName()+" --> "+h.getValue());
}
return null;
}
};
httpclient.execute(httpGet,responseHandler);
cookieStore.getCookies().forEach(e->System.out.println(e));
boolean b = false;
/*
//2. Request the captcha image
URI uri2 = new URIBuilder()
.setScheme("http")
.setHost(host)
.setPath(url2)
.setParameter("s", "0.5123204417293254")
.build();
HttpGet httpGet2 = new HttpGet(uri2);
do{
ResponseHandler<Boolean> responseHandler2 = new ResponseHandler<Boolean>() {
@Override
public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
try {
ImageUtils.writeImg("test.jpg", response.getEntity().getContent());
return true;
} catch (Exception e) {
return false;
}
}
};
b = httpclient.execute(httpGet2,responseHandler2);
}while(!b);
// Type in the captcha by hand:
@SuppressWarnings("resource")
String captcha = new java.util.Scanner(System.in).nextLine();
//3. Ask the server to validate the captcha
URI uri3 = new URIBuilder()
.setScheme("http")
.setHost(host)
.setPath(url3)
.setParameter("captcha", captcha)
.setParameter("what", "captcha")
.setParameter("value", captcha)
.setParameter("_", "")
.build();
HttpGet httpGet3 = new HttpGet(uri3);
// The response body is compared against this server message ("invalid captcha")
final String error = "验证码非法";
ResponseHandler<Boolean> responseHandler3 = new ResponseHandler<Boolean>() {
@Override
public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
try {
String s = EntityUtils.toString(response.getEntity());
System.out.println(s);
if(s.equals(error)){
return false;
}
return true;
} catch (Exception e) {
return false;
}
}
};
b = httpclient.execute(httpGet3,responseHandler3);
if(b)
System.out.println("Captcha accepted");
*/
// Pause for a moment to give the server time to respond
try {
Thread.sleep(1000);
} catch (InterruptedException e1) {
e1.printStackTrace();
}
//4. Submit the account and password for validation
// Note: the browser sent these four fields in the POST body (see capture 4);
// here they travel in the query string of the POST URL instead.
URI uri4 = new URIBuilder()
.setScheme("http")
.setHost(host)
.setPath(url4)
.setParameter("Login.Token1", "2014241032")
// this parameter is the password
.setParameter("Login.Token2", "**********")
.setParameter("goto", "http://myportal.sxu.edu.cn/loginSuccess.portal")
.setParameter("gotoOnFail", "http://myportal.sxu.edu.cn/loginFailure.portal")
.build();
HttpPost httpPost4 = new HttpPost(uri4);
ResponseHandler<Boolean> responseHandler4 = new ResponseHandler<Boolean>() {
@Override
public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
try {
String s = EntityUtils.toString(response.getEntity());
System.out.println(s);
// The failure page contains this message ("user does not exist or wrong password")
if(s.contains("用户不存在或密码错误")){
return false;
}
return true;
} catch (Exception e) {
return false;
}
}
};
b = httpclient.execute(httpPost4,responseHandler4);
if(b){
System.out.println("Login validated");
}
//5. Request the home page
URI uri5 = new URIBuilder()
.setScheme("http")
.setHost(host)
.setPath(url5)
.build();
HttpGet httpGet5 = new HttpGet(uri5);
ResponseHandler<Boolean> responseHandler5 = new ResponseHandler<Boolean>() {
@Override
public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
try {
String s = EntityUtils.toString(response.getEntity());
//System.out.println(s);
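// If the page still shows the captcha prompt ("验证码: "), it is the login page again, so the login did not succeed
if(s.contains("验证码: ")){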
return false;
}
return true;
} catch (Exception e) {
return false;
}
}
};
b = httpclient.execute(httpGet5, responseHandler5);
if(b){
System.out.println("Home page fetched successfully");
}else{
System.out.println("Failed to fetch the home page");
}
}
}
// Helper for saving the captcha image to a local file
package utils;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
public class ImageUtils {
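/**
* Reads an image stream fully into a byte[].
* @param inStream the stream to read; it is closed once it has been fully read
* @return the bytes read from the stream
* @throws Exception if reading fails
*/
public static byte[] readImg(InputStream inStream) throws Exception{
ByteArrayOutputStream outStream = new ByteArrayOutputStream();
// Scratch buffer that holds one chunk of bytes
byte[] buffer = new byte[1024];
// Number of bytes read in each pass; -1 means the stream is exhausted
int len = 0;
// Read from the input stream into the buffer until nothing is left
while( (len=inStream.read(buffer)) != -1 ){
// Append the first len bytes of the buffer (starting at offset 0) to the output stream
outStream.write(buffer, 0, len);
}
// Close the input stream
inStream.close();
// Return everything collected in outStream as a byte array
return outStream.toByteArray();
}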
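/**
* Writes the image stream imgIs to the local file imgPath.
* @param imgPath destination file path
* @param imgIs image stream to save
* @throws Exception if reading or writing fails
*/
public static void writeImg(String imgPath,InputStream imgIs) throws Exception{
// Read the image as raw bytes, which works for any image format
byte[] data = readImg(imgIs);
// File object for the saved image; a relative path ends up in the project root by default
File imageFile = new File(imgPath);
// Create the output stream
FileOutputStream outStream = new FileOutputStream(imageFile);
// Write the data
outStream.write(data);
// Close the output stream
outStream.close();
}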
}
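As a side note, the read-then-write loop in ImageUtils can probably be shortened with java.nio.file.Files.copy, which streams straight into the target file. The sketch below is one possible alternative, not what the crawler above uses; SaveStreamDemo and its save method are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SaveStreamDemo {
    /**
     * Copies the whole stream into the file at imgPath, replacing an existing file,
     * and returns the number of bytes written. The stream is closed afterwards.
     */
    static long save(String imgPath, InputStream in) {
        try (InputStream source = in) {
            return Files.copy(source, Paths.get(imgPath), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in bytes; in the crawler this would be response.getEntity().getContent()
        byte[] fakeImage = {1, 2, 3, 4};
        String target = System.getProperty("java.io.tmpdir") + java.io.File.separator + "test.jpg";
        System.out.println(save(target, new ByteArrayInputStream(fakeImage)) + " bytes saved to " + target);
    }
}
```

The try-with-resources block closes the input stream even when the copy fails, which the hand-written loop above does not guarantee.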