Scraping the university information portal with Java (Java crawler 04)

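I started learning web scraping last December, a full seven months ago. For most of that time I never really understood cookies or the HTTP protocol. After all this time I have finally figured them out, and finally managed to get a crawler logged in!
I only started learning HttpClient yesterday, so today I am practicing with it by scraping my school's information portal: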
http://myportal.sxu.edu.cn/login.portal


1. Capturing the traffic

The following information was captured with the Chrome browser's developer tools (shortcut F12):

1. http://myportal.sxu.edu.cn/
    Request:
        Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
        Accept-Encoding:gzip, deflate, sdch
        Accept-Language:zh-CN,zh;q=0.8
        Connection:keep-alive
        Cookie:JSESSIONID=0000MS7su8CHOtDnUq6dxd7YGdB:1b4e17ihg
        Host:myportal.sxu.edu.cn
        Upgrade-Insecure-Requests:1
        User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

    Response:
        Cache-Control:no-cache="set-cookie, set-cookie2"
        Content-Language:zh-CN
        Content-Length:8252
        Content-Type:text/html;charset=utf-8
        Date:Sun, 09 Jul 2017 09:04:57 GMT
        Expires:Thu, 01 Dec 1994 16:00:00 GMT
        Server:IBM_HTTP_Server
        Set-Cookie:iPlanetDirectoryPro=""; Expires=Thu, 01 Dec 1994 16:00:00 GMT; Path=/; Domain=.sxu.edu.cn
        Set-Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/

2. http://myportal.sxu.edu.cn/captchaGenerate.portal?s=0.5123204417293254
    Request:
        Accept:image/webp,image/*,*/*;q=0.8
        Accept-Encoding:gzip, deflate, sdch
        Accept-Language:zh-CN,zh;q=0.8
        Connection:keep-alive
        Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
        Host:myportal.sxu.edu.cn
        Referer:http://myportal.sxu.edu.cn/
        User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

3. http://myportal.sxu.edu.cn/captchaValidate.portal?captcha=mb75&what=captcha&value=mb75&_=
    Request:
        Accept:text/javascript, text/html, application/xml, text/xml, */*
        Accept-Encoding:gzip, deflate, sdch
        Accept-Language:zh-CN,zh;q=0.8
        Connection:keep-alive
        Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
        Host:myportal.sxu.edu.cn
        Referer:http://myportal.sxu.edu.cn/
        User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
        X-Prototype-Version:1.5.0
        X-Requested-With:XMLHttpRequest

4. http://myportal.sxu.edu.cn/userPasswordValidate.portal
    POST request:
        Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
        Accept-Encoding:gzip, deflate
        Accept-Language:zh-CN,zh;q=0.8
        Cache-Control:max-age=0
        Connection:keep-alive
        Content-Length:173
        Content-Type:application/x-www-form-urlencoded
        Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
        Host:myportal.sxu.edu.cn
        Origin:http://myportal.sxu.edu.cn
        Referer:http://myportal.sxu.edu.cn/
        Upgrade-Insecure-Requests:1
        User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
    Parameters:
        // account
        Login.Token1:2014241032
        // password
        Login.Token2:**********
        goto:http://myportal.sxu.edu.cn/loginSuccess.portal
        gotoOnFail:http://myportal.sxu.edu.cn/loginFailure.portal
    Response:
        Cache-Control:no-cache
        Content-Language:zh-CN
        Content-Length:83
        Content-Type:text/html; charset=UTF-8
        Date:Sun, 09 Jul 2017 09:12:08 GMT
        Expires:Thu, 01 Dec 1994 16:00:00 GMT
        Server:IBM_HTTP_Server
        Set-Cookie:iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn

5. http://myportal.sxu.edu.cn/index.portal
    Request:
        Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
        Accept-Encoding:gzip, deflate, sdch
        Accept-Language:zh-CN,zh;q=0.8
        Connection:keep-alive
        Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23
        Host:myportal.sxu.edu.cn
        Referer:http://myportal.sxu.edu.cn/
        Upgrade-Insecure-Requests:1
        User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
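
The Set-Cookie lines in responses 1 and 4 above are where the session state comes from. As a rough illustration that is not part of the original program, the sketch below parses such header lines with the JDK's own java.net.HttpCookie rather than HttpClient. The class name SetCookieDemo and the helper cookieValue are hypothetical names made up for this example.

```java
import java.net.HttpCookie;
import java.util.List;

class SetCookieDemo {
    /** Returns the value of the named cookie in a raw Set-Cookie header line, or null if it is absent. */
    static String cookieValue(String setCookieLine, String name) {
        // HttpCookie.parse accepts the line with or without the "Set-Cookie:" prefix.
        List<HttpCookie> cookies = HttpCookie.parse(setCookieLine);
        for (HttpCookie c : cookies) {
            if (c.getName().equals(name)) {
                return c.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Header lines copied from the capture above.
        String first = "Set-Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/";
        String login = "Set-Cookie: iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn";
        System.out.println(cookieValue(first, "JSESSIONID"));
        System.out.println(cookieValue(login, "iPlanetDirectoryPro"));
    }
}
```

The cookie value is returned exactly as the server sent it (percent-escapes included), which is also how it has to be sent back in the Cookie header.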

2. Analysis

Judging from the capture above, the key to scraping the information portal is obtaining the following two cookies:

JSESSIONID
iPlanetDirectoryPro

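JSESSIONID is issued the first time the login page is requested, while iPlanetDirectoryPro is issued in the response to userPasswordValidate.portal.
The request to userPasswordValidate.portal must carry the JSESSIONID cookie.
It also needs four parameters, two of which are: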

// account
Login.Token1:2014241032
// password
Login.Token2:**********

The other two parameters (goto and gotoOnFail) are copied verbatim from the capture.
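
The capture shows these four parameters travelling as an application/x-www-form-urlencoded body. As a minimal sketch of what that body looks like, assuming Java 10 or later for the URLEncoder.encode(String, Charset) overload, the example below builds it with the JDK alone. LoginFormBody, loginParams and encode are hypothetical names used only for illustration, and the password is a placeholder.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class LoginFormBody {
    /** Collects the four login parameters from the capture, in the captured order. */
    static Map<String, String> loginParams(String account, String password) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("Login.Token1", account);
        params.put("Login.Token2", password);
        params.put("goto", "http://myportal.sxu.edu.cn/loginSuccess.portal");
        params.put("gotoOnFail", "http://myportal.sxu.edu.cn/loginFailure.portal");
        return params;
    }

    /** Encodes the parameters as key=value pairs joined by '&', percent-encoding keys and values. */
    static String encode(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // "placeholder-password" is not a real credential.
        System.out.println(encode(loginParams("2014241032", "placeholder-password")));
    }
}
```

Note that the two URLs are percent-encoded inside the body (":" becomes %3A and "/" becomes %2F), which is why they look different on the wire than in the parameter list.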

From this analysis, our crawler needs to make the following requests:
1. Request login.portal to obtain JSESSIONID
2. Request userPasswordValidate.portal to obtain iPlanetDirectoryPro
3. Scrape the data
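
To make the cookie bookkeeping behind these three steps concrete, here is a small offline sketch (no network access) that replays the captured Set-Cookie headers into the JDK's java.net.CookieManager and then asks which Cookie header would accompany step 3. It only illustrates the idea of a cookie jar and is not the HttpClient code used in this post; CookieJarDemo and cookieHeaderAfterLogin are hypothetical names, and Java 9 or later is assumed for Map.of and List.of.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.List;
import java.util.Map;

class CookieJarDemo {
    /** Replays steps 1 and 2 into a cookie jar and returns the Cookie header for step 3. */
    static String cookieHeaderAfterLogin() {
        try {
            // Default policy: accept cookies whose domain matches the requested host.
            CookieManager jar = new CookieManager();

            // Step 1: the login page response sets JSESSIONID.
            jar.put(URI.create("http://myportal.sxu.edu.cn/login.portal"),
                    Map.of("Set-Cookie", List.of("JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/")));

            // Step 2: the password validation response sets iPlanetDirectoryPro.
            jar.put(URI.create("http://myportal.sxu.edu.cn/userPasswordValidate.portal"),
                    Map.of("Set-Cookie", List.of("iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn")));

            // Step 3: ask the jar which cookies belong on a request for the home page.
            Map<String, List<String>> headers =
                    jar.get(URI.create("http://myportal.sxu.edu.cn/index.portal"), Map.of());
            // Joining covers JDKs that return one element per cookie as well as a single joined element.
            return String.join("; ", headers.getOrDefault("Cookie", List.of()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cookieHeaderAfterLogin());
    }
}
```

The resulting header carries both cookies, matching the Cookie line of request 5 in the capture.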

3. Writing the code

package info_system;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.http.Header;
import org.apache.http.HeaderElement;
import org.apache.http.HeaderElementIterator;
import org.apache.http.HeaderIterator;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.CookieStore;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.apache.http.message.BasicHeaderElementIterator;
import org.apache.http.protocol.HTTP;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

import utils.ImageUtils;

public class Test {
    public static final String host = "myportal.sxu.edu.cn";
    public static final String url1 = "/login.portal";
    public static final String url2 = "/captchaGenerate.portal";
    public static final String url3 = "/captchaValidate.portal";
    public static final String url4 = "/userPasswordValidate.portal";
    public static final String url5 = "/index.portal";

    public static void main(String[] args) throws URISyntaxException, ClientProtocolException, IOException {
        ConnectionKeepAliveStrategy myStrategy = new ConnectionKeepAliveStrategy(){
            @Override
            public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
                // Honor 'keep-alive' header
                HeaderElementIterator it = new BasicHeaderElementIterator(response.headerIterator(HTTP.CONN_KEEP_ALIVE));
                while (it.hasNext()) {
                    HeaderElement he = it.nextElement();
                    String param = he.getName();
                    String value = he.getValue();
                    if (value != null && param.equalsIgnoreCase("timeout")) {
                        try {
                            return Long.parseLong(value) * 1000;
                        } catch(NumberFormatException ignore) {
                        }
                    }
                }
                return 10*1000;
            }
        };

        // The cookie store starts empty; HttpClient fills it from the server's
        // Set-Cookie headers (JSESSIONID first, then iPlanetDirectoryPro) and
        // sends the stored cookies back automatically on later requests.
        CookieStore cookieStore = new BasicCookieStore();
        CloseableHttpClient httpclient = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .setKeepAliveStrategy(myStrategy)
                .build();

        //1. Request the login page to obtain its cookie
        URI uri1 = new URIBuilder()
                .setScheme("http")
                .setHost(host)
                .setPath(url1)
                .build();
        HttpGet httpGet = new HttpGet(uri1);
        // Prints every response header; the cookies themselves are captured by the cookie store
        ResponseHandler<Void> responseHandler = new ResponseHandler<Void>() {
            @Override
            public Void handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
                HeaderIterator hi = response.headerIterator();
                while(hi.hasNext()){
                    Header h = (Header) hi.next();
                    System.out.println(h.getName()+" --> "+h.getValue());
                }
                return null;
            }
        };
        httpclient.execute(httpGet,responseHandler);
        cookieStore.getCookies().forEach(e->System.out.println(e));
        boolean b = false;
/*
        //2. Request the CAPTCHA image
        URI uri2 = new URIBuilder()
                .setScheme("http")
                .setHost(host)
                .setPath(url2)
                .setParameter("s", "0.5123204417293254")
                .build();
        HttpGet httpGet2 = new HttpGet(uri2);
        do{
            ResponseHandler<Boolean> responseHandler2 = new ResponseHandler<Boolean>() {
                @Override
                public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
                    try {
                        ImageUtils.writeImg("test.jpg", response.getEntity().getContent());
                        return true;
                    } catch (Exception e) {
                        return false;
                    }
                }
            };
            b = httpclient.execute(httpGet2,responseHandler2);
        }while(!b);

        // Type in the CAPTCHA by hand:
        @SuppressWarnings("resource")
        String captcha = new java.util.Scanner(System.in).nextLine();

        //3. Ask the server to validate the CAPTCHA
        URI uri3 = new URIBuilder()
                .setScheme("http")
                .setHost(host)
                .setPath(url3)
                .setParameter("captcha", captcha)
                .setParameter("what", "captcha")
                .setParameter("value", captcha)
                .setParameter("_", "")
                .build();
        HttpGet httpGet3 = new HttpGet(uri3);
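        // Server reply meaning "invalid CAPTCHA"; kept in Chinese so it matches the response
        final String error = "验证码非法";
        ResponseHandler<Boolean> responseHandler3 = new ResponseHandler<Boolean>() {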
            @Override
            public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
                try {
                    String s = EntityUtils.toString(response.getEntity());
                    System.out.println(s);
                    if(s.equals(error)){
                        return false;
                    }
                    return true;
                } catch (Exception e) {
                    return false;
                }
            }
        };
        b = httpclient.execute(httpGet3,responseHandler3);
        if(b)
            System.out.println("CAPTCHA accepted");
*/      
        // Pause briefly to give the server time to respond
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e1) {
            e1.printStackTrace();
        }
        //4. Submit the account and password for validation
        URI uri4 = new URIBuilder()
                .setScheme("http")
                .setHost(host)
                .setPath(url4)
                .setParameter("Login.Token1", "2014241032")
                // this parameter is the password
                .setParameter("Login.Token2", "**********")
                .setParameter("goto", "http://myportal.sxu.edu.cn/loginSuccess.portal")
                .setParameter("gotoOnFail", "http://myportal.sxu.edu.cn/loginFailure.portal")
                .build();
        HttpPost httpPost4 = new HttpPost(uri4);
        ResponseHandler<Boolean> responseHandler4 = new ResponseHandler<Boolean>() {
            @Override
            public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
                try {
                    String s = EntityUtils.toString(response.getEntity());
                    System.out.println(s);
                    // Server message meaning "user does not exist or password is incorrect"
                    if(s.contains("用户不存在或密码错误")){
                        return false;
                    }
                    return true;
                } catch (Exception e) {
                    return false;
                }
            }
        };
        b = httpclient.execute(httpPost4,responseHandler4);
        if(b){
            System.out.println("Login validated");
        }

        //5. Request the home page
        URI uri5 = new URIBuilder()
                .setScheme("http")
                .setHost(host)
                .setPath(url5)
                .build();
        HttpGet httpGet5 = new HttpGet(uri5);
        ResponseHandler<Boolean> responseHandler5 = new ResponseHandler<Boolean>() {
            @Override
            public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
                try {
                    String s = EntityUtils.toString(response.getEntity());
                    //System.out.println(s);
                    // "验证码:" ("CAPTCHA:") appears only on the login page, so finding it means we are not logged in
                    if(s.contains("验证码:")){
                        return false;
                    }
                    return true;
                } catch (Exception e) {
                    return false;
                }
            }
        };
        b = httpclient.execute(httpGet5, responseHandler5);
        if(b){
            System.out.println("Home page fetched successfully");
        }else{
            System.out.println("Failed to fetch the home page");
        }
    }
}

// Utility for saving the CAPTCHA image to a local file

package utils;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;

public class ImageUtils {  

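    /**
     * Reads an image stream fully into a byte array and closes the stream.
     * @param inStream the stream to read from
     * @return the bytes read from the stream
     * @throws Exception if reading fails
     */
    public static byte[] readImg(InputStream inStream) throws Exception{
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        // Byte buffer used to copy the data in chunks
        byte[] buffer = new byte[1024];
        // Number of bytes read in each pass; -1 means the end of the stream
        int len = 0;
        // Read from the input stream into the buffer until it is exhausted
        while( (len=inStream.read(buffer)) != -1 ){
            // Append the first len bytes of the buffer to the output stream
            outStream.write(buffer, 0, len);
        }
        // Close the input stream
        inStream.close();
        // Return everything collected in memory as a byte array
        return outStream.toByteArray();
    }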

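    /**
     * Writes the image stream imgIs to the local file imgPath.
     * @param imgPath path of the file to write
     * @param imgIs image stream to read from
     * @throws Exception if reading or writing fails
     */
    public static void writeImg(String imgPath,InputStream imgIs) throws Exception{
        // Read the whole image as raw bytes
        byte[] data = readImg(imgIs);
        // File object for the image; a relative path resolves against the working directory
        File imageFile = new File(imgPath);
        // try-with-resources closes the output stream even if the write fails
        try (FileOutputStream outStream = new FileOutputStream(imageFile)) {
            outStream.write(data);
        }
    }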
}  
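
As a possible shorter alternative to writeImg, the JDK's java.nio.file.Files.copy can stream the image straight to disk without buffering it all in memory first. The sketch below is only a suggestion under that assumption; ImageSaver, save and roundTrip are hypothetical names, and roundTrip exists purely to demonstrate the helper with a temporary file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class ImageSaver {
    /** Copies the stream to imgPath, replacing any existing file, and returns the number of bytes written. */
    static long save(String imgPath, InputStream in) {
        // try-with-resources closes the input stream whether or not the copy succeeds.
        try (InputStream src = in) {
            return Files.copy(src, Paths.get(imgPath), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: saves data to a temporary file, reads it back, then deletes the file. */
    static byte[] roundTrip(byte[] data) {
        try {
            Path tmp = Files.createTempFile("captcha", ".jpg");
            save(tmp.toString(), new ByteArrayInputStream(data));
            byte[] readBack = Files.readAllBytes(tmp);
            Files.delete(tmp);
            return readBack;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Three arbitrary bytes stand in for real image data.
        System.out.println(roundTrip(new byte[] {1, 2, 3}).length);
    }
}
```

In the crawler this would be called as ImageSaver.save("test.jpg", response.getEntity().getContent()) in place of ImageUtils.writeImg.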
