Zhengfang Academic Affairs System: Login Without Manual CAPTCHA Entry and Grade Scraping in Java

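In my sophomore year I taught myself PHP and tried to use it to simulate a login and scrape my grades. After a long struggle I could barely get as far as the home page, so I gave up.
I hope this article helps students who, like me back then, are working out on their own how to simulate a login to the academic affairs system.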

    • Result
    • Overall flow
    • Tools
    • Reconnaissance
    • Getting started
        • Login
        • Requesting grades
    • CAPTCHA recognition


Result

[Image 1: screenshot of the result]


Overall flow

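  • Create an HttpClient configured with a CookieStore
  • Use that HttpClient to request the CAPTCHA, save the image locally, and recognize it
  • Request the login page and extract __VIEWSTATE
  • POST the form data to log in
  • Once logged in, the session cookie gives access to the other pages
  • POST to the grades page, then parse the grades out of the response with Jsoup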

Tools

  • IDEA
  • Fiddler
  • Jsoup
  • HTTPClient
  • JDK 1.8

Reconnaissance

The first step is of course reconnaissance: log in through the web page and capture the traffic with Fiddler. The result is shown below.

[Image 2: Fiddler capture of the login request]

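Field             How to obtain it
__VIEWSTATE       Extracted from the login page
txtUserName       Student ID, entered by the user
TextBox2          Password, entered by the user
txtSecretCode     CAPTCHA, recognized and filled in automatically (implementation covered later)
RadioButtonList1  Account type, sent URL-encoded as GB2312, e.g. URLEncoder.encode("学生", "gb2312")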
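
To make the RadioButtonList1 row concrete, here is a minimal sketch of what GB2312 URL-encoding does to the value 学生 (written with Unicode escapes so the source file encoding does not matter). The class and method names are made up for illustration. The second call shows what happens when an already-encoded value is encoded again, which is worth keeping in mind if the HTTP library also encodes form values on its own.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class Gb2312FormValueDemo {

    // Percent-encodes a form value using the GB2312 charset.
    static String encodeGb2312(String value) {
        try {
            return URLEncoder.encode(value, "gb2312");
        } catch (UnsupportedEncodingException e) {
            // GB2312 ships with standard JDKs; treat its absence as a setup error.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String once = encodeGb2312("\u5b66\u751f"); // "学生"
        System.out.println(once);                   // %D1%A7%C9%FA
        // Encoding the encoded text again escapes each '%' as %25.
        System.out.println(encodeGb2312(once));     // %25D1%25A7%25C9%25FA
    }
}
```

The first output matches the value seen in the captured request; the second is what the server would receive if the value were encoded twice.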

Getting started

Login

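Short and blunt: here is the code. I have not really cleaned it up, so bear with it.

I am too lazy to post the complete code here; message me if you want it.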

// Login method
public static void login(CloseableHttpClient httpClient, User user) {
    while (true) {
        // Path where the CAPTCHA image is saved
        String codeSavePath = "D:\\jiaowu\\src\\main\\resources\\codeImage\\" + UUID.randomUUID().toString().replaceAll("-", "") + ".gif";
        System.out.println(" [ INFO ] Requesting CAPTCHA...\n");
        String vcode = new VCodeToText().getCodeText(codeSavePath, httpClient);
        String loginhtml = HttpUtils.getHtml(httpClient, "http://221.232.159.27/default2.aspx", null);
        System.out.println(" [ INFO ] Fetching __VIEWSTATE...\n");
        String __VIEWSTATE = Jsoup.parse(loginhtml).select("input[name=__VIEWSTATE]").attr("value");

        // Build the login form parameters
        Map<String, String> params = new HashMap<>();
        params.put("__VIEWSTATE", __VIEWSTATE);
        params.put("txtUserName", user.getUsername());
        params.put("Textbox1", "");
        params.put("Textbox2", user.getPassword());
        params.put("txtSecretCode", vcode);
        // Raw text: HttpUtils.post encodes form values as GB2312 (wire form %D1%A7%C9%FA).
        // A pre-encoded value would have its '%' signs encoded a second time.
        params.put("RadioButtonList1", "学生");
        params.put("Button1", "");
        params.put("lbLanguage", "");
        params.put("hidPdrs", "");
        params.put("hidsc", "");

        CloseableHttpResponse httpResponse = HttpUtils.post(httpClient, params, "http://221.232.159.27/default2.aspx", null);
        if (httpResponse.getStatusLine().getStatusCode() == 302) {
            System.out.println("Login succeeded!\n---------------------------");
            break;
        } else if (httpResponse.getStatusLine().getStatusCode() == 200) {
            String failHtml = HttpUtils.getHtmlFromResponse(httpResponse);
            // The error text sits inside alert('...') in the second <script> element
            String text = Jsoup.parse(failHtml).select("script").get(1).data();
            String substring = text.substring(text.indexOf("'") + 1, text.indexOf(")") - 1);
            // Server message meaning "incorrect CAPTCHA"; must match the page text exactly
            if (substring.equals("验证码不正确!!")) {
                System.out.println(" [INFO] CAPTCHA recognition failed, retrying login automatically...\n---------------------------");
            } else {
                // Any other failure (e.g. wrong password): print the message and exit
                System.err.println(substring);
                System.exit(-1);
            }
        }
    }
}

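Note that login success is detected by a 302 status rather than 200: after a successful login the system redirects to the personal home page, whereas a failed login returns the original page with status 200. (With HttpClient 4.x's default redirect strategy, a 302 answer to a POST is not followed automatically, which is why the code gets to see it.)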
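
On the 200 path, login() reads the server's error text out of an alert('...') call in the returned page. The sketch below isolates that step and uses a regular expression instead of index arithmetic. The script text is a hypothetical sample modelled on the message that login() checks for, and AlertMessageDemo and extractAlert are illustrative names, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlertMessageDemo {

    // Captures the text between the quotes of the first alert('...') call.
    private static final Pattern ALERT = Pattern.compile("alert\\('([^']*)'\\)");

    // Returns the alert text, or null when the script contains no alert call.
    static String extractAlert(String script) {
        Matcher m = ALERT.matcher(script);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical script body; the real page may contain more statements.
        String script = "alert('验证码不正确!!');";
        System.out.println(extractAlert(script));
    }
}
```

Returning null for a page without an alert lets the caller tell "no message found" apart from an empty message, instead of failing with an index error.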

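When login fails because of a wrong CAPTCHA, the code fetches a new CAPTCHA and tries again. With my current trained glyph set the recognition rate is about 80%, mainly because similar letters such as j/i and l/i are often confused. Even at 80% accuracy, the probability that two consecutive attempts both fail is only 4% (0.2 × 0.2), which is perfectly acceptable. For other failures, such as a wrong password, I was again lazy: the code prints the error message and exits.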
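
The retry arithmetic can be checked with a few lines of code. This is only a sketch: it assumes every attempt is independent and has the same success rate, which the real recognizer may not satisfy, and the class and method names are made up.

```java
class RetryOddsDemo {

    // Probability that the first n attempts all fail, assuming independent
    // attempts with the same per-attempt success rate.
    static double allFail(double successRate, int n) {
        return Math.pow(1.0 - successRate, n);
    }

    // Expected number of attempts until the first success (geometric distribution).
    static double expectedAttempts(double successRate) {
        return 1.0 / successRate;
    }

    public static void main(String[] args) {
        double p = 0.8; // recognition rate quoted in the text
        System.out.printf("two failures in a row:   %.3f%n", allFail(p, 2));
        System.out.printf("three failures in a row: %.3f%n", allFail(p, 3));
        System.out.printf("expected attempts:       %.2f%n", expectedAttempts(p));
    }
}
```

Under these assumptions a third consecutive failure happens less than 1% of the time, and a login takes 1.25 attempts on average.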

Requesting grades

Likewise, capture the traffic with Fiddler while viewing the grades page. The result:
[Image 3: Fiddler capture of the grades request]

The request headers:
[Image 4: request headers of the grades request]

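Pay attention to how the request URL is composed, and note that the request carries a Referer header. This matters: without a Referer the server answers "Object moved to here", which shows up as status code 302.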

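Field    How to obtain it
xh       Student ID (appended to the URL)
xm       Name (URL-encoded, then appended to the URL)
txtQSCJ  Minimum grade. This is really a filter, so just set the value directly; it can also be scraped from the page, but it is always the default, 0
txtZZCJ  Maximum grade; same as above, default 100
Button2  Text of the query button, sent URL-encoded as GB2312, e.g. URLEncoder.encode("在校学习成绩查询", "gb2312"); the exact text depends on the button's value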
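
As a small illustration of the xh and xm rows, the sketch below assembles a grades URL of the shape used in this article (xscj.aspx?xh=...&xm=...&gnmkdm=N121604). The helper names, the student ID, and the name 张三 are invented for the example; the host and the gnmkdm value are copied from the article's own request and may differ on other installations.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class ScoreUrlDemo {

    // URL-encodes a query value as GB2312, the charset the site uses.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "gb2312");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds the grades-page URL from the student ID (xh) and name (xm).
    static String buildScoreUrl(String host, String xh, String xm) {
        return "http://" + host + "/xscj.aspx?xh=" + enc(xh)
                + "&xm=" + enc(xm)
                + "&gnmkdm=N121604";
    }

    public static void main(String[] args) {
        // Hypothetical student: ID 20180001, name "张三" (as Unicode escapes).
        String url = buildScoreUrl("221.232.159.27", "20180001", "\u5f20\u4e09");
        System.out.println(url);
        // The same URL is also what gets sent as the Referer header.
    }
}
```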

[Image 5]

Because the CookieStore was set when the HttpClient was created, after a successful login we only need to use the same HttpClient to access other pages.

System.out.print("Hey handsome, [ " + name + " ] logged in successfully! View grades? [y/n]: ");
Scanner scanner = new Scanner(System.in);
String s = scanner.nextLine();
if (s.equals("y")) {
    // The grades URL doubles as the Referer for both requests
    String querycjUrl = "http://221.232.159.27/xscj.aspx?xh=" + account + "&xm=" + URLEncoder.encode(name, "gb2312") + "&gnmkdm=N121604";
    String cjhtml = HttpUtils.getHtml(httpClient, querycjUrl, querycjUrl);
    String cj__VIEWSTATE = Jsoup.parse(cjhtml).getElementById("Form1").select("input[name=__VIEWSTATE]").attr("value");

    Map<String, String> params = new HashMap<>();
    params.put("__VIEWSTATE", cj__VIEWSTATE);
    params.put("ddlXN", "");
    params.put("ddlXQ", "");
    params.put("txtQSCJ", "0");
    params.put("txtZZCJ", "100");
    // Raw button text: HttpUtils.post encodes form values as GB2312 itself
    params.put("Button2", "在校学习成绩查询");
    CloseableHttpResponse httpResponse = HttpUtils.post(httpClient, params, querycjUrl, querycjUrl);
    if (httpResponse.getStatusLine().getStatusCode() == 200) {
        String querybody = EntityUtils.toString(httpResponse.getEntity(), "gb2312");
        EntityUtils.consume(httpResponse.getEntity());
        //System.out.println(querybody);
        // The grades are rows of the table with id DataGrid1
        Elements tbody = Jsoup.parse(querybody).getElementById("DataGrid1").getElementsByTag("tbody");
        Elements trs = tbody.first().select("tr");
        System.out.println(" *********************************************************************** ");
        for (Element tr : trs) {
            Elements tds = tr.select("td");
            // Print columns 1, 2 and 4 of each row
            System.out.print(tds.get(1).ownText() + "\t\t\t\t\t\t\t\t\t\t" + tds.get(2).ownText() + "\t\t\t\t\t\t\t\t\t\t" + tds.get(4).ownText() + "\n");
        }
    }
}

There is not much to say about this code; it is a bit messy but should be readable. Here is the HttpUtils class.


import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * @author: Ant
 * @Date: 2018/09/07 09:29
 * @Description:
 */
public class HttpUtils {

    public static String getHtml(CloseableHttpClient httpClient, String url, String referer) {
        String htmlbody = "";
        HttpGet httpget = new HttpGet(url);
        httpget.setHeader("X-Requested-With", "XMLHttpRequest");
        httpget.setHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36");
        if (referer != null) {
            httpget.setHeader("Referer", referer);
        }
        httpget.setConfig(RequestConfig.custom().setConnectTimeout(3 * 1000).setConnectionRequestTimeout(3 * 1000).setSocketTimeout(3 * 1000).build());
        // try-with-resources closes the response even when reading the body fails
        try (CloseableHttpResponse response = httpClient.execute(httpget)) {
            int code = response.getStatusLine().getStatusCode();
            if (code == 200 || code == 302) {
                HttpEntity entity = response.getEntity();
                htmlbody = EntityUtils.toString(entity, "gb2312");
                EntityUtils.consume(entity);
            }
        } catch (Exception e1) {
            e1.printStackTrace();
        }
        return htmlbody;
    }

    public static CloseableHttpResponse post(CloseableHttpClient httpClient, Map<String, String> params, String targetUrl, String referer) {
        CloseableHttpResponse httpResponse = null;
        try {
            HttpPost post = new HttpPost();
            post.setURI(new URI(targetUrl));
            if (referer != null) {
                post.setHeader("Referer", referer);
            }
            // Convert the parameter map into form fields
            List<NameValuePair> list = new ArrayList<>();
            for (Map.Entry<String, String> elem : params.entrySet()) {
                list.add(new BasicNameValuePair(elem.getKey(), elem.getValue()));
            }
            if (list.size() > 0) {
                UrlEncodedFormEntity entity = new UrlEncodedFormEntity(list, "gb2312");
                post.setEntity(entity);
            }
            httpResponse = httpClient.execute(post);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return httpResponse;
    }

    public static String getHtmlFromResponse(CloseableHttpResponse httpResponse) {
        String htmlbody = "";
        try {
            HttpEntity entity = httpResponse.getEntity();
            htmlbody = EntityUtils.toString(entity, "gb2312");
            EntityUtils.consume(entity);
        } catch (Exception e1) {
            e1.printStackTrace();
        }
        return htmlbody;
    }
}

CAPTCHA recognition

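I have already trained a glyph set for the Zhengfang system's CAPTCHAs. There are plenty of write-ups online about how to train one, but most lack detail and are hard to follow, so I will cover it properly in a separate article; I have other things to do today, so not now. If you are in a hurry, message me on QQ: 891575283
[Image 6]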

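The rough flow: after fetching the CAPTCHA image, remove the background, split the image into individual characters, compare each piece against the trained glyph set, and take the closest match as the result. Both classes used in the code above are posted below.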

VCodeToText.java

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

import java.io.*;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * @author: Ant
 * @Date: 2018/09/07 09:17
 * @Description:
 */
public class VCodeToText {

    /**
     * Saves the CAPTCHA image to a local file.
     * @param savePath   local path to write the image to
     * @param httpClient client that holds the session cookies
     */
    public void save(String savePath, CloseableHttpClient httpClient){
        try {
            HttpGet httpGet = new HttpGet();
            httpGet.setURI(new URI("http://221.232.159.27/CheckCode.aspx"));
            httpGet.setHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36");

            // try-with-resources closes the response and both streams, even on errors
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    // FileOutputStream creates the file if it does not exist yet
                    try (InputStream is = entity.getContent();
                         OutputStream os = new FileOutputStream(new File(savePath))) {
                        byte[] bytes = new byte[1024];
                        int length;
                        while ((length = is.read(bytes)) != -1) {
                            os.write(bytes, 0, length);
                        }
                    }
                    EntityUtils.consume(entity);
                }
            }
        } catch (URISyntaxException | IOException e) {
            e.printStackTrace();
        }
    }

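    /**
     * Requests the CAPTCHA, saves it locally, and recognizes it.
     * @param savePath   local path to write the image to
     * @param httpClient client that holds the session cookies
     * @return the recognized CAPTCHA text
     */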
    public  String getCodeText(String savePath, CloseableHttpClient httpClient){
        String vcode = "";
        try {
            save(savePath,httpClient);
            vcode = VCodeOCR.ocr(savePath);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return vcode;
    }
}

VCodeOCR.java

import javax.imageio.ImageIO;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * @author: Ant
 * @Date: 2018/09/03 17:33
 * @Description:
 */
public class VCodeOCR {

    // Cache of the trained glyph set: glyph image -> character
    private static Map<BufferedImage, String> trainMap = null;

    public static int isBlue(int colorInt) {
        Color color = new Color(colorInt);
        int rgb = color.getRed() + color.getGreen() + color.getBlue();
        if (rgb == 153) {
            return 1;
        }
        return 0;
    }

    public static int isBlack(int colorInt) {
        Color color = new Color(colorInt);
        if (color.getRed() + color.getGreen() + color.getBlue() <= 100) {
            return 1;
        }
        return 0;
    }

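    /**
     * Removes the background and noise pixels. The characters in Zhengfang CAPTCHAs are blue,
     * so blue pixels are turned black and all others white, giving black text on a white background.
     * @param picFile path of the CAPTCHA image
     * @return the cropped black-and-white image
     * @throws Exception if the image cannot be read
     */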
    public static BufferedImage removeBackgroud(String picFile)
            throws Exception {
        BufferedImage img = ImageIO.read(new File(picFile));
        img = img.getSubimage(5, 1, img.getWidth() - 5, img.getHeight() - 2);
        img = img.getSubimage(0, 0, 50, img.getHeight());
        int width = img.getWidth();
        int height = img.getHeight();
        for (int x = 0; x < width; x++) {
            for (int y = 0; y < height; y++) {
                if (isBlue(img.getRGB(x, y)) == 1) {
                    img.setRGB(x, y, Color.BLACK.getRGB());
                } else {
                    img.setRGB(x, y, Color.WHITE.getRGB());
                }
            }
        }
        return img;
    }

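    /**
     * Splits the CAPTCHA into four equal-width character images.
     * @param img the black-and-white CAPTCHA image
     * @return the four character images, left to right
     * @throws Exception
     */
    public static List<BufferedImage> splitImage(BufferedImage img)
            throws Exception {
        List<BufferedImage> subImgs = new ArrayList<>();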
        int width = img.getWidth() / 4;
        int height = img.getHeight();
        subImgs.add(img.getSubimage(0, 0, width, height));
        subImgs.add(img.getSubimage(width, 0, width, height));
        subImgs.add(img.getSubimage(width * 2, 0, width, height));
        subImgs.add(img.getSubimage(width * 3, 0, width, height));
        return subImgs;
    }

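    /**
     * Loads the trained glyph set. The first character of each file name is the character it represents.
     *
     * @return map from glyph image to character
     * @throws Exception if a glyph image cannot be read
     */
    public static Map<BufferedImage, String> loadTrainData() throws Exception {
        if (trainMap == null) {
            Map<BufferedImage, String> map = new HashMap<>();
            File dir = new File("C:\\Users\\Administrator\\Desktop\\image\\train");
            File[] files = dir.listFiles();
            // listFiles() returns null when the directory is missing
            if (files != null) {
                for (File file : files) {
                    map.put(ImageIO.read(file), file.getName().charAt(0) + "");
                }
            }
            trainMap = map;
        }
        return trainMap;
    }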

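    /**
     * Compares the image against every glyph in the trained set and returns the character
     * (first character of the file name) of the glyph with the fewest differing pixels.
     *
     * @param img a single-character image
     * @param map the trained glyph set
     * @return the recognized character, or "#" if nothing matched
     */
    public static String getSingleCharOcr(BufferedImage img,
                                          Map<BufferedImage, String> map) {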
        String result = "#";
        int width = img.getWidth();
        int height = img.getHeight();
        int min = width * height;
        for (BufferedImage bi : map.keySet()) {
            int count = 0;
            if (Math.abs(bi.getWidth() - width) > 2)
                continue;
            int widthmin = width < bi.getWidth() ? width : bi.getWidth();
            int heightmin = height < bi.getHeight() ? height : bi.getHeight();
            Label1:
            for (int x = 0; x < widthmin; ++x) {
                for (int y = 0; y < heightmin; ++y) {
                    if (isBlack(img.getRGB(x, y)) != isBlack(bi.getRGB(x, y))) {
                        count++;
                        if (count >= min)
                            break Label1;
                    }
                }
            }
            if (count < min) {
                min = count;
                result = map.get(bi);
            }
        }
        return result;
    }

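    /**
     * Recognizes a CAPTCHA.
     * @param file local path of the CAPTCHA image to recognize
     * @return the recognized text
     * @throws Exception if the image or the glyph set cannot be read
     */
    public static String ocr(String file) throws Exception {
        BufferedImage img = removeBackgroud(file);
        List<BufferedImage> listImg = splitImage(img);
        Map<BufferedImage, String> map = loadTrainData();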
        String result = "";
        for (BufferedImage bi : listImg) {
            result += getSingleCharOcr(bi, map);
        }
        return result;
    }
}
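
The matching step in getSingleCharOcr can be tried without any image files. The sketch below is a simplified stand-in, not the author's class: it builds tiny black-and-white glyphs in memory, counts differing pixels over the overlapping area, and picks the template with the smallest count. All names and the 3×4 glyph shapes are made up for illustration, and the width check and early exit of the original are left out.

```java
import java.awt.image.BufferedImage;
import java.util.LinkedHashMap;
import java.util.Map;

class TemplateMatchDemo {
    static final int BLACK = 0xFF000000;
    static final int WHITE = 0xFFFFFFFF;

    // Builds a black-on-white image from rows of '#' (black) and '.' (white).
    static BufferedImage glyph(String... rows) {
        BufferedImage img = new BufferedImage(rows[0].length(), rows.length, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < rows.length; y++) {
            for (int x = 0; x < rows[y].length(); x++) {
                img.setRGB(x, y, rows[y].charAt(x) == '#' ? BLACK : WHITE);
            }
        }
        return img;
    }

    // Counts the pixels that differ within the area both images share.
    static int diff(BufferedImage a, BufferedImage b) {
        int w = Math.min(a.getWidth(), b.getWidth());
        int h = Math.min(a.getHeight(), b.getHeight());
        int count = 0;
        for (int x = 0; x < w; x++) {
            for (int y = 0; y < h; y++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) count++;
            }
        }
        return count;
    }

    // Returns the label of the template with the fewest differing pixels, or "#" if there are none.
    static String closest(BufferedImage img, Map<String, BufferedImage> templates) {
        String best = "#";
        int min = Integer.MAX_VALUE;
        for (Map.Entry<String, BufferedImage> e : templates.entrySet()) {
            int d = diff(img, e.getValue());
            if (d < min) {
                min = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, BufferedImage> templates = new LinkedHashMap<>();
        templates.put("i", glyph(".#.", "...", ".#.", ".#."));
        templates.put("l", glyph(".#.", ".#.", ".#.", ".#."));
        // An "l" with one noisy pixel in the bottom-left corner.
        BufferedImage sample = glyph(".#.", ".#.", ".#.", "##.");
        System.out.println(closest(sample, templates)); // l
    }
}
```

The two templates differ in a single pixel, so one or two noisy pixels in the wrong place are enough to flip the result. That is consistent with the i/l and i/j confusions mentioned earlier.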

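Building on this, other pages such as the timetable or credits, or even a genuine one-click course evaluation instead of pasting JS into the browser console, are easy to implement once you know the required fields and the corresponding URLs. Ideally, wrap it all in an interface that other programs can call directly.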

If this article helped you, please give it a like~
