Spring Boot + jsoup in practice: scraping and parsing a school timetable

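Preface

My mini program recently needed to show timetables and grades. This would have been very simple in Python, but once I wired the Python code into Spring Boot I ran into all sorts of errors. After several failed attempts (my skills still fall short), I switched to Spring Boot + jsoup, and persistence finally paid off!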

Hands-on

  • Import the dependencies
 <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-aop</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>

        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>


        <!-- data -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.44</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid</artifactId>
            <version>1.1.11</version>
        </dependency>

        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>1.3.2</version>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-redis</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-thymeleaf</artifactId>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.47</version>
        </dependency>

        <!-- Google Guava utility library -->
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>26.0-jre</version>
        </dependency>

        <!-- Apache Commons utilities -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.7</version>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.10</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>

 </dependencies>

All the dependencies the project needs are listed together here; this article mainly uses jsoup and httpclient.

  • All configuration files
    [Image 1: the project's configuration files]

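  • Analyze the school website
    I used Fiddler 4 to capture the traffic. First open the school website and enter the account and password.
    [Image 2: the school website login page]
    Then switch to Fiddler and press Ctrl+X to clear the captured sessions, switch back and click the login button, look for the URL whose Result is 200, and inspect its WebForms tab, as in the image below.
    [Image 3: the WebForms parameters of the login request in Fiddler]
    These are the parameters the login request sends. There was a small pitfall here: when I simulated the login in Python, I nearly went crazy looking for the login button, and only later discovered that the button is an image. So analyze problems calmly and don't take things for granted. The x and y here are the click coordinates inside that image, so on each submission they vary within the image's bounds.
    Next, analyze the timetable URL in the same way.
    [Image 4: the timetable request captured in Fiddler]
    When the page source contains your own courses, congratulations, you have found the right one! Note that this URL is a GET request, which matters when writing the code. The first URL in the image above shows the timetable directly in IE; the site runs on the Qiangzhi (强智) system, so it only opens in IE, which hurts. But there is nothing to be done. Sigh! The road ahead is long and far, and I shall……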
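
To illustrate the x and y parameters described above: an HTML image button submits the click position inside the image as two extra form fields. The sketch below picks a random point inside the button instead of a fixed one. The class name, method name and the 74x22 button size are assumptions made for illustration only; measure the real image on your own site.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: produce the x/y form fields an image login button would send. */
final class ImageButtonClick {
    // Hypothetical button size in pixels (not taken from the original site).
    static final int BUTTON_WIDTH = 74;
    static final int BUTTON_HEIGHT = 22;

    /** Returns {x, y} with 0 <= x < width and 0 <= y < height. */
    static int[] randomClick(int width, int height) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("width and height must be positive");
        }
        int x = ThreadLocalRandom.current().nextInt(width);
        int y = ThreadLocalRandom.current().nextInt(height);
        return new int[] {x, y};
    }
}
```

The two values can then be sent as strings, for example String.valueOf(click[0]) for x and String.valueOf(click[1]) for y, in place of hard-coded numbers.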

  • Simulated login (main code)

    public HttpClient login(String USERNAME, String PASSWORD) {
        HttpClient httpclient = new DefaultHttpClient(new ThreadSafeClientConnManager());
        HttpPost httpost = new HttpPost(URL);


        List<NameValuePair> nvps = new ArrayList<NameValuePair>();

        nvps.add(new BasicNameValuePair("USERNAME", USERNAME));
        nvps.add(new BasicNameValuePair("PASSWORD", PASSWORD));
        nvps.add(new BasicNameValuePair("useDogCode", ""));
        nvps.add(new BasicNameValuePair("x", "37"));
        nvps.add(new BasicNameValuePair("y", "11"));


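        /* URL-encode the form fields as UTF-8 */
        httpost.setEntity(new UrlEncodedFormEntity(nvps, Consts.UTF_8));

        /* Attempt to log in */
        try {
            HttpResponse response = httpclient.execute(httpost);
            // Consume the body so the pooled connection is released for the next request
            EntityUtils.consume(response.getEntity());
        } catch (IOException e) {   // ClientProtocolException is a subclass of IOException
            e.printStackTrace();
        }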
        return httpclient;
    }

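nvps carries the WebForms parameters, and useDogCode is sent as an empty value. The method returns the httpclient so that the logged-in session is kept: HttpClient 4 manages cookies automatically, so these few simple steps are enough.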

    public HttpEntity getCurriculum(HttpClient httpClient,  String xueqi) {
//        String SURL = "……?method=goListKbByXs&sql=&xnxqh=" + xueqi + "&zc=" + zhou + "&xs0101id=" + USERNAME;
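        // The URL above fetched one week at a time; that turned out to put extra load on the server, so the whole personal timetable is fetched in one request instead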

        String SURL = "……?method=goListKbByXs&istsxx=no&xnxqh="+xueqi+"&zc=&xs0101id="+USERNAME;
        HttpGet httpGet = new HttpGet(SURL);
        HttpResponse re;
        try {
            re = httpClient.execute(httpGet);
            HttpEntity en = re.getEntity();
            return en;
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

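Next, send the GET request to the timetable URL. The "……?" in the code stands for your own school's address; what follows it are query parameters, which differ from site to site and must be analyzed case by case.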
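
As a possible alternative to concatenating the query string by hand, the sketch below assembles the same parameters from a map and URL-encodes each value. The class and method names are made up for illustration, and the base address is passed in by the caller because the real one is site-specific; the parameter names are the ones used in the code above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: build the timetable GET URL from its query parameters. */
final class CurriculumUrl {

    /** baseUrl is the part before '?'; term and studentId fill xnxqh and xs0101id. */
    static String build(String baseUrl, String term, String studentId) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("method", "goListKbByXs");
        params.put("istsxx", "no");
        params.put("xnxqh", term);
        params.put("zc", "");               // empty week: ask for the whole term
        params.put("xs0101id", studentId);

        StringBuilder url = new StringBuilder(baseUrl).append('?');
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            first = false;
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

With this helper, the string passed to HttpGet would be CurriculumUrl.build(baseUrl, xueqi, USERNAME), where baseUrl is whatever address your own capture shows.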

    HttpEntity en = getCurriculum(login(USERNAME, PASSWORD), Term());
    String con = null;
    try {
        con = EntityUtils.toString(en, "utf-8");
    } catch (IOException e) {
        e.printStackTrace();
    }

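Term() here is a method I wrote to get the current term, so that the URL changes dynamically.
Once all of this is in place, con holds the HTML source of your school's timetable page.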
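
The article does not show Term(), so here is one way such a method might look. It is only a sketch under two assumptions that you should check against your own site: that the term id has the form "2018-2019-2" (start year, end year, term number), and that the first term runs from August to January and the second from February to July. The class and method names are hypothetical.

```java
import java.time.LocalDate;

/** Sketch: derive a term id such as "2018-2019-2" from a date. */
final class TermSketch {

    /** Assumed calendar: Aug-Jan is term 1, Feb-Jul is term 2. */
    static String termOf(LocalDate date) {
        int year = date.getYear();
        int month = date.getMonthValue();
        if (month >= 8) {
            // August to December: first term of the academic year that starts now
            return year + "-" + (year + 1) + "-1";
        }
        if (month == 1) {
            // January still belongs to the first term of the previous academic year
            return (year - 1) + "-" + year + "-1";
        }
        // February to July: second term of the academic year that started last year
        return (year - 1) + "-" + year + "-2";
    }

    /** Term id for today. */
    static String currentTerm() {
        return termOf(LocalDate.now());
    }
}
```

Keeping the date as a parameter makes the rule easy to check against known dates before using currentTerm() in the URL.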

  • Parse the source
    After all that effort to get the source, we can finally enjoy parsing it.
    public void turnTo() {
        String term = Term();
        HttpEntity en = getCurriculum(login(USERNAME, PASSWORD), term);
        if (en == null) {
            return;     // the request failed, so there is nothing to parse
        }
        String con = null;
        try {
            con = EntityUtils.toString(en, "utf-8");
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (con == null) {
            return;     // the body could not be read
        }
        //System.out.println(con);

        Document doc = Jsoup.parse(con);
        // Cell ids look like "<block>-<weekday>-2"; blocks 1 to 5 are the five double periods of the day
        String[] sections = {"1-2", "3-4", "5-6", "7-8", "9-10"};
        for (int block = 1; block <= sections.length; block++) {
            for (int i = 1; i < 8; i++) {       // i is the weekday, Monday to Sunday
                Element element = doc.getElementById(block + "-" + i + "-2");
                Save(i, element, sections[block - 1], term);
            }
        }
    }

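Every timetable cell has its own id, so looking up the ids yields each class; the prefixes "1-" to "5-" stand for the first to fifth double period of the day. After that the data can be stored. I have to say, jsoup is really powerful at parsing HTML!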
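
The id scheme described above can be written down as a small mapping, which makes it easy to check that all 35 cells (5 double periods times 7 days) are covered. This is a sketch only: the class and method names are made up for illustration, and the "-2" suffix and the section labels are taken from the code in this article, not from any documentation of the site.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: map timetable cell ids ("<block>-<weekday>-2") to section labels. */
final class TimetableCells {
    private static final String[] SECTION_LABELS = {"1-2", "3-4", "5-6", "7-8", "9-10"};
    static final int DAYS_PER_WEEK = 7;

    /** Element id of the cell for a double period (1..5) and weekday (1..7). */
    static String cellId(int block, int weekday) {
        checkBlock(block);
        if (weekday < 1 || weekday > DAYS_PER_WEEK) {
            throw new IllegalArgumentException("weekday must be 1.." + DAYS_PER_WEEK);
        }
        return block + "-" + weekday + "-2";
    }

    /** Section label stored with the course, e.g. block 3 gives "5-6". */
    static String sectionLabel(int block) {
        checkBlock(block);
        return SECTION_LABELS[block - 1];
    }

    /** Every cell id in block-major order, mapped to its section label. */
    static Map<String, String> allCells() {
        Map<String, String> cells = new LinkedHashMap<>();
        for (int block = 1; block <= SECTION_LABELS.length; block++) {
            for (int weekday = 1; weekday <= DAYS_PER_WEEK; weekday++) {
                cells.put(cellId(block, weekday), sectionLabel(block));
            }
        }
        return cells;
    }

    private static void checkBlock(int block) {
        if (block < 1 || block > SECTION_LABELS.length) {
            throw new IllegalArgumentException("block must be 1.." + SECTION_LABELS.length);
        }
    }
}
```

Each key of allCells() is a value that could be passed to doc.getElementById, and the matching map value is the section string to store.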

  • Entity layer
import lombok.Data;

/**
 * @Author free-go
 * @Date Created in 9:49 2019/6/11
 **/
@Data
public class Course {

    private String jwcAccount;  // student number

    private String weekday;     // day of the week

    private String section;     // which periods of the day

    private String subjectName;     // course name

    private String className;   // class (student group) attending

    private String teacher;     // teacher

    private String weekSeq;     // detailed list of weeks the course runs

    private String weekStr;     // week range of the course

    private String location;    // classroom

    private String xnxqh;   // term

}
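
The entity stores both a week range (weekStr) and a detailed week list (weekSeq), but the article does not show how one is derived from the other or what the exact formats are. Purely as an illustration, and assuming weekStr looks like "1-8,10-16" and weekSeq is a comma-separated list of week numbers, the conversion might be sketched as follows. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: expand an assumed week-range string such as "1-3,5" into single weeks. */
final class WeekRange {

    /** "1-3,5" becomes [1, 2, 3, 5]; blank parts are skipped. */
    static List<Integer> expand(String weekStr) {
        List<Integer> weeks = new ArrayList<>();
        for (String part : weekStr.split(",")) {
            String p = part.trim();
            if (p.isEmpty()) {
                continue;
            }
            int dash = p.indexOf('-');
            // A part without a dash is a single week, so start and end coincide.
            int start = Integer.parseInt(dash < 0 ? p : p.substring(0, dash).trim());
            int end = Integer.parseInt(dash < 0 ? p : p.substring(dash + 1).trim());
            if (start > end) {
                throw new IllegalArgumentException("bad range: " + p);
            }
            for (int week = start; week <= end; week++) {
                weeks.add(week);
            }
        }
        return weeks;
    }

    /** Joins the expanded weeks with commas, e.g. "1,2,3,5". */
    static String toWeekSeq(String weekStr) {
        StringBuilder seq = new StringBuilder();
        for (int week : expand(weekStr)) {
            if (seq.length() > 0) {
                seq.append(',');
            }
            seq.append(week);
        }
        return seq.toString();
    }
}
```

If your site prints the weeks in another form, only the parsing inside expand needs to change.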

  • DAO layer
import org.springframework.stereotype.Repository;


/**
 * @author test
 * @date 19-4-28
 * *****************
 * function:
 */
@Repository
public interface CourseDAO {
    int addCourse(Course course);
}

  • Mapper XML file
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper
        PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-mapper.dtd">

    <mapper namespace="curriculum.Dao.CourseDAO">

    <insert id="addCourse" >
    insert into course(jwcAccount,weekday,section,subjectName,className,teacher,weekSeq,weekStr,location,xnxqh) values(#{jwcAccount},#{weekday},#{section},#{subjectName},#{className},#{teacher},#{weekSeq},#{weekStr},#{location},#{xnxqh})
    </insert>

    </mapper>

OK, all done!
