Scraping Data from a School's Academic Affairs Website in Java (Part 2): Fetching the Timetable and Basic Processing

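The previous post covered a simulated login to the academic affairs site using HttpClient's POST method. Once login succeeds, fetching the timetable is straightforward. The whole point of logging in is to obtain cookies, yet the code in the previous post never seemed to manage cookies at all. In fact, HttpClient 4.x manages cookies automatically: as long as you keep using the same HttpClient instance, you do not need to handle the cookies the site returns yourself. So all that is left is to reuse the HttpClient instance from the login and send a GET request to the page that holds the timetable.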
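To make the "same instance, same cookies" idea concrete, here is a minimal offline sketch. It does not use HttpClient at all; it swaps in the JDK's own `java.net.CookieManager` to show the same principle: a cookie recorded from a login response is sent back on later requests only when the same store is reused. The host `jw.example.edu`, the cookie value, and the class and method names are all made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** JDK-only illustration of cookie reuse; names and URLs are hypothetical. */
class CookieReuseSketch {

    /** Records the Set-Cookie header of a (simulated) login response in the store. */
    static void recordLoginResponse(CookieManager store, String loginUrl, String setCookie) {
        Map<String, List<String>> responseHeaders = new HashMap<String, List<String>>();
        responseHeaders.put("Set-Cookie", Arrays.asList(setCookie));
        try {
            store.put(URI.create(loginUrl), responseHeaders);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the Cookie header values this store would attach to a request for laterUrl. */
    static List<String> cookiesSentTo(CookieManager store, String laterUrl) {
        try {
            Map<String, List<String>> headers =
                    store.get(URI.create(laterUrl), new HashMap<String, List<String>>());
            List<String> cookies = headers.get("Cookie");
            return cookies == null ? Collections.<String>emptyList() : cookies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The store that "saw" the login response
        CookieManager shared = new CookieManager();
        recordLoginResponse(shared, "http://jw.example.edu/login", "JSESSIONID=abc123; Path=/");

        // Reusing the same store: the session cookie is available for the timetable request
        System.out.println(cookiesSentTo(shared, "http://jw.example.edu/schedule"));

        // A fresh store knows nothing about the login, so no cookie is sent
        System.out.println(cookiesSentTo(new CookieManager(), "http://jw.example.edu/schedule"));
    }
}
```

This is why creating a new client for the timetable request would likely land you back on the login page: the session cookie lives in the instance that performed the login.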


Step 1: Go to the timetable page

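Because the logged-in HttpClient instance already holds the cookies (if this is unclear, see my previous post, "Scraping Data from a School's Academic Affairs Website in Java (Part 1): Simulated Login"), we can send a GET request straight to the page that holds the timetable: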

    /* Check that login has already succeeded */
    if (httpClient == null)
        return null;

    HttpGet httpGet = new HttpGet(SURL);
    HttpResponse response;
    try {
        response = httpClient.execute(httpGet);
    } catch (ClientProtocolException e) {
        e.printStackTrace();
        return null; // request failed, nothing to parse
    } catch (IOException e) {
        e.printStackTrace();
        return null; // request failed, nothing to parse
    }



Step 2: Get the page's HTML text

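Calling execute() returns an HttpResponse object that contains everything the server sent back. Call getEntity() on it to obtain an HttpEntity instance, then use the static method EntityUtils.toString to convert that entity to a String: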

    HttpEntity entity = response.getEntity();
    String htmlTxt = EntityUtils.toString(entity, "utf-8"); // decode as UTF-8 to avoid garbled Chinese text
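As a small illustration of why the charset matters here, the sketch below uses only the JDK: it encodes a Chinese string to UTF-8 bytes (standing in for a response body) and decodes it once with the right charset and once with the wrong one. The class and method names are made up for this example. Note also that, as far as I know, the second argument of EntityUtils.toString is a default that applies only when the response itself does not declare a charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** JDK-only illustration of charset-dependent decoding; names are hypothetical. */
class CharsetDecodeSketch {

    /** Decodes raw response bytes with the given charset. */
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // "课程表" (timetable), written with escapes so the source file encoding does not matter
        String original = "\u8bfe\u7a0b\u8868";
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text: prints true
        System.out.println(decode(body, StandardCharsets.UTF_8).equals(original));

        // Decoding the same bytes as ISO-8859-1 yields nine unrelated characters: prints false
        System.out.println(decode(body, StandardCharsets.ISO_8859_1).equals(original));
    }
}
```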


Step 3: Use jsoup to filter and tidy the HTML text

We now have the timetable data, but it is still raw HTML text, and very little useful information can be read from it directly.


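This is where jsoup comes in: use it to filter the text, pull out the useful information, and wrap it in a class.

Based on what the timetable on my school's academic affairs site contains, I extracted a few fields and combined them into a class: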

    class ScheduleItem {
        private String id;          // course ID
        private String name;        // course name
        private String message;
        private String teachers;    // course teacher(s)

        public String getId() {
            return id;
        }
        public void setId(String id) {
            this.id = id;
        }
        public String getName() {
            return name;
        }
        public void setName(String name) {
            this.name = name;
        }
        public String getMessage() {
            return message;
        }
        public void setMessage(String message) {
            this.message = message;
        }
        public String getTeachers() {
            return teachers;
        }
        public void setTeachers(String teachers) {
            this.teachers = teachers;
        }
    }

The code that filters the HTML with jsoup and fills in the course class is as follows:

    schedule = new ArrayList<ScheduleItem>();
    Document doc = Jsoup.parse(htmlTxt);
    Elements trs = doc.select("table").select("tr");
    /* Data rows start at index 3; the last row is skipped */
    for (int i = 3; i < trs.size() - 1; i++) {
        Elements tds = trs.get(i).select("td");
        ScheduleItem item = new ScheduleItem();
        item.setId(tds.get(0).text());
        item.setName(tds.get(2).text());
        item.setTeachers(tds.get(4).text());
        item.setMessage(tds.get(5).text());
        schedule.add(item);
    }
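One thing worth guarding against: the loop above assumes every data row has at least six cells, so tds.get(5) would throw an IndexOutOfBoundsException on a shorter row. The sketch below shows the same row and column selection with a size check added. To keep it runnable without jsoup, each table row is represented as a plain list of cell texts; the class name, constants, and sample rows are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** JDK-only sketch of the row/column selection with a bounds check; names are hypothetical. */
class RowFilterSketch {
    static final int FIRST_DATA_ROW = 3; // same start index as the jsoup loop
    static final int SKIPPED_AT_END = 1; // the jsoup loop stops one row before the end
    static final int MIN_CELLS = 6;      // the highest column index used is 5

    /** Returns {id, name, teachers, message} for every data row that has enough cells. */
    static List<String[]> extract(List<List<String>> rows) {
        List<String[]> result = new ArrayList<String[]>();
        for (int i = FIRST_DATA_ROW; i < rows.size() - SKIPPED_AT_END; i++) {
            List<String> cells = rows.get(i);
            if (cells.size() < MIN_CELLS) {
                continue; // skip malformed or spacer rows instead of throwing
            }
            result.add(new String[] {cells.get(0), cells.get(2), cells.get(4), cells.get(5)});
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> rows = new ArrayList<List<String>>();
        rows.add(Arrays.asList("title"));
        rows.add(Arrays.asList("header"));
        rows.add(Arrays.asList("header"));
        rows.add(Arrays.asList("001", "x", "Algorithms", "x", "Prof. A", "Mon 1-2"));
        rows.add(Arrays.asList("too", "short"));
        rows.add(Arrays.asList("footer"));

        for (String[] item : extract(rows)) {
            System.out.println(Arrays.toString(item));
        }
    }
}
```

With jsoup the same check would be a comparison on tds.size() before the tds.get(...) calls.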


The complete code for the class is as follows:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.http.Consts;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;


public class AnalogLogin{

	private static final String URL = "xxxxxx";// login URL
	private static final String SURL = "xxxxxx";// URL of the timetable page
	private static HttpClient httpClient;

	/**
	 * Log in to the academic affairs system
	 * @author xuan
	 * @param userName user name
	 * @param password password
	 * @return true on success, false on failure
	 */
	public boolean login(String userName,String password){  
		httpClient = new DefaultHttpClient(new ThreadSafeClientConnManager());


		HttpPost httpost = new HttpPost(URL);  
		List<NameValuePair> nvps = new ArrayList<NameValuePair>();

		nvps.add(new BasicNameValuePair("userName", userName));   
		nvps.add(new BasicNameValuePair("password", password));   
		nvps.add(new BasicNameValuePair("returnUrl", "null"));  

		/* Set the character encoding */
		httpost.setEntity(new UrlEncodedFormEntity(nvps, Consts.UTF_8));  

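		/* Attempt to log in */
		HttpResponse response;
		try {
			response = httpClient.execute(httpost);

			/* Release the connection so the same client can be reused for later requests */
			EntityUtils.consume(response.getEntity());

			/* Verify that both the request and the response succeeded */
			if(response.getStatusLine().getStatusCode() == 200){
				return true;
			}else{
				httpClient = null;
				return false;
			}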
		} catch (ClientProtocolException e) {  
			e.printStackTrace();  
		} catch (IOException e) {  
			e.printStackTrace();  
		}  

		return false;  
	}


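	public List<ScheduleItem> getSchedule(){
		/* Check that login has already succeeded */
		if(httpClient==null)
			return null;

		HttpGet httpGet = new HttpGet(SURL);
		HttpResponse response;
		try {
			response = httpClient.execute(httpGet);
			String htmlTxt = null;
			/* Verify that both the request and the response succeeded */
			if(response.getStatusLine().getStatusCode() == 200){
				HttpEntity entity = response.getEntity();
				htmlTxt = EntityUtils.toString(entity, "utf-8");
			}else {
				httpClient = null;
				return null;
			}
			List<ScheduleItem> schedule = new ArrayList<ScheduleItem>();
			Document doc = Jsoup.parse(htmlTxt);
			Elements trs = doc.select("table").select("tr");
			/* Same row loop as in step 3 */
			for(int i = 3;i < trs.size() - 1;i++){
				Elements tds = trs.get(i).select("td");
				ScheduleItem item = new ScheduleItem();
				item.setId(tds.get(0).text());
				item.setName(tds.get(2).text());
				item.setTeachers(tds.get(4).text());
				item.setMessage(tds.get(5).text());
				schedule.add(item);
			}
			return schedule;
		} catch (ClientProtocolException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}

		return null;
	}
}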

One more task finished at last; now I can get a good night's sleep.
