HttpClient + Jsoup: Simulating a Login and Parsing HTML to Extract Information
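I have recently been working on an all-in-one campus Android client. The main idea is to bring together the information from the school's various websites on a single platform for students to browse.
The approach is as follows:
I will use the Guangdong University of Technology library website as an example.
Goal: log in to the library with a personal account and retrieve that account's borrowing records.
Login URL: http://222.200.98.171:81/login.aspx
We will use Chrome's developer tools here (press F12 in the browser to open them).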
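Open the source of the login page and look at its form tag.
The form contains a lot of markup, and we need to extract from it the fields required for login. Both input and select tags can contribute to a login form's content; this form only contains input tags, so extracting those is enough. The code is as follows:
It consists of two methods: initLoginParmas(String userName, String passWord) and getLoginFormData(String url).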
- /**
- * Initializes the login form parameters.
- *
- * @param userName
- * @param passWord
- * @return
- * @throws ParseException
- * @throws IOException
- */
- public static List<NameValuePair> initLoginParmas(String userName,
- String passWord) throws ParseException, IOException {
- List<NameValuePair> parmasList = new ArrayList<NameValuePair>();
- HashMap<String, String> parmasMap = getLoginFormData(LoginUrl);
- Set<String> keySet = parmasMap.keySet();
-
- for (String temp : keySet) {
- if (temp.contains("Username")) {
- parmasMap.put(temp, userName);
- } else if (temp.contains("txtPas")) {
- parmasMap.put(temp, passWord);
- }
- }
-
- Set<String> keySet2 = parmasMap.keySet();
- System.out.println("表单内容:");
- for (String temp : keySet2) {
- System.out.println(temp + " = " + parmasMap.get(temp));
- }
- for (String temp : keySet2) {
- parmasList.add(new BasicNameValuePair(temp, parmasMap.get(temp)));
- }
-
- // System.out.println("initParams \n" + parmasMap);
-
- return parmasList;
-
- }
- /**
- * Gets the contents of the login form's input fields.
- *
- * @param url
- * @return
- * @throws IOException
- * @throws ParseException
- */
- public static HashMap<String, String> getLoginFormData(String url)
- throws ParseException, IOException {
- Document document = Jsoup.parse(getHtml(url));
- Elements element1 = document.getElementsByTag("form");// find all form elements
- Element element = element1.select("[method=post]").first();// keep the first form that is submitted via POST
- Elements elements = element.select("input[name]");// collect the form's input tags that have a name attribute
- HashMap<String, String> parmas = new HashMap<String, String>();
- for (Element temp : elements) {
- parmas.put(temp.attr("name"), temp.attr("value"));// store each input's name and value in the map
- }
- return parmas;
- }
The resulting form data is:
表单内容:
- ctl00$ContentPlaceHolder1$txtlogintype = 0
- __VIEWSTATE = /wEPDwULLTE0MjY3MDAxNzcPZBYCZg9kFgoCAQ8PFgIeCEltYWdlVXJsBRt+XGltYWdlc1xoZWFkZXJvcGFjNGdpZi5naWZkZAICDw8WAh4EVGV4dAUt5bm/5Lic5bel5Lia5aSn5a2m5Zu+5Lmm6aaG5Lmm55uu5qOA57Si57O757ufZGQCAw8PFgIfAQUcMjAxM+W5tDAz5pyIMDXml6UgIOaYn+acn+S6jGRkAgQPZBYEZg9kFgQCAQ8WAh4LXyFJdGVtQ291bnQCCBYSAgEPZBYCZg8VAwtzZWFyY2guYXNweAAM55uu5b2V5qOA57SiZAICD2QWAmYPFQMTcGVyaV9uYXZfY2xhc3MuYXNweAAM5YiG57G75a+86IiqZAIDD2QWAmYPFQMOYm9va19yYW5rLmFzcHgADOivu+S5puaMh+W8lWQCBA9kFgJmDxUDCXhzdGIuYXNweAAM5paw5Lmm6YCa5oqlZAIFD2QWAmYPFQMUcmVhZGVycmVjb21tZW5kLmFzcHgADOivu+iAheiNkOi0rWQCBg9kFgJmDxUDE292ZXJkdWVib29rc19mLmFzcHgADOaPkOmGkuacjeWKoWQCBw9kFgJmDxUDEnVzZXIvdXNlcmluZm8uYXNweAAP5oiR55qE5Zu+5Lmm6aaGZAIID2QWAmYPFQMbaHR0cDovL2xpYnJhcnkuZ2R1dC5lZHUuY24vAA/lm77kuabppobpppbpobVkAgkPZBYCAgEPFgIeB1Zpc2libGVoZAIDDxYCHwJmZAIBD2QWBAIDD2QWBAIBDw9kFgIeDGF1dG9jb21wbGV0ZQUDb2ZmZAIHDw8WAh8BZWRkAgUPZBYGAgEPEGRkFgFmZAIDDxBkZBYBZmQCBQ8PZBYCHwQFA29mZmQCBQ8PFgIfAQWlAUNvcHlyaWdodCAmY29weTsyMDA4LTIwMDkuIFNVTENNSVMgT1BBQyA0LjAxIG9mIFNoZW56aGVuIFVuaXZlcnNpdHkgTGlicmFyeS4gIEFsbCByaWdodHMgcmVzZXJ2ZWQuPGJyIC8+54mI5p2D5omA5pyJ77ya5rex5Zyz5aSn5a2m5Zu+5Lmm6aaGIEUtbWFpbDpzenVsaWJAc3p1LmVkdS5jbmRkZL5QuJMrEZz+0UxuTVpXZ/EaY5A4
- ctl00$ContentPlaceHolder1$txtPas_Lib = (password withheld)
- __EVENTVALIDATION = /wEWBQKa7ezdCwKOmK5RApX9wcYGAsP9wL8JAqW86pcIaBhXmFYzd5pGDTk/afln2TfArPw=
- ctl00$ContentPlaceHolder1$txtUsername_Lib = 3110006527
- ctl00$ContentPlaceHolder1$btnLogin_Lib = 登录
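To make concrete what happens to these name/value pairs when they are posted, here is a minimal sketch of how such a map could be serialized into an application/x-www-form-urlencoded body using only the JDK. The class name FormBodySketch and the method encodeForm are illustrative and not part of the original program, where UrlEncodedFormEntity performs this step.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only: it mimics the kind of body
// that a URL-encoded form submission carries.
final class FormBodySketch {

    // Joins the fields as name=value pairs separated by '&', percent-encoding
    // both names and values in the given charset.
    static String encodeForm(Map<String, String> fields, String charsetName) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), charsetName))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), charsetName));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charsetName, e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("ctl00$ContentPlaceHolder1$txtlogintype", "0");
        fields.put("ctl00$ContentPlaceHolder1$txtUsername_Lib", "3110006527");
        // GBK is the charset the login request in this article uses.
        System.out.println(encodeForm(fields, "GBK"));
    }
}
```

Characters such as `$`, `/`, `+` and `=` in the field names and in the __VIEWSTATE and __EVENTVALIDATION values are escaped on the wire. ASP.NET pages generally expect those hidden fields to be posted back unchanged, which is why the program copies every input from the form rather than sending only the user name and password.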
The next step is to log in and obtain authorization, that is, to obtain the cookie.
The code is as follows:
- /**
- * Logs in to the library.
- *
- * @return the HTML of the page shown after login
- * @throws ClientProtocolException
- * @throws IOException
- */
- public static String login() throws ClientProtocolException, IOException {
- // Replace the second argument with your own password
- List<NameValuePair> parmasList = initLoginParmas("3110006527", "your-password");
- HttpPost post = new HttpPost(LoginUrl);
- // Disable automatic redirects so that the Cookie and Location headers of the first response can be read
- post.getParams().setParameter(ClientPNames.HANDLE_REDIRECTS, false);
- // The form is submitted in GBK encoding
- post.setHeader("Content-Type",
- "application/x-www-form-urlencoded;charset=gbk");
- post.setEntity(new UrlEncodedFormEntity(parmasList, "GBK"));
- HttpResponse response = new DefaultHttpClient().execute(post);
- // Save the cookie for later requests
- cookie = response.getFirstHeader("Set-Cookie").getValue();
- // System.out.println("cookie= " + cookie);
- // The redirect address, which leads to the home page
- location = response.getFirstHeader("Location").getValue();
- // Build the home page URL
- mainUrl = Host + location;
- String html = getHtml(mainUrl);
- return html;
-
- }
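When logging in to obtain the cookie, the server responds with status code 302. If the client follows the redirect automatically, it goes straight to the Location address, and the response headers you see are no longer those of the login response but those returned by the redirect target. The cookie is carried in the headers of the login response, so take particular care to add this statement: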
- post.getParams().setParameter(ClientPNames.HANDLE_REDIRECTS,false);
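This sets a parameter on the POST request that prevents the redirect, so the Cookie and the Location (needed to reach the home page) can be read from the login response: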
- cookie =response.getFirstHeader("Set-Cookie").getValue();
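The same idea, not following the 302 so that the login response's own headers stay visible, can also be expressed with the JDK's HttpURLConnection. This is a swapped-in alternative to Apache HttpClient, not what the program in this article uses, and the class and method names are illustrative. A minimal sketch that only prepares the request without sending it:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper showing how redirects can be switched off per request.
final class NoRedirectSketch {

    // Prepares (but does not send) a POST whose 3xx response will be handed
    // back to the caller instead of being followed automatically.
    static HttpURLConnection prepareLoginPost(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // keep the 302 response visible
            conn.setRequestMethod("POST");
            conn.setDoOutput(true); // the form body would be written to getOutputStream()
            conn.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded;charset=gbk");
            return conn;
        } catch (IOException e) {
            throw new IllegalStateException("could not prepare request to " + url, e);
        }
    }
}
```

Once the body has been written and the response read, conn.getHeaderField("Set-Cookie") and conn.getHeaderField("Location") would return the headers of the login response itself rather than those of the redirect target.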
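Next, build the home page URL from the Location, parse the home page with Jsoup, and work out the address of the page that shows my borrowed books.
From here on, every request to another page needs the cookie, so when issuing a POST or GET, call addHeader() or setHeader() to set the Cookie on the request.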
- /**
- * Fetches the HTML source of a page.
- *
- * @param url
- * @return
- * @throws ParseException
- * @throws IOException
- */
-
- private static String getHtml(String url) throws ParseException,
- IOException {
- HttpGet get = new HttpGet(url);
- // Compare string contents, not references; send the cookie once it has been obtained
- if (!cookie.isEmpty()) {
- get.addHeader("Cookie", cookie);
- }
- HttpResponse httpResponse = new DefaultHttpClient().execute(get);
- HttpEntity entity = httpResponse.getEntity();
- return EntityUtils.toString(entity);
- }
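One detail worth noting: the value of a Set-Cookie header usually carries attributes such as path or HttpOnly after the first semicolon, whereas a Cookie request header is meant to carry only name=value pairs. The program above sends the Set-Cookie value back unchanged; a stricter variant could trim it first. A minimal sketch, with illustrative names and a made-up session value:

```java
// Hypothetical helper, not part of the original program.
final class CookieSketch {

    // Returns only the leading name=value pair of a Set-Cookie header value,
    // dropping attributes such as "path=/" or "HttpOnly".
    static String toCookieHeader(String setCookieValue) {
        if (setCookieValue == null) {
            return "";
        }
        int semicolon = setCookieValue.indexOf(';');
        String pair = semicolon >= 0
                ? setCookieValue.substring(0, semicolon)
                : setCookieValue;
        return pair.trim();
    }

    public static void main(String[] args) {
        // The session id below is invented for the example.
        System.out.println(toCookieHeader("ASP.NET_SessionId=abc123; path=/; HttpOnly"));
    }
}
```

This handles a single Set-Cookie header only. If a server sets several cookies, each header would need the same treatment and the pairs would be joined with "; ".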
Inspecting the page source in Chrome reveals this tag:
- <a href="bookborrowed.aspx" >当前借阅情况和续借</a>
The bookborrowed.aspx part is what we need.
The code to obtain it is as follows:
- public static void getMyBorrowedBooks() {
- try {
- Document document = Jsoup.parse(login());
- Elements elements1 = document
- .getElementsContainingOwnText("当前借阅情况和续借");// find the <a> tag by its link text
- String url = elements1.first().attr("href");
- borrowedBooksUrl = mainUrl.substring(0,
- mainUrl.lastIndexOf("/") + 1) + url;// join the href onto the directory part of mainUrl to form the borrowing page URL
- getBookBorrowedData(getHtml(borrowedBooksUrl));
-
- } catch (IOException e) {
- // TODO Auto-generated catch block
- e.printStackTrace();
- }
- }
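Joining the href onto mainUrl with substring works for a plain file name such as bookborrowed.aspx, but it would not handle an href like ../bookinfo.aspx. As a sketch of a more general approach, java.net.URI can resolve a relative reference against the page URL. The class name is illustrative, and the page URL used below is an assumption, since the real value of mainUrl depends on the Location header returned at login.

```java
import java.net.URI;

// Hypothetical helper, not part of the original program.
final class LinkResolverSketch {

    // Resolves an href found in a page against the URL of that page,
    // following the usual relative-reference rules ("../", absolute paths, and so on).
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed page URL, for illustration only.
        String page = "http://222.200.98.171:81/user/userinfo.aspx";
        System.out.println(resolve(page, "bookborrowed.aspx"));
        System.out.println(resolve(page, "../bookinfo.aspx?ctrlno=571892"));
    }
}
```

URI.create rejects hrefs that contain characters such as spaces, so real-world links may need escaping first. Jsoup itself can also do the resolution: parsing with Jsoup.parse(html, baseUri) and then calling absUrl("href") on the element returns an absolute URL.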
Once we have the address of the borrowing page, we request it and fetch its source.
The data we need is this part (only an excerpt is shown):
- <tr>
- <td width="5%">
-
-
- 续满
-
-
- </td>
- <td width="10%">2013-04-10</td>
- <td width="35%"><a href="../bookinfo.aspx?ctrlno=571892" target="_blank">编写高质量代码 [专著]:改善Java程序的151个建议=Writing solw Java cove:151 suggestons to improve your Java program/秦小波著</a></td>
- <td width="5%"> </td>
- <td width="8%">中文图书</td>
- <td width="7%">A2973844</td>
- <td width="10%">2012-12-05</td>
- </tr>
-
- <tr>
The following code uses Jsoup to pick out the data:
- /**
- * Gets the detailed borrowing data (List<BookEntity>).
- *
- * @param src
- * @return List<BookEntity>
- */
- private static List<BookEntity> getBookBorrowedData(String src) {
- List<BookEntity> data = new ArrayList<BookEntity>();
- Document document = Jsoup.parse(src);
- Element element = document.select("[id=borrowedcontent]").first()
- .getElementsByTag("table").first();
- Elements elements2 = element.getElementsByTag("tr");
- for (Element temp2 : elements2) {
- Elements elements3 = temp2.getElementsByTag("td");
- BookEntity entity = new test().new BookEntity()
- .setIsFullData(elements3.get(0).text())
- .setData2Return(elements3.get(1).text())
- .setName(elements3.get(2).text())
- .setData2Borrowed(elements3.get(6).text());
- data.add(entity);
-
- }
- data.remove(0); // the first row is presumably the table header, so drop it
- System.out.println("借书情况\n");
-
- for (BookEntity temp : data) {
- System.out.println(temp.getName() + "\n" + temp.getData2Borrowed()
- + "\n" + temp.getData2Return() + "\n"
- + temp.getIsFullData());
- }
- return data;
-
- }
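The loop above assumes that every row has at least seven cells and that the first row is the header. If the table layout ever differs, elements3.get(6) would throw an IndexOutOfBoundsException. A defensive variant could skip short rows. The sketch below shows that idea on plain lists of cell text so that it runs without Jsoup; the class, the Loan type, and the sample rows are all illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of the original program.
final class LoanRowsSketch {

    // Simplified stand-in for BookEntity, holding the four columns the program reads.
    static final class Loan {
        final String renewStatus;
        final String dueDate;
        final String title;
        final String borrowDate;

        Loan(String renewStatus, String dueDate, String title, String borrowDate) {
            this.renewStatus = renewStatus;
            this.dueDate = dueDate;
            this.title = title;
            this.borrowDate = borrowDate;
        }
    }

    // Maps table rows (each a list of cell texts) to Loan objects.
    // The first row is treated as the header; rows with fewer than 7 cells are skipped.
    static List<Loan> toLoans(List<List<String>> rows) {
        List<Loan> loans = new ArrayList<Loan>();
        for (int i = 1; i < rows.size(); i++) {
            List<String> cells = rows.get(i);
            if (cells.size() < 7) {
                continue; // not a data row in the expected layout
            }
            loans.add(new Loan(cells.get(0), cells.get(1), cells.get(2), cells.get(6)));
        }
        return loans;
    }

    public static void main(String[] args) {
        List<List<String>> rows = new ArrayList<List<String>>();
        rows.add(Arrays.asList("h0", "h1", "h2", "h3", "h4", "h5", "h6"));
        rows.add(Arrays.asList("full", "2013-04-10", "Refactoring", "", "book",
                "A2973844", "2012-12-05"));
        rows.add(Arrays.asList("unexpected", "row"));
        for (Loan loan : toLoans(rows)) {
            System.out.println(loan.title + " due " + loan.dueDate);
        }
    }
}
```

With Jsoup, each inner list would correspond to the text of the td elements of one tr.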
The final printed result is:
- 借书情况
-
- 编写高质量代码 [专著]:改善Java程序的151个建议=Writing solw Java cove:151 suggestons to improve your Java program/秦小波著
- 2012-12-05
- 2013-04-10
- 续满
- 疯狂Java [专著]:突破程序员基本功的16课/李刚编著
- 2012-12-05
- 2013-04-10
- 续满
- 程序员修炼之道 [专著]:从小工到专家=The pragmatic programmer:From journeyman to master:评注版/(美)Andrew Hunt,(美)David Thomas著;周爱民,蔡学镛评注
- 2012-11-22
- 2013-04-10
- 续满
- 重构:改善既有代码的设计=Refactoring:improving the design of existing code/(美)Martin Fowler著;熊节译
- 2012-11-22
- 2013-04-10
- 续满
- Android高薪之路 [专著]:Android程序员面试宝典/李宁编著
- 2012-11-29
- 2013-04-10
- 续满
- Android技术内幕 [专著]·系统卷=Android internals·System/杨丰盛著
- 2012-12-04
- 2013-04-10
- 续满
- 我编程, 我快乐 [专著]:程序员职业规划之道=The passionate programmer:creating a remarkable career in software development/(美) Chad Fowler著;于梦瑄译
- 2013-01-17
- 2013-04-17
- 续满
Complete code:
- package moniLogin;
-
- import java.io.IOException;
- import java.util.ArrayList;
- import java.util.HashMap;
- import java.util.List;
- import java.util.Set;
-
- import org.apache.http.HttpEntity;
- import org.apache.http.HttpResponse;
- import org.apache.http.NameValuePair;
- import org.apache.http.ParseException;
- import org.apache.http.client.ClientProtocolException;
- import org.apache.http.client.entity.UrlEncodedFormEntity;
- import org.apache.http.client.methods.HttpGet;
- import org.apache.http.client.methods.HttpPost;
- import org.apache.http.client.params.ClientPNames;
- import org.apache.http.impl.client.DefaultHttpClient;
- import org.apache.http.message.BasicNameValuePair;
- import org.apache.http.util.EntityUtils;
- import org.jsoup.Jsoup;
- import org.jsoup.nodes.Document;
- import org.jsoup.nodes.Element;
- import org.jsoup.select.Elements;
-
- public class test {
- private static String LoginUrl = "http://222.200.98.171:81/login.aspx";
- private static String Host = "http://222.200.98.171:81";
- private static String mainUrl = "";
- private static String borrowedBooksUrl = "";
- private static String cookie = "";
- private static String location = "";
-
- /**
- * @param args
- */
- public static void main(String[] args) {
- // TODO Auto-generated method stub
- getMyBorrowedBooks();
- }
-
- public static void getMyBorrowedBooks() {
- try {
- Document document = Jsoup.parse(login());
- Elements elements1 = document
- .getElementsContainingOwnText("当前借阅情况和续借");// find the <a> tag by its link text
- String url = elements1.first().attr("href");
- borrowedBooksUrl = mainUrl.substring(0,
- mainUrl.lastIndexOf("/") + 1) + url;// join the href onto the directory part of mainUrl to form the borrowing page URL
- getBookBorrowedData(getHtml(borrowedBooksUrl));
-
- } catch (IOException e) {
- // TODO Auto-generated catch block
- e.printStackTrace();
- }
- }
-
- /**
- * Gets the detailed borrowing data (List<BookEntity>).
- *
- * @param src
- * @return List<BookEntity>
- */
- private static List<BookEntity> getBookBorrowedData(String src) {
- List<BookEntity> data = new ArrayList<BookEntity>();
- Document document = Jsoup.parse(src);
- Element element = document.select("[id=borrowedcontent]").first()
- .getElementsByTag("table").first();
- Elements elements2 = element.getElementsByTag("tr");
- for (Element temp2 : elements2) {
- Elements elements3 = temp2.getElementsByTag("td");
- BookEntity entity = new test().new BookEntity()
- .setIsFullData(elements3.get(0).text())
- .setData2Return(elements3.get(1).text())
- .setName(elements3.get(2).text())
- .setData2Borrowed(elements3.get(6).text());
- data.add(entity);
-
- }
- data.remove(0); // the first row is presumably the table header, so drop it
- System.out.println("借书情况\n");
-
- for (BookEntity temp : data) {
- System.out.println(temp.getName() + "\n" + temp.getData2Borrowed()
- + "\n" + temp.getData2Return() + "\n"
- + temp.getIsFullData());
- }
- return data;
-
- }
-
- /**
- * Logs in to the library.
- *
- * @return the HTML of the page shown after login
- * @throws ClientProtocolException
- * @throws IOException
- */
- public static String login() throws ClientProtocolException, IOException {
- // Replace the second argument with your own password
- List<NameValuePair> parmasList = initLoginParmas("3110006527", "your-password");
- HttpPost post = new HttpPost(LoginUrl);
- // Disable automatic redirects so that the Cookie and Location headers of the first response can be read
- post.getParams().setParameter(ClientPNames.HANDLE_REDIRECTS, false);
- // The form is submitted in GBK encoding
- post.setHeader("Content-Type",
- "application/x-www-form-urlencoded;charset=gbk");
- post.setEntity(new UrlEncodedFormEntity(parmasList, "GBK"));
- HttpResponse response = new DefaultHttpClient().execute(post);
- // Save the cookie for later requests
- cookie = response.getFirstHeader("Set-Cookie").getValue();
- // System.out.println("cookie= " + cookie);
- // The redirect address, which leads to the home page
- location = response.getFirstHeader("Location").getValue();
- // Build the home page URL
- mainUrl = Host + location;
- String html = getHtml(mainUrl);
- return html;
-
- }
-
- /**
- * Fetches the HTML source of a page.
- *
- * @param url
- * @return
- * @throws ParseException
- * @throws IOException
- */
-
- private static String getHtml(String url) throws ParseException,
- IOException {
- HttpGet get = new HttpGet(url);
- // Compare string contents, not references; send the cookie once it has been obtained
- if (!cookie.isEmpty()) {
- get.addHeader("Cookie", cookie);
- }
- HttpResponse httpResponse = new DefaultHttpClient().execute(get);
- HttpEntity entity = httpResponse.getEntity();
- return EntityUtils.toString(entity);
- }
-
- /**
- * Initializes the login form parameters.
- *
- * @param userName
- * @param passWord
- * @return
- * @throws ParseException
- * @throws IOException
- */
- public static List<NameValuePair> initLoginParmas(String userName,
- String passWord) throws ParseException, IOException {
- List<NameValuePair> parmasList = new ArrayList<NameValuePair>();
- HashMap<String, String> parmasMap = getLoginFormData(LoginUrl);
- Set<String> keySet = parmasMap.keySet();
-
- for (String temp : keySet) {
- if (temp.contains("Username")) {
- parmasMap.put(temp, userName);
- } else if (temp.contains("txtPas")) {
- parmasMap.put(temp, passWord);
- }
- }
-
- Set<String> keySet2 = parmasMap.keySet();
- System.out.println("表单内容:");
- for (String temp : keySet2) {
- System.out.println(temp + " = " + parmasMap.get(temp));
- }
- for (String temp : keySet2) {
- parmasList.add(new BasicNameValuePair(temp, parmasMap.get(temp)));
- }
-
- // System.out.println("initParams \n" + parmasMap);
-
- return parmasList;
-
- }
-
- /**
- * Gets the contents of the login form's input fields.
- *
- * @param url
- * @return
- * @throws IOException
- * @throws ParseException
- */
- public static HashMap<String, String> getLoginFormData(String url)
- throws ParseException, IOException {
- Document document = Jsoup.parse(getHtml(url));
- Elements element1 = document.getElementsByTag("form");// find all form elements
- Element element = element1.select("[method=post]").first();// keep the first form that is submitted via POST
- Elements elements = element.select("input[name]");// collect the form's input tags that have a name attribute
- HashMap<String, String> parmas = new HashMap<String, String>();
- for (Element temp : elements) {
- parmas.put(temp.attr("name"), temp.attr("value"));// store each input's name and value in the map
- }
- return parmas;
- }
-
- class BookEntity {
- /**
- * Book title
- *
- */
- private String name;
- /**
- * Number of copies available to borrow
- */
- private String leandableNum;
- /**
- * Call number
- */
- private String callNumber;
- /**
- * Author
- */
- private String writer;
- /**
- * Publisher
- */
- private String publisher;
- /**
- * Return (due) date
- */
- private String data2Return;
- /**
- * Borrow date
- */
- private String data2Borrowed;
- /**
- * Whether the renewal limit has been reached
- */
- private String isFullData;
-
- public BookEntity() {
-
- }
-
- public String getName() {
- return name;
- }
-
- public String getLeandableNum() {
- return leandableNum;
- }
-
- public String getCallNumber() {
- return callNumber;
- }
-
- public String getWriter() {
- return writer;
- }
-
- public String getPublisher() {
- return publisher;
- }
-
- public BookEntity setName(String name) {
- this.name = name;
- return this;
- }
-
- public BookEntity setLeandableNum(String leandableNum) {
- this.leandableNum = leandableNum;
- return this;
- }
-
- public BookEntity setCallNumber(String callNumber) {
- this.callNumber = callNumber;
- return this;
- }
-
- public BookEntity setWriter(String writer) {
- this.writer = writer;
- return this;
- }
-
- public BookEntity setPublisher(String publisher) {
- this.publisher = publisher;
- return this;
- }
-
- public String getData2Return() {
- return data2Return;
- }
-
- public String getData2Borrowed() {
- return data2Borrowed;
- }
-
- public String getIsFullData() {
- return isFullData;
- }
-
- public BookEntity setData2Return(String data2Return) {
- this.data2Return = data2Return;
- return this;
- }
-
- public BookEntity setData2Borrowed(String data2Borrowed) {
- this.data2Borrowed = data2Borrowed;
- return this;
- }
-
- public BookEntity setIsFullData(String isFullData) {
- this.isFullData = isFullData;
- return this;
- }
-
- }
-
- }
This post does not go into the details of using Jsoup.
For more information, see this site: http://www.open-open.com/jsoup/