This example scrapes the posts made by thread starters on the Suqian Forum (宿迁论坛), for instance in its 关注宿迁 ("Focus on Suqian") board, extracting the title, post time, site name, poster, content (the post body), and thread URL.
The idea is simple: fetch the URL from Java to get the page's HTML as a string, then parse that string for the links and data we need.
Technically it combines Jsoup with HtmlUnit:
HtmlUnit fetches the page (project site: http://htmlunit.sourceforge.net/)
Jsoup parses the page and extracts the data and links (API docs: https://jsoup.org/apidocs/)
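With Maven, the three libraries used below (HtmlUnit, jsoup, and fastjson for the JSON output in the test code) can be declared roughly as follows. The version numbers here are assumptions, so use whatever matches your setup (note that BrowserVersion.FIREFOX_31, used in the sample code further down, only exists in older 2.x releases of HtmlUnit):

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.18</version> <!-- assumed; any 2.x release that still has FIREFOX_31 -->
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version> <!-- assumed -->
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.35</version> <!-- assumed -->
</dependency>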
Usage:
Step 1:
Create a WebClient object and fetch the page with it:
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setCssEnabled(false);
HtmlPage page = webClient.getPage("https://www.baidu.com"); // fetch the page
String pageContent = page.getTitleText(); // the page TITLE
pageContent = page.asXml(); // the page source serialized as XML
webClient.close(); // close the WebClient
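One caveat with the snippet above: if getPage() throws, webClient.close() is never reached. Wrapping the fetch in a small helper makes this safer. A sketch (PageFetcher is a hypothetical name, and the try-with-resources form assumes an HtmlUnit version in which WebClient implements AutoCloseable, roughly 2.22 and later; on older versions, close it in a finally block instead):

import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class PageFetcher {
    // Fetch a URL with HtmlUnit and return the resulting DOM serialized as XML.
    public static String fetch(String url) throws IOException {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(false);
            webClient.getOptions().setCssEnabled(false);
            HtmlPage page = webClient.getPage(url);
            return page.asXml();
        }
    }
}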
Step 2:
Use jsoup to parse the page content; its jQuery-like selector syntax makes DOM operations convenient:
Document doc = Jsoup.parse(pageContent);
String title = doc.title(); // the page title (title() already returns a String, so no toString() is needed)
Elements lg = doc.getElementById("lg").getElementsByTag("img"); // the img tags under the element with id "lg"
String picUrl = lg.attr("src"); // the image path
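jsoup also accepts CSS selectors directly through select(), which is often terser than chaining getElementById/getElementsByTag. The same lookup written null-safely (Element is org.jsoup.nodes.Element):

Element img = doc.select("#lg img").first();
String picUrl = (img != null) ? img.attr("src") : null; // null if the element is missing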
Example thread URLs:
1. http://www.sqee.cn/thread-884131-1-5.html
2. http://www.sqee.cn/thread-883835-1-1.html
Parse results:
1. {
"announceTime":"2017/6/20 11:38:40",
"announceUser":"4704545000",
"content":"我想请问大家 有谁知道 宿迁 实验小学黄河分校2017年啥时候开始招生啊?或者有招生办的电话望告知一声。谢谢大家!!!!",
"title":"关于小学上学问题 ",
"webSiteName":" 关注宿迁 宿迁论坛|鼎鼎有民|大宿网"
}
2. {
"announceTime":"2017/6/18 08:53:20",
"announceUser":"柠檬草",
"content":"三台山现在几点关门?听说现在营业时间延长了,具体是几点呢?有人知道吗?",
"title":"三台山现在几点关门? ",
"webSiteName":" 娱乐旅游 宿迁论坛|鼎鼎有民|大宿网"
}
Sample code:
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.UnknownHostException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import com.alibaba.fastjson.JSON;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class PartPoliticalInfoService {
    public static void main(String[] args) {
        // quick test
        System.out.println(JSON.toJSONString(getSqcInfo("http://www.sqee.cn/thread-883835-1-1.html")));
    }
    // Parse the thread starter's post details from a Suqian Forum thread (关注宿迁 board, etc.)
    public static PartPoliticalInfo getSqcInfo(String url) {
        PartPoliticalInfo politicalInfo;
        String pageContent, title = null, webSiteName = null, announceTime = null, announceUser = null, content = null;
        // step 1: fetch the page with HtmlUnit (JavaScript is disabled below, so scripts are not executed)
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_31);
        // HtmlUnit's CSS and JavaScript support is patchy, so turn both off
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);
        HtmlPage page;
        try {
            page = webClient.getPage(url);
            pageContent = page.asXml();
            // step 2: parse the page with jsoup, using its jQuery-like selector syntax
            Document doc = Jsoup.parse(pageContent);
            Elements authi = doc.getElementsByClass("authi"); // poster / post-time blocks
            Elements t_fsz = doc.getElementsByClass("t_fsz"); // post body container
            // The HTML <title> has the form "<post title> - <board> <forum name>",
            // so splitting on "-" yields the title and the site name (a post title
            // that itself contains "-" would break this simple split).
            String[] arrTitle = doc.title().split("\\-");
            if (arrTitle.length == 2) {
                title = arrTitle[0];
                webSiteName = arrTitle[1];
            }
            if (authi.size() >= 2) {
                // The first authi block holds the poster's profile link; the second
                // holds the post time, which appears either in a span's title
                // attribute or as plain text inside an em tag.
                Elements eAnnounceUser = authi.get(0).getElementsByTag("a");
                Elements eAnnounceTimeFormat1 = authi.get(1)
                        .getElementsByTag("span").get(0).getAllElements();
                Elements eAnnounceTimeFormat2 = authi.get(1).getElementsByTag("em");
                announceTime = !eAnnounceTimeFormat1.attr("title").equals("")
                        ? eAnnounceTimeFormat1.attr("title")
                        : eAnnounceTimeFormat2.text();
                announceUser = eAnnounceUser.text();
                // Keep only the trailing "date time" pair (the raw text looks like
                // "发表于 2017-6-18 08:53:20") and switch the date separators to slashes.
                String[] timeTotal = announceTime.split(" ");
                String dateStr = timeTotal[timeTotal.length - 2];
                String timeStr = timeTotal[timeTotal.length - 1];
                announceTime = dateStr.replaceAll("-", "/") + " " + timeStr;
            }
            if (t_fsz.size() >= 1) {
                // the post body lives in a td inside the first t_fsz block
                Elements eContent = t_fsz.get(0).getElementsByTag("td");
                content = eContent.text();
            }
        } catch (UnknownHostException e) {
            title = "UnknownHostException"; // flag DNS failures in the returned entity
        } catch (FailingHttpStatusCodeException e) {
            e.printStackTrace();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            webClient.close(); // release the WebClient even when fetching fails
        }
        politicalInfo = new PartPoliticalInfo(title, announceTime, webSiteName,
                announceUser, content);
        return politicalInfo;
    }
}
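Calling the parser for several threads is then just a loop; a sketch over the two example URLs from above:

String[] urls = {
        "http://www.sqee.cn/thread-884131-1-5.html",
        "http://www.sqee.cn/thread-883835-1-1.html"
};
for (String u : urls) {
    System.out.println(JSON.toJSONString(PartPoliticalInfoService.getSqcInfo(u)));
}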
Entity class:
public class PartPoliticalInfo {
    private String title; // post title
    private String announceTime; // post time
    private String webSiteName; // site name
    private String announceUser; // poster
    private String content; // post body
    private String url; // thread URL

    public PartPoliticalInfo() {
    }

    public PartPoliticalInfo(String title, String announceTime,
            String webSiteName, String announceUser, String content) {
        this.title = title;
        this.announceTime = announceTime;
        this.webSiteName = webSiteName;
        this.announceUser = announceUser;
        this.content = content;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getAnnounceTime() {
        return announceTime;
    }

    public void setAnnounceTime(String announceTime) {
        this.announceTime = announceTime;
    }

    public String getWebSiteName() {
        return webSiteName;
    }

    public void setWebSiteName(String webSiteName) {
        this.webSiteName = webSiteName;
    }

    public String getAnnounceUser() {
        return announceUser;
    }

    public void setAnnounceUser(String announceUser) {
        this.announceUser = announceUser;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }
}
Summary:
This approach extracts data from the page at any given URL.
Because it scrapes specific pages, the developer needs some ability to analyze the target page's HTML source (to find the right classes, tags, and attributes to select on).
Limitations:
Jsoup cannot see data produced by JavaScript after the page loads, and HtmlUnit's support for it is not great either (though that may just be down to how I used it).
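If JS-rendered content is needed, one option worth trying is to leave JavaScript enabled in HtmlUnit and give background scripts time to finish before reading the DOM. A minimal sketch, assuming the target site's scripts actually run under HtmlUnit (results vary a lot from site to site), again using the AutoCloseable form of WebClient:

try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setCssEnabled(false);
    // real-world pages often have script errors; don't abort on them
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    HtmlPage page = webClient.getPage(url);
    webClient.waitForBackgroundJavaScript(5000); // wait up to 5s for async scripts
    String rendered = page.asXml(); // the DOM after the scripts have run
}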