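A Simple Crawler with HtmlUnit + Jsoup for Fetching Web Page Data

This example extracts data from the original poster's post in threads on 宿迁论坛 (the Suqian forum), in boards such as 关注宿迁: title, post time, site name, author, content (the post body), and access link.

The idea is to request the link from Java, obtain the HTML as a string, and then parse out the links and other data that are needed.

On the technical side it uses Jsoup + HtmlUnit:

HtmlUnit fetches the web page (official site: http://htmlunit.sourceforge.net/)

Jsoup parses the page and extracts the data and links (documentation: https://jsoup.org/apidocs/)

Usage:

Step 1: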

Create a WebClient object and use it to fetch the page:

       WebClient webClient = new WebClient(BrowserVersion.CHROME);
       webClient.getOptions().setJavaScriptEnabled(false); // do not execute JavaScript
       webClient.getOptions().setCssEnabled(false); // do not process CSS
       HtmlPage page = webClient.getPage("https://www.baidu.com"); // fetch the page
       String pageTitle = page.getTitleText(); // get the page title
       String pageContent = page.asXml(); // get the page as an XML string
       webClient.close(); // close the WebClient

Step 2:

Use jsoup to parse the page content; its jQuery-like syntax makes DOM operations convenient.

       Document doc = Jsoup.parse(pageContent);
       String title = doc.title(); // get the page title
       Elements lg = doc.getElementById("lg").getElementsByTag("img"); // get the img elements under the element with id "lg"
       String picUrl = lg.attr("src"); // get the image path (src of the first matching img)
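
The jQuery-like syntax mentioned in step 2 refers to jsoup's CSS selectors. The following is a minimal sketch of that style, not the code used in this article: the class name JsoupSelectorSketch and the inline HTML are hypothetical (the markup only loosely imitates a forum post), and selectFirst assumes jsoup 1.11.1 or later.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupSelectorSketch {

    // Hypothetical markup for illustration only; not copied from any real page.
    static final String HTML =
            "<html><head><title>Sample post - Sample forum</title></head><body>"
            + "<div class=\"authi\"><a href=\"/user/1\">alice</a></div>"
            + "<div class=\"authi\"><em>Posted on "
            + "<span title=\"2017-6-20 11:38:40\">3 days ago</span></em></div>"
            + "<div class=\"t_fsz\"><table><tr><td>Hello from the first post.</td></tr></table></div>"
            + "</body></html>";

    // Text of the first link inside an element with class "authi", or null if absent.
    static String author(Document doc) {
        Element a = doc.selectFirst("div.authi a");
        return a == null ? null : a.text();
    }

    // "title" attribute of the first span that carries one, or null if absent.
    static String postedAt(Document doc) {
        Element span = doc.selectFirst("div.authi span[title]");
        return span == null ? null : span.attr("title");
    }

    // Text of the first table cell inside the element with class "t_fsz", or null if absent.
    static String body(Document doc) {
        Element td = doc.selectFirst("div.t_fsz td");
        return td == null ? null : td.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(author(doc));
        System.out.println(postedAt(doc));
        System.out.println(body(doc));
    }
}
```

Because selectFirst returns null when nothing matches, each helper checks for null instead of chaining calls, which avoids a NullPointerException when a page does not have the expected structure.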

Sample URLs to parse:

1. http://www.sqee.cn/thread-884131-1-5.html

2. http://www.sqee.cn/thread-883835-1-1.html

Parsed results:

1. {
    "announceTime":"2017/6/20 11:38:40",
    "announceUser":"4704545000",
    "content":"我想请问大家有谁知道宿迁实验小学黄河分校2017年啥时候开始招生啊？或者有招生办的电话望告知一声。谢谢大家！！！！",
    "title":"关于小学上学问题 ",
    "webSiteName":" 关注宿迁宿迁论坛|鼎鼎有民|大宿网"
}

2. {
    "announceTime":"2017/6/18 08:53:20",
    "announceUser":"柠檬草",
    "content":"三台山现在几点关门？听说现在营业时间延长了，具体是几点呢？有人知道吗？",
    "title":"三台山现在几点关门？ ",
    "webSiteName":" 娱乐旅游宿迁论坛|鼎鼎有民|大宿网"
}

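Sample code:

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.UnknownHostException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Assumption: JSON.toJSONString below is fastjson's com.alibaba.fastjson.JSON
import com.alibaba.fastjson.JSON;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class PartPoliticalInfoService {
   public static void main(String[] args) {
       // Test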
       System.out
               .println(JSON
                       .toJSONString(getSqcInfo("http://www.sqee.cn/thread-883835-1-1.html")));
   }

   // Parse the information posted by the thread starter in 宿迁论坛 boards such as 关注宿迁
   public static PartPoliticalInfo getSqcInfo(String url) {
       PartPoliticalInfo politicalInfo;
       String pageContent, title = null, webSiteName = null, announceTime = null, announceUser = null, content = null;

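       // step 1: use HtmlUnit to fetch the content of the page at the given URL.
       // try-with-resources closes the WebClient even when the request fails.
       try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_31)) {
           // HtmlUnit's CSS and JavaScript support is limited, so both are disabled;
           // scripts in the page are therefore not executed
           webClient.getOptions().setJavaScriptEnabled(false);
           webClient.getOptions().setCssEnabled(false);
           HtmlPage page = webClient.getPage(url);
           pageContent = page.asXml();
           // step 2: use jsoup to parse the page content; its jQuery-like syntax makes DOM operations convenient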
           Document doc = Jsoup.parse(pageContent);
           Elements authi = doc.getElementsByClass("authi");
           Elements t_fsz = doc.getElementsByClass("t_fsz");
           String arrTitle[] = doc.title().toString().split("\\-");

           if (arrTitle.length == 2) {
               title = arrTitle[0];
               webSiteName = arrTitle[1];
           }

           if (authi.size() >= 2) {
               Elements eAnnounceUser = authi.get(0).getElementsByTag("a");
               Elements eAnnounceTimeFormat1 = authi.get(1)
                       .getElementsByTag("span").get(0).getAllElements();
               Elements eAnnounceTimeFormat2 = authi.get(1).getElementsByTag(
                       "em");
               announceTime = !eAnnounceTimeFormat1.attr("title").equals("")
                       ? eAnnounceTimeFormat1.attr("title")
                       : eAnnounceTimeFormat2.text();
               announceUser = eAnnounceUser.text();
               String timeTotal[] = announceTime.split(" ");
               String dateStr = timeTotal[timeTotal.length - 2];
               String timeStr = timeTotal[timeTotal.length - 1];
               announceTime = dateStr.replaceAll("-", "/") + " " + timeStr;
           }

           if (t_fsz.size() >= 1) {
               Elements eContent = t_fsz.get(0).getElementsByTag("td");
               content = eContent.text();
           }

       } catch (UnknownHostException e) {
           title = "UnknownHostException";
       } catch (FailingHttpStatusCodeException e) {
           e.printStackTrace();
       } catch (MalformedURLException e) {
           e.printStackTrace();
       } catch (IOException e) {
           e.printStackTrace();
       }

       politicalInfo = new PartPoliticalInfo(title, announceTime, webSiteName,
               announceUser, content);
       return politicalInfo;
   }
}
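
The title and timestamp handling inside getSqcInfo can also be looked at in isolation. The following is a minimal sketch using only the JDK; the class name PostFieldSketch and both helper names are hypothetical. Unlike the code above, it splits the title on the last hyphen (so titles that themselves contain a hyphen are still handled) and returns null on unexpected input instead of throwing.

```java
class PostFieldSketch {

    // Splits a page title of the form "post title - site name" on the LAST hyphen.
    // Returns {title, siteName}, both trimmed, or null if there is no hyphen.
    static String[] splitTitle(String pageTitle) {
        if (pageTitle == null) {
            return null;
        }
        int idx = pageTitle.lastIndexOf('-');
        if (idx < 0) {
            return null;
        }
        return new String[] {
            pageTitle.substring(0, idx).trim(),
            pageTitle.substring(idx + 1).trim()
        };
    }

    // Keeps the last two whitespace-separated tokens (date and time) and
    // rewrites the date from "2017-6-20" to "2017/6/20".
    // Returns null if fewer than two tokens are present.
    static String normalizeTime(String raw) {
        if (raw == null) {
            return null;
        }
        String[] parts = raw.trim().split("\\s+");
        if (parts.length < 2) {
            return null;
        }
        String date = parts[parts.length - 2];
        String time = parts[parts.length - 1];
        return date.replace('-', '/') + " " + time;
    }

    public static void main(String[] args) {
        String[] t = splitTitle("A-B question - Sample forum");
        System.out.println(t[0] + " | " + t[1]);
        System.out.println(normalizeTime("Posted on 2017-6-20 11:38:40"));
    }
}
```

Returning null for malformed input leaves it to the caller to decide whether a missing field should be skipped or reported, rather than failing with an ArrayIndexOutOfBoundsException in the middle of a crawl.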

Entity class:

public class PartPoliticalInfo {
   private String title; // title
   private String announceTime; // post time
   private String webSiteName; // site name
   private String announceUser; // author
   private String content; // content (the post body)
   private String url; // access link

   public PartPoliticalInfo() {
       super();
   }

   public PartPoliticalInfo(String title, String announceTime,
           String webSiteName, String announceUser, String content) {
       super();
       this.title = title;
       this.announceTime = announceTime;
       this.webSiteName = webSiteName;
       this.announceUser = announceUser;
       this.content = content;
   }

   public String getTitle() {
       return title;
   }

   public void setTitle(String title) {
       this.title = title;
   }

   public String getAnnounceTime() {
       return announceTime;
   }

   public void setAnnounceTime(String announceTime) {
       this.announceTime = announceTime;
   }

   public String getWebSiteName() {
       return webSiteName;
   }

   public void setWebSiteName(String webSiteName) {
       this.webSiteName = webSiteName;
   }

   public String getAnnounceUser() {
       return announceUser;
   }

   public void setAnnounceUser(String announceUser) {
       this.announceUser = announceUser;
   }

   public String getContent() {
       return content;
   }

   public void setContent(String content) {
       this.content = content;
   }

   public String getUrl() {
       return url;
   }

   public void setUrl(String url) {
       this.url = url;
   }

}
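
Summary:

This example shows how to fetch data from the web page at a specified URL.

Because it crawls data from specific pages, the developer needs to be able to analyze the page's source code.

Drawbacks:

Jsoup only parses the HTML it is given, so it cannot capture data that is produced by JavaScript execution. HtmlUnit's JavaScript support did not work well either (though this may be because of the way I used it).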
