Sometimes you need to collect information from the web. When the information can only be gathered in one way and collecting it by hand would take too much time and effort — for example, counting how many posts a site publishes each month and which tags it uses, gathering a corpus for a natural language processing project, or collecting images for a pattern recognition project — a crawler program is the right tool for the job. A web crawler is also an essential component of any search engine.
Many web crawlers are implemented in Python, Java, or C#. The one presented here is a Java version. To save time and space, I restrict the program to scanning only pages under this blog's address (that is, http://johnhany.net/, excluding everything under http://johnhany.net/wp-content/) and to extracting from the URLs all the tags used on the site. With minor changes — removing the restrictions in the code — it can be used to scan the whole web; with small changes to the output format, it can serve as a tool for generating the blog's sitemap.
The code can also be downloaded here: johnhany/WPCrawler.
Requirements
My development environment is Windows 7 + Eclipse.
XAMPP is needed to provide the port through which the MySQL database is accessed.
Three open-source Java libraries are also used:
Apache HttpComponents 4.3 — provides the HTTP interface used to send HTTP requests to target URLs and fetch page content;
HTML Parser 2.0 — parses the pages and extracts links from the DOM nodes;
MySQL Connector/J 5.1.27 — connects the Java program to MySQL so the database can be manipulated from Java code.
Code
The code lives in three files: crawler.java, httpGet.java, and parsePage.java. The package name is net.johnhany.wpcrawler.
crawler.java
package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class crawler {

    public static void main(String args[]) throws Exception {
        String frontpage = "http://johnhany.net/";
        Connection conn = null;

        //connect the MySQL database
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
            conn = DriverManager.getConnection(dburl, "root", "");
            System.out.println("connection built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }

        String sql = null;
        String url = frontpage;
        Statement stmt = null;
        ResultSet rs = null;
        int count = 0;

        if (conn != null) {
            //create database and table that will be needed
            try {
                sql = "CREATE DATABASE IF NOT EXISTS crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "USE crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
            } catch (SQLException e) {
                e.printStackTrace();
            }

            //crawl every link in the database
            while (true) {
                //get page content of link "url"
                httpGet.getByString(url, conn);
                count++;

                //set boolean value "crawled" to true after crawling this page
                sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
                stmt = conn.createStatement();

                if (stmt.executeUpdate(sql) > 0) {
                    //get the next page that has not been crawled yet
                    sql = "SELECT * FROM record WHERE crawled = 0";
                    stmt = conn.createStatement();
                    rs = stmt.executeQuery(sql);
                    if (rs.next()) {
                        url = rs.getString(2);
                    } else {
                        //stop crawling if reach the bottom of the list
                        break;
                    }

                    //set a limit of crawling count
                    if (count > 1000 || url == null) {
                        break;
                    }
                }
            }
            conn.close();
            conn = null;
            System.out.println("Done.");
            System.out.println(count);
        }
    }
}
httpGet.java
package net.johnhany.wpcrawler;

import java.io.IOException;
import java.sql.Connection;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class httpGet {

    public final static void getByString(String url, Connection conn) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();

        try {
            HttpGet httpget = new HttpGet(url);
            System.out.println("executing request " + httpget.getURI());

            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
                public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };
            String responseBody = httpclient.execute(httpget, responseHandler);
            /*
            //print the content of the page
            System.out.println("----------------------------------------");
            System.out.println(responseBody);
            System.out.println("----------------------------------------");
            */
            parsePage.parseFromString(responseBody, conn);
        } finally {
            httpclient.close();
        }
    }
}
parsePage.java
package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.net.URLDecoder;

public class parsePage {

    public static void parseFromString(String content, Connection conn) throws Exception {
        Parser parser = new Parser(content);
        HasAttributeFilter filter = new HasAttributeFilter("href");

        try {
            NodeList list = parser.parse(filter);
            int count = list.size();

            //process every link on this page
            for (int i = 0; i < count; i++) {
                Node node = list.elementAt(i);

                if (node instanceof LinkTag) {
                    LinkTag link = (LinkTag) node;
                    String nextlink = link.extractLink();
                    String mainurl = "http://johnhany.net/";
                    String wpurl = mainurl + "wp-content/";

                    //only save page from "http://johnhany.net"
                    if (nextlink.startsWith(mainurl)) {
                        String sql = null;
                        ResultSet rs = null;
                        PreparedStatement pstmt = null;
                        Statement stmt = null;
                        String tag = null;

                        //do not save any page from "wp-content"
                        if (nextlink.startsWith(wpurl)) {
                            continue;
                        }

                        try {
                            //check if the link already exists in the database
                            sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
                            stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
                            rs = stmt.executeQuery(sql);

                            if (rs.next()) {
                                //link already recorded; do nothing
                            } else {
                                //if the link does not exist in the database, insert it
                                sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
                                pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                pstmt.execute();
                                System.out.println(nextlink);

                                //use substring for better comparison performance
                                nextlink = nextlink.substring(mainurl.length());
                                //System.out.println(nextlink);

                                if (nextlink.startsWith("tag/")) {
                                    tag = nextlink.substring(4, nextlink.length() - 1);
                                    //decode in UTF-8 for Chinese characters
                                    tag = URLDecoder.decode(tag, "UTF-8");
                                    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
                                    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                    //if the links are different from each other, the tags must be different
                                    //so there is no need to check if the tag already exists
                                    pstmt.execute();
                                }
                            }
                        } catch (SQLException e) {
                            //handle the exceptions
                            System.out.println("SQLException: " + e.getMessage());
                            System.out.println("SQLState: " + e.getSQLState());
                            System.out.println("VendorError: " + e.getErrorCode());
                        } finally {
                            //close and release the resources of PreparedStatement, ResultSet and Statement
                            if (pstmt != null) {
                                try {
                                    pstmt.close();
                                } catch (SQLException e2) {}
                            }
                            pstmt = null;

                            if (rs != null) {
                                try {
                                    rs.close();
                                } catch (SQLException e1) {}
                            }
                            rs = null;

                            if (stmt != null) {
                                try {
                                    stmt.close();
                                } catch (SQLException e3) {}
                            }
                            stmt = null;
                        }
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
How the Program Works
The internet is, as the name suggests, a net: a path may exist between any two nodes. From a graph-theory point of view, a crawler's scan of the web is a traversal of a directed graph (a link points from one page to another, so the edges are directed). The two common traversal orders are depth-first and breadth-first; the relevant theory can be found in any reference on tree and graph traversal. My program uses the breadth-first approach.
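To illustrate the breadth-first idea before walking through the actual code, here is a minimal in-memory sketch of breadth-first link traversal. The fetchLinks() helper is hypothetical and stands in for the HTTP and parsing steps described below; the real program keeps its queue in the MySQL table record instead of in memory.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    //hypothetical helper: download "url" and return the links found on that page
    static List<String> fetchLinks(String url) {
        return new ArrayList<String>();
    }

    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<String>();  //URLs waiting to be crawled
        Set<String> seen = new HashSet<String>();           //URLs already queued, to avoid revisiting
        frontier.add("http://johnhany.net/");
        seen.add("http://johnhany.net/");

        while (!frontier.isEmpty()) {
            String url = frontier.poll();                   //oldest URL first = breadth-first order
            for (String next : fetchLinks(url)) {
                if (seen.add(next)) {                       //add() returns false if the URL was already seen
                    frontier.add(next);
                }
            }
        }
    }
}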
The program starts from main() in crawler.java.
Class.forName("com.mysql.jdbc.Driver");
String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
conn = DriverManager.getConnection(dburl, "root", "");
System.out.println("connection built");
First, DriverManager is used to connect to the MySQL service. The default XAMPP MySQL port 3306 is used here; the port value can be seen on the XAMPP control panel.
Once Apache and MySQL are both running, enter "http://localhost/phpmyadmin/" in the browser's address bar to see the database. After the program has finished you can check there whether it ran correctly.
sql = "CREATE DATABASE IF NOT EXISTS crawler" ;
stmt = conn.createStatement();
stmt.executeUpdate(sql);
sql = "USE crawler" ;
stmt = conn.createStatement();
stmt.executeUpdate(sql);
sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8" ;
stmt = conn.createStatement();
stmt.executeUpdate(sql);
sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8" ;
stmt = conn.createStatement();
stmt.executeUpdate(sql);
|
After connecting, the program creates a database named "crawler" and two tables in it. One table, "record", has the fields "recordID", "URL", and "crawled", which hold the URL's ID, the link itself, and whether the URL has been scanned yet. The other, "tags", has the fields "tagnum" and "tagname", which hold the tag's ID and the tag's name.
while (true) {
    httpGet.getByString(url, conn);
    count++;

    sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
    stmt = conn.createStatement();

    if (stmt.executeUpdate(sql) > 0) {
        sql = "SELECT * FROM record WHERE crawled = 0";
        stmt = conn.createStatement();
        rs = stmt.executeQuery(sql);
        if (rs.next()) {
            url = rs.getString(2);
        } else {
            break;
        }
    }
}
A while loop then processes the URLs in table record one by one. In each iteration, the URL url is passed to httpGet.getByString(), after which crawled is set to true in table record to mark that URL as processed. The program then looks for the next URL whose crawled value is still false and continues until it reaches the end of the table.
One detail worth noting: after executeQuery() you get a ResultSet rs, which holds all the rows returned by the SQL query together with a cursor that initially points just before the first row. You must call rs.next() once to move the cursor to the first result (it returns true). Each further call to rs.next() moves the cursor to the next result and returns true, until no results remain, at which point rs.next() returns false.
Another detail: statements that create databases or tables, or that INSERT or UPDATE rows, must be run with executeUpdate(); SELECT queries must be run with executeQuery(). executeQuery() always returns a ResultSet, while executeUpdate() returns the number of rows affected.
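A minimal sketch of the two calls and the cursor behaviour just described, assuming an open Connection conn to the crawler database created above:

static void demoStatements(Connection conn) throws SQLException {
    Statement stmt = conn.createStatement();

    //executeUpdate(): for CREATE/INSERT/UPDATE; returns the number of rows affected
    int affected = stmt.executeUpdate("UPDATE record SET crawled = 1 WHERE recordID = 1");

    //executeQuery(): for SELECT; returns a ResultSet whose cursor starts before the first row
    ResultSet rs = stmt.executeQuery("SELECT recordID, URL FROM record WHERE crawled = 0");
    while (rs.next()) {                  //each call advances the cursor; returns false when no rows remain
        int id = rs.getInt(1);           //columns are indexed from 1
        String url = rs.getString(2);
        System.out.println(id + "\t" + url);
    }
    rs.close();
    stmt.close();
}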
The getByString() method in httpGet.java sends a request to the given URL and downloads the page content.
HttpGet httpget = new HttpGet(url);
System.out.println("executing request " + httpget.getURI());

ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
    public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? EntityUtils.toString(entity) : null;
        } else {
            throw new ClientProtocolException("Unexpected response status: " + status);
        }
    }
};
String responseBody = httpclient.execute(httpget, responseHandler);
This code is taken from the sample in the HTTP Client component of HttpComponents and can be used as-is in many situations. It produces a string, responseBody, holding the entire content of the page.
responseBody is then passed to the parseFromString() method in parsePage.java to extract the links.
Parser parser = new Parser(content);
HasAttributeFilter filter = new HasAttributeFilter("href");

try {
    NodeList list = parser.parse(filter);
    int count = list.size();

    //process every link on this page
    for (int i = 0; i < count; i++) {
        Node node = list.elementAt(i);

        if (node instanceof LinkTag) {
In an HTML document, links normally live in the href attribute of <a> tags, so an attribute filter is created. The NodeList holds all the DOM nodes of the HTML document; by processing each node in turn inside the for loop and looking for the tags that match, all the links on the page can be extracted.
nextlink.startsWith() then filters further: only links beginning with "http://johnhany.net/" are processed, and links beginning with "http://johnhany.net/wp-content/" are skipped.
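A side note on the filtering step: HTML Parser can also select <a> tags directly with a NodeClassFilter, which removes the need for the instanceof check. A minimal sketch, assuming the same org.htmlparser classes used above plus org.htmlparser.filters.NodeClassFilter:

Parser parser = new Parser(content);
NodeList links = parser.parse(new NodeClassFilter(LinkTag.class));  //match only <a> nodes
for (int i = 0; i < links.size(); i++) {
    LinkTag link = (LinkTag) links.elementAt(i);   //safe cast: the filter only returned LinkTag nodes
    String nextlink = link.extractLink();
    //apply the same startsWith() checks as in parsePage.parseFromString()
}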
sql = "SELECT * FROM record WHERE URL = '"
+ nextlink + "'" ;
stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,ResultSet.CONCUR_UPDATABLE);
rs = stmt.executeQuery(sql);
if (rs.next()) {
} else
{
//if the link does not exist in the database, insert it
sql = "INSERT INTO record (URL, crawled) VALUES ('"
+ nextlink + "',0)" ;
pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
pstmt.execute();
|
The program checks whether the link already exists in table record. If it does (rs.next() == true), nothing is done; if it does not (rs.next() == false), the URL is inserted into the table with crawled set to false. Because recordID was declared AUTO_INCREMENT, Statement.RETURN_GENERATED_KEYS is used so the proper ID is generated.
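As an aside, the same insert can be written with a parameter placeholder, which keeps quotes in a URL from breaking the SQL, and the generated recordID can be read back with getGeneratedKeys(). A minimal sketch, not the exact code used above:

String sql = "INSERT INTO record (URL, crawled) VALUES (?, 0)";
PreparedStatement pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
pstmt.setString(1, nextlink);            //the driver escapes the value for us
pstmt.executeUpdate();

ResultSet keys = pstmt.getGeneratedKeys();
if (keys.next()) {
    int recordID = keys.getInt(1);       //the AUTO_INCREMENT value assigned by MySQL
    System.out.println("inserted record " + recordID);
}
keys.close();
pstmt.close();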
nextlink = nextlink.substring(mainurl.length());
if (nextlink.startsWith("tag/")) {
    tag = nextlink.substring(4, nextlink.length() - 1);
    tag = URLDecoder.decode(tag, "UTF-8");
    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
Stripping the leading "http://johnhany.net/" from the link speeds up the string comparisons. If the remainder starts with "tag/", the characters that follow are a tag name: the name is extracted, decoded as UTF-8 so that Chinese characters display correctly, and stored in table tags. In the same way, checks for "article/", "author/", "2013/11/" and the like could be added to classify the other links (see the sketch below).
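A sketch of how that classification might look once the site prefix has been stripped; the branches other than "tag/" are illustrative additions, not part of the original program:

//nextlink has already had "http://johnhany.net/" removed, as in parsePage.parseFromString()
if (nextlink.startsWith("tag/")) {
    //tag page: decode and store the tag name, as above
} else if (nextlink.startsWith("article/")) {
    //single article page
} else if (nextlink.startsWith("author/")) {
    //author archive page
} else if (nextlink.matches("\\d{4}/\\d{2}/.*")) {
    //date archive such as "2013/11/"
} else {
    //anything else: front page, about page, etc.
}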
Results
Here are two screenshots of the database, showing part of the program's results:
The full output can be obtained here. You can compare it with the blog's sitemap to see what further changes would be needed to build a sitemap generator on top of the program.
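For that sitemap idea, one possible extra step is to read the crawled URLs back out of table record and print them in sitemap XML format. This is only a sketch: the writeSitemap() helper, the file name, and the <urlset> boilerplate are assumptions, not part of the original program (it needs java.io.PrintWriter in addition to the java.sql imports already used).

static void writeSitemap(Connection conn) throws Exception {
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT URL FROM record WHERE crawled = 1");

    PrintWriter out = new PrintWriter("sitemap.xml", "UTF-8");
    out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
    out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
    while (rs.next()) {
        out.println("  <url><loc>" + rs.getString(1) + "</loc></url>");
    }
    out.println("</urlset>");
    out.close();

    rs.close();
    stmt.close();
}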
Source: http://johnhany.net/2013/11/web-crawler-using-java-and-mysql/