Crawl a website's content, build an index, and run searches over it

The example below crawls 51job.

Crawling and search are popular features these days, so I wanted to write up this small project. I hope it helps you later on.

To store the crawled data we first need a matching database table. I wrote a POJO from which you can create the table; the database is MySQL. Everything below uses 51job as the example.

The POJO, with constructors for initialization:

public class Jobcrawl implements java.io.Serializable {
    // primary key
    private String toid;
    // job title
    private String jobname;
    // company name
    private String companyname;
    // company type
    private String comtype;
    // publish date of the posting
    private String publishtime;
    // work location
    private String place;
    // number of openings
    private Integer requirecount;
    // required years of experience
    private String workyear;
    // required education
    private String qualifications;
    // job description
    private String jobcontent;
    // URL of the company's detail page (on 51job)
    private String url;
    // company industry
    private String industry;
    // company scale
    private String comscale;
    public String getComscale() {
        return comscale;
    }
    public void setComscale(String comscale) {
        this.comscale = comscale;
    }
    public String getIndustry() {
        return industry;
    }
    public void setIndustry(String industry) {
        this.industry = industry;
    }
    /** default constructor */
    public Jobcrawl() {
    }
    /** minimal constructor */
    public Jobcrawl(String toid) {
        this.toid = toid;
    }
    /** full constructor */
    public Jobcrawl(String toid, String jobname, String companyname, String comtype, String publishtime, String place,
            Integer requirecount, String workyear, String qualifications,
            String jobcontent, String url) {
        this.toid = toid;
        this.jobname = jobname;
        this.companyname = companyname;
        this.comtype = comtype;
        this.publishtime = publishtime;
        this.place = place;
        this.requirecount = requirecount;
        this.workyear = workyear;
        this.qualifications = qualifications;
        this.jobcontent = jobcontent;
        this.url = url;
    }
    public String getToid() {
        return this.toid;
    }
    public void setToid(String toid) {
        this.toid = toid;
    }
    public String getJobname() {
        return this.jobname;
    }
    public void setJobname(String jobname) {
        this.jobname = jobname;
    }
    public String getCompanyname() {
        return this.companyname;
    }
    public void setCompanyname(String companyname) {
        this.companyname = companyname;
    }
    public String getComtype() {
        return this.comtype;
    }
    public void setComtype(String comtype) {
        this.comtype = comtype;
    }
    public String getPublishtime() {
        return this.publishtime;
    }
    public void setPublishtime(String publishtime) {
        this.publishtime = publishtime;
    }
    public String getPlace() {
        return this.place;
    }
    public void setPlace(String place) {
        this.place = place;
    }
    public Integer getRequirecount() {
        return this.requirecount;
    }
    public void setRequirecount(Integer requirecount) {
        this.requirecount = requirecount;
    }
    public String getWorkyear() {
        return this.workyear;
    }
    public void setWorkyear(String workyear) {
        this.workyear = workyear;
    }
    public String getQualifications() {
        return this.qualifications;
    }
    public void setQualifications(String qualifications) {
        this.qualifications = qualifications;
    }
    public String getJobcontent() {
        return this.jobcontent;
    }
    public void setJobcontent(String jobcontent) {
        this.jobcontent = jobcontent;
    }
    public String getUrl() {
        return this.url;
    }
    public void setUrl(String url) {
        this.url = url;
    }
}
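
For reference, here is one possible MySQL DDL derived from the POJO above. The table and column names mirror the fields, but the types and lengths are my assumptions, not a schema from the original project:

CREATE TABLE jobcrawl (
    toid           VARCHAR(32) NOT NULL PRIMARY KEY, -- primary key
    jobname        VARCHAR(255),  -- job title
    companyname    VARCHAR(255),  -- company name
    comtype        VARCHAR(64),   -- company type
    publishtime    VARCHAR(32),   -- publish date
    place          VARCHAR(128),  -- work location
    requirecount   INT,           -- number of openings
    workyear       VARCHAR(32),   -- required experience
    qualifications VARCHAR(64),   -- required education
    jobcontent     TEXT,          -- job description
    url            VARCHAR(512),  -- detail page URL on 51job
    industry       VARCHAR(128),  -- company industry
    comscale       VARCHAR(64)    -- company scale
) DEFAULT CHARSET = utf8;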

To crawl, we fetch the site's URLs and run a search on the site itself:

1. Pick what you want to crawl. For example, to crawl Java positions in Shanghai, you read the URL of that search-results page.

2. Click into a specific position and analyze the page source to find what identifies the fields you want: attributes, elements, ids, tag names. Parse them out with jsoup and store the extracted values in the database.

3. Once the data is stored, parse the database fields with a tokenizing analyzer and build an index over them. You can also index all of the fields together (which is what I did).

4. Type in a keyword; any record whose fields contain that word is returned, with the keyword highlighted (highlighting is an extra feature I added).

With the analysis done, let's implement it.

First, crawl the site.

Define the interface:

public interface CrawlService {
    public void doCrawl()throws Exception;
}

Implement the interface:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.springframework.orm.hibernate3.support.HibernateDaoSupport;
import pojo.Jobcrawl;

public class CrawlServiceImpl extends HibernateDaoSupport implements CrawlService {

    @Override
    public void doCrawl() throws Exception {
        HttpClient httpClient = new HttpClient();
        GetMethod getMethod = new GetMethod("http://search.51job.com/list/%2B,%2B,%2B,%2B,%2B,%2B,java,2,%2B.html?lang=c&stype=1");
        httpClient.executeMethod(getMethod);
        String html = getMethod.getResponseBodyAsString();
        // the site serves GB2312; re-decode the raw ISO-8859-1 bytes
        html = new String(html.getBytes("iso8859-1"), "gb2312");

        Document doc = Jsoup.parse(html);
        Element totalCountEle = doc.select("table.navBold").select("td").get(1);
        String totalCountStr = totalCountEle.text();
        totalCountStr = totalCountStr.split("/")[1].trim();
        int totalCount = Integer.parseInt(totalCountStr);
        // total number of pages, 30 results per page (rounded up)
        int pageCount = (totalCount + 29) / 30;
        // only the first four pages are crawled here; raise the bound to pageCount to crawl everything
        for (int currentPage = 1; currentPage < 5; currentPage++) {
            GetMethod gmPerPage = new GetMethod("http://search.51job.com/jobsearch/search_result.php?curr_page=" + currentPage + "&&keyword=java");
            httpClient.executeMethod(gmPerPage);
            String perPageHtml = gmPerPage.getResponseBodyAsString();
            perPageHtml = new String(perPageHtml.getBytes("iso8859-1"), "gb2312");
            Document pageDoc = Jsoup.parse(perPageHtml);
            Elements eles = pageDoc.select("a.jobname");
            for (int i = 0; i < eles.size(); i++) {
                Element ele = eles.get(i);
                // URL of the detail page
                String detailUrl = ele.attr("href");
                GetMethod detailGet = new GetMethod(detailUrl);
                httpClient.executeMethod(detailGet);
                String detailHtml = detailGet.getResponseBodyAsString();
                detailHtml = new String(detailHtml.getBytes("iso8859-1"), "gb2312");
                Document detailDoc = Jsoup.parse(detailHtml);

                // job title
                Elements detailEles = detailDoc.select("td.sr_bt");
                Element jobnameEle = detailEles.get(0);
                String jobname = jobnameEle.text();
                System.out.println("Job title: " + jobname);

                // company name
                Elements companyEles = detailDoc.select("table.jobs_1");
                Element companyEle = companyEles.get(0);
                Element companyEle_Rel = companyEle.select("a").get(0);
                String companyName = companyEle_Rel.text();
                System.out.println("Company: " + companyName);

                // company industry; the :contains() arguments stay in Chinese
                // because they must match the labels on the 51job pages
                Elements comp_industry = detailDoc.select("strong:contains(公司行业)");
                String comp_industry_name = "";
                if (comp_industry.size() > 0) {
                    Element comp_ele = comp_industry.get(0);
                    TextNode comp_ele_real = (TextNode) comp_ele.nextSibling();
                    comp_industry_name = comp_ele_real.text();
                    System.out.println("Industry: " + comp_industry_name);
                }

                // company type
                Elements compTypeEles = detailDoc.select("strong:contains(公司性质)");
                String comType = "";
                if (compTypeEles.size() > 0) {
                    Element compTypeEle = compTypeEles.get(0);
                    TextNode comTypeNode = (TextNode) compTypeEle.nextSibling();
                    comType = comTypeNode.text();
                    System.out.println("Company type: " + comType);
                }

                // company scale
                Elements compScaleEles = detailDoc.select("strong:contains(公司规模)");
                String comScale = "";
                if (compScaleEles.size() > 0) {
                    comScale = ((TextNode) compScaleEles.get(0).nextSibling()).text();
                    System.out.println("Company scale: " + comScale);
                }

                // publish date
                Elements publishTimeEles = detailDoc.select("td:contains(发布日期)");
                Element publishTimeEle = publishTimeEles.get(0).nextElementSibling();
                String publishTime = publishTimeEle.text();
                System.out.println("Published: " + publishTime);

                // work location
                Elements placeEles = detailDoc.select("td:contains(工作地点)");
                String place = "";
                if (placeEles.size() > 0) {
                    place = placeEles.get(0).nextElementSibling().text();
                    System.out.println("Location: " + place);
                }

                // job description
                Elements jobDeteilEle = detailDoc.select("td.txt_4.wordBreakNormal.job_detail");
                Elements jobDetailDivs = jobDeteilEle.get(0).select("div");
                Element jobDetailDiv = jobDetailDivs.get(0);
                String jobcontent = jobDetailDiv.html();

                Jobcrawl job = new Jobcrawl();
                job.setJobname(jobname);
                job.setCompanyname(companyName);
                job.setIndustry(comp_industry_name);
                job.setComtype(comType);
                job.setComscale(comScale);
                job.setPublishtime(publishTime);
                job.setPlace(place);
                job.setJobcontent(jobcontent);
                job.setUrl(detailUrl); // the POJO has a url field; store the detail page URL in it
                this.getHibernateTemplate().save(job);
                System.out.println("===========================");
            }
        }
    }
}
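
Since CrawlServiceImpl extends HibernateDaoSupport, it is meant to be wired up by Spring. As a rough illustration only, here is a runner that pulls the service out of the context and starts the crawl; the bean name and the config file name are assumptions, not the original project's code:

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

// hypothetical runner; assumes a "crawlService" bean (with a sessionFactory
// injected) is declared in applicationContext.xml
public class CrawlRunner {
    public static void main(String[] args) throws Exception {
        ApplicationContext ctx =
                new ClassPathXmlApplicationContext("applicationContext.xml");
        CrawlService crawlService = (CrawlService) ctx.getBean("crawlService");
        crawlService.doCrawl();
    }
}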

Next, build the index: create a folder and put the index files in it.

Define the interface, along with the path of the folder where the index will be stored:

public interface IndexService {
    public static final String INDEXPATH="D:\\Workspaces\\Job51\\indexDir";

    public void createIndex() throws Exception;
}

Implement the interface:

import java.io.File;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.springframework.orm.hibernate3.support.HibernateDaoSupport;
import org.wltea.analyzer.lucene.IKAnalyzer;
import pojo.Jobcrawl;

public class IndexServiceImpl extends HibernateDaoSupport implements IndexService {
    @SuppressWarnings("unchecked")
    public void createIndex() throws Exception {
        // directory object for the index folder
        Directory dir = FSDirectory.open(new File(IndexService.INDEXPATH));
        // Chinese analyzer; must match the analyzer used at query time
        Analyzer analyzer = new IKAnalyzer();
        // IndexWriter configuration; use the same Version constant as the search side
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);
        // the IndexWriter writes the index
        IndexWriter writer = new IndexWriter(dir, iwc);
        List<Jobcrawl> list = this.getHibernateTemplate().find("from Jobcrawl");
        // rebuild the index from scratch on every run
        writer.deleteAll();
        for (Jobcrawl job : list) {
            Document doc = new Document();
            Field toidField = new Field("toid", job.getToid(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
            doc.add(toidField);

            Field jobField = new Field("jobname", job.getJobname(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
            doc.add(jobField);

            Field companyField = new Field("companyname", job.getCompanyname(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
            doc.add(companyField);

            Field placeField = new Field("place", job.getPlace(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
            doc.add(placeField);

            Field publishTimeField = new Field("publishTime", job.getPublishtime(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
            doc.add(publishTimeField);

            // index all the fields together in one analyzed "content" field;
            // separate them with spaces so tokens do not merge across field boundaries
            String content = job.getJobname() + " " + job.getComtype() + " " + job.getIndustry() + " "
                    + job.getPlace() + " " + job.getWorkyear() + " " + job.getJobcontent();
            Field contentField = new Field("content", content, Field.Store.NO, Field.Index.ANALYZED);
            doc.add(contentField);
            writer.addDocument(doc);
        }
        writer.close();
    }
}
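
To see what the ANALYZED content field actually stores, you can print the terms the IK analyzer produces. A minimal sketch (the sample sentence and class name are mine, chosen only for illustration):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IKAnalyzer();
        // tokenize an arbitrary sample string the same way the index does
        TokenStream ts = analyzer.tokenStream("content", new StringReader("上海java软件工程师"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // one indexed term per line
        }
        ts.end();
        ts.close();
    }
}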

Now run the search. Define the interface:

import java.util.List;
import pojo.Jobcrawl;
public interface SearchService {
    public List<Jobcrawl> searchJob(String keyword)throws Exception;
}

Implement the interface:

import java.io.File;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;
import pojo.Jobcrawl;

public class SearchServiceImpl implements SearchService {
    @Override
    public List<Jobcrawl> searchJob(String keyword) throws Exception {
        IndexSearcher searcher = new IndexSearcher(FSDirectory
                .open(new File(IndexService.INDEXPATH)));
        // Chinese analyzer; the query-time analyzer must match the one used to build the index
        Analyzer analyzer = new IKAnalyzer();
        // query parser over the "content" field
        QueryParser parser = new QueryParser(Version.LUCENE_34, "content",
                analyzer);
        // build the query from the keyword
        Query query = parser.parse(keyword);
        TopDocs top_docs = searcher.search(query, 20);
        ScoreDoc[] docs = top_docs.scoreDocs;
        // highlighting: wrap matched terms in a red <font> tag
        SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
        Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(1024));
        List<Jobcrawl> list = new ArrayList<Jobcrawl>();
        for (ScoreDoc sd : docs) {
            Document pojoDoc = searcher.doc(sd.doc);
            Jobcrawl job = new Jobcrawl();
            job.setToid(pojoDoc.get("toid"));

            // getBestFragment returns null when the field contains no match,
            // so fall back to the plain stored value
            String jobname = pojoDoc.get("jobname");
            TokenStream tokenStream = analyzer.tokenStream("jobname", new StringReader(jobname));
            String jobname_high = highlighter.getBestFragment(tokenStream, jobname);
            if (jobname_high != null) {
                jobname = jobname_high;
            }
            job.setJobname(jobname);

            String companyname = pojoDoc.get("companyname");
            tokenStream = analyzer.tokenStream("companyname", new StringReader(companyname));
            String companyname_high = highlighter.getBestFragment(tokenStream, companyname);
            if (companyname_high != null) {
                companyname = companyname_high;
            }
            job.setCompanyname(companyname);

            String place = pojoDoc.get("place");
            tokenStream = analyzer.tokenStream("place", new StringReader(place));
            String place_high = highlighter.getBestFragment(tokenStream, place);
            if (place_high != null) {
                place = place_high;
            }
            job.setPlace(place);

            job.setPublishtime(pojoDoc.get("publishTime"));
            list.add(job);
        }
        searcher.close();
        return list;
    }

    // standalone smoke test for the index and the highlighter
    public static void main(String[] args) throws Exception {
        String keyword = "android";
        IndexSearcher searcher = new IndexSearcher(FSDirectory
                .open(new File(IndexService.INDEXPATH)));
        Analyzer analyzer = new IKAnalyzer();
        QueryParser parser = new QueryParser(Version.LUCENE_34, "content",
                analyzer);
        Query query = parser.parse(keyword);
        TopDocs top_docs = searcher.search(query, 20);
        ScoreDoc[] docs = top_docs.scoreDocs;

        SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
        Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(1024));
        for (ScoreDoc sd : docs) {
            Document pojoDoc = searcher.doc(sd.doc);
            String jobname = pojoDoc.get("jobname");
            TokenStream tokenStream = analyzer.tokenStream("jobname", new StringReader(jobname));
            String highLightText = highlighter.getBestFragment(tokenStream, jobname);
            System.out.println(highLightText);
        }
        searcher.close();
    }
}
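
The search form in the JSPs further below posts to userAction!searchJob.action. The original action class isn't shown, but here is a hedged Struts2 sketch of what it might look like; the class name, result, and Spring injection are all my assumptions:

import java.util.List;
import com.opensymphony.xwork2.ActionSupport;
import org.apache.struts2.ServletActionContext;
import pojo.Jobcrawl;

// hypothetical action backing userAction!searchJob.action
public class UserAction extends ActionSupport {
    private String keyword;               // bound from the search form field
    private SearchService searchService;  // injected by Spring

    public String searchJob() throws Exception {
        List<Jobcrawl> results = searchService.searchJob(keyword);
        // the listing JSP reads ${requestScope.results}
        ServletActionContext.getRequest().setAttribute("results", results);
        return SUCCESS; // mapped to the job-listing JSP in struts.xml
    }

    public void setKeyword(String keyword) { this.keyword = keyword; }
    public void setSearchService(SearchService searchService) { this.searchService = searchService; }
}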

A small admin page triggers the crawl and the index build:

<%@ page language="java" pageEncoding="UTF-8"%>
<%
String path = request.getContextPath();
%>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Admin</title>

    <script type="text/javascript" src="<%=path%>/js/jquery-1.7.1.min.js"></script>
    <script type="text/javascript" src="<%=path%>/bootstrap/js/bootstrap.min.js"></script>
    <link rel="stylesheet" media="screen"
          href="<%=path%>/bootstrap/css/bootstrap.min.css">
    <link rel="stylesheet" href="<%=path%>/bootstrap/css/bootstrap-responsive.min.css">

  </head>

  <body>
  <div class="container">
    <div class="row">
      <div class="span4">
         <ul class="nav nav-pills">
         <li><a href="<%=path%>/index/userAction!doCrawl.action">Crawl jobs</a></li>
         <li><a href="<%=path%>/index/userAction!doIndex.action">Build index</a></li>
         <li class="active"><a href="#">Edit account</a></li>
         </ul>
      </div>
    </div>
  </div>

  </body>
</html>

Next, a search box:

<%@ page language="java" pageEncoding="UTF-8"%>
<%
String path = request.getContextPath();
%>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Welcome to the search engine</title>
    <script type="text/javascript" src="<%=path%>/js/jquery-1.7.1.min.js"></script>
    <script type="text/javascript" src="<%=path%>/bootstrap/js/bootstrap.min.js"></script>
    <link rel="stylesheet" media="screen"
          href="<%=path%>/bootstrap/css/bootstrap.min.css">
    <link rel="stylesheet" href="<%=path%>/bootstrap/css/bootstrap-responsive.min.css">
  </head>

  <body>
    <div class="container">
      <div class="row">
        <div class="span2 offset10"><a href="<%=path%>/account/accountAction!toRegister.action">Register</a>, <a href="<%=path%>/account/accountAction!toLogin.action">Log in</a></div>
      </div>
     <form action="<%=path%>/web/userAction!searchJob.action" class="form-search">
     <div class="row" style="margin-top:200px;">
        <div class="span3">&nbsp;</div>
        <div class="span4">

         <input type="text" name="keyword" class="input-max search-query input-block-level">

        </div>
        <div class="span2" style="margin-left:10px;"><button type="submit" class="btn">Search</button></div>
        <div class="span3">&nbsp;</div>
      </div>
      </form>
    </div>
    </body>
</html>

And the job listing page:

<%@ page language="java" import="java.util.*" pageEncoding="UTF-8"%>
<%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c"%>
<%
String path = request.getContextPath();
%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>Job search</title>
    <script type="text/javascript" src="<%=path%>/js/jquery-1.7.1.min.js"></script>
    <script type="text/javascript" src="<%=path%>/bootstrap/js/bootstrap.min.js"></script>
    <link rel="stylesheet" media="screen"
          href="<%=path%>/bootstrap/css/bootstrap.min.css">
    <link rel="stylesheet" href="<%=path%>/bootstrap/css/bootstrap-responsive.min.css">
  </head>
                                           
  <body>
    <div class="container">
      <div class="row" style="margin-top:10px;">
        <div class="offset2 span4">
          <form action="<%=path%>/web/userAction!searchJob.action" class="form-search">
          <div>
          <input type="text" Style="height:30px" name="keyword" value="${param.keyword}" class="span2 search-query">
          <button type="submit" class="btn">Search</button>
          </div>
          </form>
        </div>
      </div>
      <div class="row">
        <div class="offset2 span12">
         <table class="table table-bordered table-striped">
         <tr>
            <td>Job Title</td>
            <td>Company</td>
            <td>Location</td>
            <td>Updated</td>
         </tr>
         <c:forEach var="job" items="${requestScope.results}" >
         <tr>
            <td><a target="_blank" href="<%=path%>/web/userAction!searchJobDetail.action?toid=${job.toid}">${job.jobname}</a></td>
            <td>${job.companyname}</td>
            <td>${job.place}</td>
            <td>${job.publishtime}</td>
         </tr>
         </c:forEach>
         </table>
        </div>
      </div>
    </div>
  </body>
</html>

Don't forget the second directive at the top of the page; the JSTL c tags depend on it.

The project above is built on the SSH stack (Struts2 + Spring + Hibernate); the remaining CRUD operations you will need to write yourself.

1. Building the index

(screenshot: wKioL1MBlR3x-q_ZAABB8ujj1po033.jpg)

2. Running a search

(screenshot: wKioL1MBlR2SVInwAAA2mU8YaAo885.jpg)

3. Search results

(screenshot: wKiom1MBlUOB5IqwAAT3S77sTRQ377.jpg)

