Crawling and search are popular features these days, so I wanted to build a small version of both; I hope it is useful to you. The example throughout is crawling 51job.
To crawl data we first need a matching database table. I start with a POJO, from which you can create the table; the database used is MySQL.
Write the POJO, with constructors for initialization:
public class Jobcrawl implements java.io.Serializable {

    // primary key
    private String toid;
    // job title
    private String jobname;
    // company name
    private String companyname;
    // company type
    private String comtype;
    // publish date of the posting
    private String publishtime;
    // work location
    private String place;
    // number of openings
    private Integer requirecount;
    // required years of experience
    private String workyear;
    // required education
    private String qualifications;
    // job description
    private String jobcontent;
    // URL of the company detail page (on 51job)
    private String url;
    // company industry
    private String industry;
    // company size
    private String comscale;

    /** default constructor */
    public Jobcrawl() {
    }

    /** minimal constructor */
    public Jobcrawl(String toid) {
        this.toid = toid;
    }

    /** full constructor */
    public Jobcrawl(String toid, String jobname, String companyname,
            String comtype, String publishtime, String place,
            Integer requirecount, String workyear, String qualifications,
            String jobcontent, String url) {
        this.toid = toid;
        this.jobname = jobname;
        this.companyname = companyname;
        this.comtype = comtype;
        this.publishtime = publishtime;
        this.place = place;
        this.requirecount = requirecount;
        this.workyear = workyear;
        this.qualifications = qualifications;
        this.jobcontent = jobcontent;
        this.url = url;
    }

    public String getToid() { return this.toid; }
    public void setToid(String toid) { this.toid = toid; }
    public String getJobname() { return this.jobname; }
    public void setJobname(String jobname) { this.jobname = jobname; }
    public String getCompanyname() { return this.companyname; }
    public void setCompanyname(String companyname) { this.companyname = companyname; }
    public String getComtype() { return this.comtype; }
    public void setComtype(String comtype) { this.comtype = comtype; }
    public String getPublishtime() { return this.publishtime; }
    public void setPublishtime(String publishtime) { this.publishtime = publishtime; }
    public String getPlace() { return this.place; }
    public void setPlace(String place) { this.place = place; }
    public Integer getRequirecount() { return this.requirecount; }
    public void setRequirecount(Integer requirecount) { this.requirecount = requirecount; }
    public String getWorkyear() { return this.workyear; }
    public void setWorkyear(String workyear) { this.workyear = workyear; }
    public String getQualifications() { return this.qualifications; }
    public void setQualifications(String qualifications) { this.qualifications = qualifications; }
    public String getJobcontent() { return this.jobcontent; }
    public void setJobcontent(String jobcontent) { this.jobcontent = jobcontent; }
    public String getUrl() { return this.url; }
    public void setUrl(String url) { this.url = url; }
    public String getIndustry() { return this.industry; }
    public void setIndustry(String industry) { this.industry = industry; }
    public String getComscale() { return this.comscale; }
    public void setComscale(String comscale) { this.comscale = comscale; }
}
To crawl the site we need to request its URLs and run a search on it:
1. Pick what you want to crawl. For example, to crawl Java positions in Shanghai, request the URL of that search-result page.
2. Open an individual position and study the page source to find the markers of the fields you want (attributes, elements, ids, tag names). Parse them out with jsoup and store them in the database.
3. Once the data is stored, analyze the database fields with a tokenizer and build an index per field; you can also index all fields together in one catch-all field (which is what I did).
4. Type in a keyword; any record whose fields contain the word comes back, with the keyword highlighted (the highlighting is an extra feature I added).
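To make step 3 concrete, here is a toy illustration of the inverted index that Lucene maintains for us. The class and method names are made up for illustration, and whitespace splitting stands in for the IK Analyzer, which is what actually segments Chinese text:

```java
import java.util.*;

// Toy inverted index: maps each term to the set of document ids containing it.
// Lucene does this (plus analysis, scoring and storage) for real; this only
// illustrates why keyword lookup is fast once the index exists.
public class TinyIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Split a document into lowercase terms and record the doc id under each.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up a single term; returns the ids of all documents containing it.
    public Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyIndex idx = new TinyIndex();
        idx.add(1, "java developer Shanghai");
        idx.add(2, "android developer Beijing");
        idx.add(3, "senior java engineer");
        System.out.println(idx.search("java"));      // [1, 3]
        System.out.println(idx.search("developer")); // [1, 2]
    }
}
```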
With the analysis done, let's implement it.
First, crawl the site. Define the service interface:
public interface CrawlService {
    public void doCrawl() throws Exception;
}
Implement the interface:
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.springframework.orm.hibernate3.support.HibernateDaoSupport;

import pojo.Jobcrawl;

public class CrawlServiceImpl extends HibernateDaoSupport implements CrawlService {

    @Override
    public void doCrawl() throws Exception {
        HttpClient httpClient = new HttpClient();
        GetMethod getMethod = new GetMethod(
                "http://search.51job.com/list/%2B,%2B,%2B,%2B,%2B,%2B,java,2,%2B.html?lang=c&stype=1");
        httpClient.executeMethod(getMethod);
        String html = getMethod.getResponseBodyAsString();
        // the page is served as gb2312 but was decoded as iso8859-1, so re-decode it
        html = new String(html.getBytes("iso8859-1"), "gb2312");
        Document doc = Jsoup.parse(html);

        Element totalCountEle = doc.select("table.navBold").select("td").get(1);
        String totalCountStr = totalCountEle.text();
        totalCountStr = totalCountStr.split("/")[1].trim();
        int totalCount = Integer.parseInt(totalCountStr);
        // total number of result pages (30 results per page)
        int pageCount = totalCount / 30;

        // only the first few pages are crawled here; loop up to pageCount to get them all
        for (int currentPage = 1; currentPage < 5; currentPage++) {
            GetMethod gmPerPage = new GetMethod(
                    "http://search.51job.com/jobsearch/search_result.php?curr_page="
                            + currentPage + "&&keyword=java");
            httpClient.executeMethod(gmPerPage);
            String perPageHtml = gmPerPage.getResponseBodyAsString();
            perPageHtml = new String(perPageHtml.getBytes("iso8859-1"), "gb2312");
            Document pageDoc = Jsoup.parse(perPageHtml);
            Elements eles = pageDoc.select("a.jobname");
            for (int i = 0; i < eles.size(); i++) {
                Element ele = eles.get(i);
                // URL of the detail page
                String detailUrl = ele.attr("href");
                GetMethod detailGet = new GetMethod(detailUrl);
                httpClient.executeMethod(detailGet);
                String detailHtml = detailGet.getResponseBodyAsString();
                detailHtml = new String(detailHtml.getBytes("iso8859-1"), "gb2312");
                Document detailDoc = Jsoup.parse(detailHtml);

                // job title
                Elements detailEles = detailDoc.select("td.sr_bt");
                Element jobnameEle = detailEles.get(0);
                String jobname = jobnameEle.text();
                System.out.println("Job title: " + jobname);

                // company name
                Elements companyEles = detailDoc.select("table.jobs_1");
                Element companyEle = companyEles.get(0);
                Element companyEle_Rel = companyEle.select("a").get(0);
                String companyName = companyEle_Rel.text();
                System.out.println("Company: " + companyName);

                // company industry (the selector text must stay Chinese: it matches the page)
                Elements comp_industry = detailDoc.select("strong:contains(公司行业)");
                String comp_industry_name = "";
                if (comp_industry.size() > 0) {
                    Element comp_ele = comp_industry.get(0);
                    TextNode comp_ele_real = (TextNode) comp_ele.nextSibling();
                    comp_industry_name = comp_ele_real.text();
                    System.out.println("Industry: " + comp_industry_name);
                }

                // company type
                Elements compTypeEles = detailDoc.select("strong:contains(公司性质)");
                String comType = "";
                if (compTypeEles.size() > 0) {
                    Element compTypeEle = compTypeEles.get(0);
                    TextNode comTypeNode = (TextNode) compTypeEle.nextSibling();
                    comType = comTypeNode.text();
                    System.out.println("Company type: " + comType);
                }

                // company size
                Elements compScaleEles = detailDoc.select("strong:contains(公司规模)");
                String comScale = "";
                if (compScaleEles.size() > 0) {
                    comScale = ((TextNode) compScaleEles.get(0).nextSibling()).text();
                    System.out.println("Company size: " + comScale);
                }

                // publish date (guarded like the other lookups)
                Elements publishTimeEles = detailDoc.select("td:contains(发布日期)");
                String publishTime = "";
                if (publishTimeEles.size() > 0) {
                    publishTime = publishTimeEles.get(0).nextElementSibling().text();
                    System.out.println("Published: " + publishTime);
                }

                // work location
                Elements placeEles = detailDoc.select("td:contains(工作地点)");
                String place = "";
                if (placeEles.size() > 0) {
                    place = placeEles.get(0).nextElementSibling().text();
                    System.out.println("Location: " + place);
                }

                // job description
                Elements jobDetailEle = detailDoc.select("td.txt_4.wordBreakNormal.job_detail");
                Elements jobDetailDivs = jobDetailEle.get(0).select("div");
                Element jobDetailDiv = jobDetailDivs.get(0);
                String jobcontent = jobDetailDiv.html();

                Jobcrawl job = new Jobcrawl();
                job.setJobname(jobname);
                job.setCompanyname(companyName);
                job.setIndustry(comp_industry_name);
                job.setComtype(comType);
                job.setComscale(comScale);
                job.setPublishtime(publishTime);
                job.setPlace(place);
                job.setJobcontent(jobcontent);
                this.getHibernateTemplate().save(job);
                System.out.println("===========================");
            }
        }
    }
}
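One line in the crawler deserves a note: `html = new String(html.getBytes("iso8859-1"), "gb2312")`. HttpClient decoded the gb2312 response bytes as ISO-8859-1, and because ISO-8859-1 maps every byte to exactly one char, re-encoding recovers the original bytes losslessly. A self-contained sketch of that round trip (using GBK, a superset of gb2312, purely for illustration; the class and method names are made up):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static final Charset GBK = Charset.forName("GBK");

    // Simulate a client that received bytes in realCharset but decoded them
    // as ISO-8859-1, producing mojibake.
    public static String misdecode(String original, Charset realCharset) {
        return new String(original.getBytes(realCharset), StandardCharsets.ISO_8859_1);
    }

    // Recover the text: ISO-8859-1 is byte-transparent, so getBytes() on the
    // mis-decoded string returns the original wire bytes unchanged.
    public static String fix(String misdecoded, Charset realCharset) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), realCharset);
    }

    public static void main(String[] args) {
        String original = "工作地点:上海";
        String garbled = misdecode(original, GBK);
        System.out.println(fix(garbled, GBK).equals(original)); // true
    }
}
```

The same trick fails for charsets that are not byte-transparent (e.g. if the response had been mis-decoded as UTF-8, invalid sequences would already have been replaced and the bytes lost).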
Next, build the index. Create a folder to hold the index files.
Define the interface, which also fixes where the index is stored:
public interface IndexService {
    public static final String INDEXPATH = "D:\\Workspaces\\Job51\\indexDir";
    public void createIndex() throws Exception;
}
Implement the interface:
import java.io.File;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.springframework.orm.hibernate3.support.HibernateDaoSupport;
import org.wltea.analyzer.lucene.IKAnalyzer;

import pojo.Jobcrawl;

public class IndexServiceImpl extends HibernateDaoSupport implements IndexService {

    public void createIndex() throws Exception {
        // directory holding the index files
        Directory dir = FSDirectory.open(new File(IndexService.INDEXPATH));
        // Chinese analyzer
        Analyzer analyzer = new IKAnalyzer();
        // configuration for the IndexWriter
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
        // the IndexWriter writes the index
        IndexWriter writer = new IndexWriter(dir, iwc);

        List<Jobcrawl> list = this.getHibernateTemplate().find("from Jobcrawl");
        // rebuild the index from scratch on every run
        writer.deleteAll();
        for (Jobcrawl job : list) {
            Document doc = new Document();
            doc.add(new Field("toid", job.getToid(), Field.Store.YES,
                    Field.Index.NOT_ANALYZED_NO_NORMS));
            doc.add(new Field("jobname", job.getJobname(), Field.Store.YES,
                    Field.Index.NOT_ANALYZED_NO_NORMS));
            doc.add(new Field("companyname", job.getCompanyname(), Field.Store.YES,
                    Field.Index.NOT_ANALYZED_NO_NORMS));
            doc.add(new Field("place", job.getPlace(), Field.Store.YES,
                    Field.Index.NOT_ANALYZED_NO_NORMS));
            doc.add(new Field("publishTime", job.getPublishtime(), Field.Store.YES,
                    Field.Index.NOT_ANALYZED_NO_NORMS));
            // index all the fields together in one catch-all field
            String content = job.getJobname() + job.getComtype() + job.getIndustry()
                    + job.getPlace() + job.getWorkyear() + job.getJobcontent();
            doc.add(new Field("content", content, Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();
    }
}
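One caveat about the catch-all `content` field built above: concatenating the fields directly can glue the last token of one field to the first token of the next, which can distort tokenization at the boundary. Joining with a space avoids that. A minimal sketch (the helper class is made up; the parameters mirror the POJO fields):

```java
import java.util.Arrays;

public class ContentField {
    // Build the catch-all search text from the individual fields,
    // separated by spaces so terms never merge across field boundaries.
    public static String build(String jobname, String comtype, String industry,
                               String place, String workyear, String jobcontent) {
        return String.join(" ", Arrays.asList(
                jobname, comtype, industry, place, workyear, jobcontent));
    }

    public static void main(String[] args) {
        // without the separator, "Shanghai" and "3" would fuse into one token run
        System.out.println(build("java developer", "private company", "internet",
                "Shanghai", "3 years", "backend development"));
    }
}
```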
Now the search. Define the interface:
import java.util.List;

import pojo.Jobcrawl;

public interface SearchService {
    public List<Jobcrawl> searchJob(String keyword) throws Exception;
}
Implement the interface:
import java.io.File;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

import pojo.Jobcrawl;

public class SearchServiceImpl implements SearchService {

    @Override
    public List<Jobcrawl> searchJob(String keyword) throws Exception {
        IndexSearcher searcher = new IndexSearcher(
                FSDirectory.open(new File(IndexService.INDEXPATH)));
        // Chinese analyzer: querying must use the same analyzer as indexing
        Analyzer analyzer = new IKAnalyzer();
        // query parser over the catch-all content field
        QueryParser parser = new QueryParser(Version.LUCENE_34, "content", analyzer);
        // build the query from the keyword
        Query query = parser.parse(keyword);
        TopDocs top_docs = searcher.search(query, 20);
        ScoreDoc[] docs = top_docs.scoreDocs;

        // highlighting: wrap matched terms in a red font tag
        SimpleHTMLFormatter simpleHTMLFormatter =
                new SimpleHTMLFormatter("<font color='red'>", "</font>");
        Highlighter highlighter =
                new Highlighter(simpleHTMLFormatter, new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(1024));

        List<Jobcrawl> list = new ArrayList<Jobcrawl>();
        for (ScoreDoc sd : docs) {
            Document pojoDoc = searcher.doc(sd.doc);
            Jobcrawl job = new Jobcrawl();
            job.setToid(pojoDoc.get("toid"));

            String jobname = pojoDoc.get("jobname");
            TokenStream tokenStream =
                    analyzer.tokenStream("jobname", new StringReader(jobname));
            String jobname_high = highlighter.getBestFragment(tokenStream, jobname);
            // getBestFragment returns null when the field contains no match
            if (jobname_high != null) {
                jobname = jobname_high;
            }
            job.setJobname(jobname);

            String companyname = pojoDoc.get("companyname");
            tokenStream = analyzer.tokenStream("companyname", new StringReader(companyname));
            String companyname_high = highlighter.getBestFragment(tokenStream, companyname);
            if (companyname_high != null) {
                companyname = companyname_high;
            }
            job.setCompanyname(companyname);

            String place = pojoDoc.get("place");
            tokenStream = analyzer.tokenStream("place", new StringReader(place));
            String place_high = highlighter.getBestFragment(tokenStream, place);
            if (place_high != null) {
                place = place_high;
            }
            job.setPlace(place);

            job.setPublishtime(pojoDoc.get("publishTime"));
            list.add(job);
        }
        return list;
    }

    public static void main(String[] args) throws Exception {
        // quick manual test of search plus highlighting
        String keyword = "android";
        IndexSearcher searcher = new IndexSearcher(
                FSDirectory.open(new File(IndexService.INDEXPATH)));
        Analyzer analyzer = new IKAnalyzer();
        QueryParser parser = new QueryParser(Version.LUCENE_34, "content", analyzer);
        Query query = parser.parse(keyword);
        TopDocs top_docs = searcher.search(query, 20);
        ScoreDoc[] docs = top_docs.scoreDocs;
        SimpleHTMLFormatter simpleHTMLFormatter =
                new SimpleHTMLFormatter("<font color='red'>", "</font>");
        Highlighter highlighter =
                new Highlighter(simpleHTMLFormatter, new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(1024));
        for (ScoreDoc sd : docs) {
            Document pojoDoc = searcher.doc(sd.doc);
            String jobname = pojoDoc.get("jobname");
            TokenStream tokenStream =
                    analyzer.tokenStream("jobname", new StringReader(jobname));
            String highLightText = highlighter.getBestFragment(tokenStream, jobname);
            System.out.println(highLightText);
        }
    }
}
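The Highlighter above wraps each matched term in `<font color='red'>` tags. For a single literal keyword the effect can be sketched with a plain regex; this is only an illustration of the output SimpleHTMLFormatter produces, not a substitute for the Lucene highlighter (which works on analyzed tokens and query scores), and the class is made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveHighlighter {
    // Wrap every case-insensitive occurrence of the literal keyword
    // in a red font tag, mimicking SimpleHTMLFormatter's output.
    public static String highlight(String text, String keyword) {
        Pattern p = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        // $0 re-inserts the matched text with its original casing
        return m.replaceAll("<font color='red'>$0</font>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("Senior Java Developer", "java"));
        // Senior <font color='red'>Java</font> Developer
    }
}
```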
An admin page to trigger the crawl and the index build:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Admin</title>
<script type="text/javascript" src="<%=path%>/js/jquery-1.7.1.min.js"></script>
<script type="text/javascript" src="<%=path%>/bootstrap/js/bootstrap.min.js"></script>
<link rel="stylesheet" media="screen" href="<%=path%>/bootstrap/css/bootstrap.min.css">
<link rel="stylesheet" href="<%=path%>/bootstrap/css/bootstrap-responsive.min.css">
</head>
<body>
<div class="container">
  <div class="row">
    <div class="span4">
      <ul class="nav nav-pills">
        <li><a href="<%=path%>/index/userAction!doCrawl.action">Crawl data</a></li>
        <li><a href="<%=path%>/index/userAction!doIndex.action">Build index</a></li>
        <li class="active"><a href="#">Edit account</a></li>
      </ul>
    </div>
  </div>
</div>
</body>
</html>
Now a search box:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Welcome to the search engine</title>
<script type="text/javascript" src="<%=path%>/js/jquery-1.7.1.min.js"></script>
<script type="text/javascript" src="<%=path%>/bootstrap/js/bootstrap.min.js"></script>
<link rel="stylesheet" media="screen" href="<%=path%>/bootstrap/css/bootstrap.min.css">
<link rel="stylesheet" href="<%=path%>/bootstrap/css/bootstrap-responsive.min.css">
</head>
<body>
<div class="container">
  <div class="row">
    <div class="span2 offset10">
      <a href="<%=path%>/account/accountAction!toRegister.action">Register</a>,
      <a href="/account/accountAction!toLogin.action">Log in</a>
    </div>
  </div>
  <form action="<%=path%>/web/userAction!searchJob.action" class="form-search">
    <div class="row" style="margin-top:200px;">
      <div class="span3"></div>
      <div class="span4">
        <input type="text" name="keyword" class="input-max search-query input-block-level">
      </div>
      <div class="span2" style="margin-left:10px;"><button type="submit" class="btn">Search</button></div>
      <div class="span3"></div>
    </div>
  </form>
</div>
</body>
</html>
Displaying the positions:
<%@ page language="java" import="java.util.*" pageEncoding="UTF-8"%>
<%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c"%>
<%
String path = request.getContextPath();
%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Job search</title>
<script type="text/javascript" src="<%=path%>/js/jquery-1.7.1.min.js"></script>
<script type="text/javascript" src="<%=path%>/bootstrap/js/bootstrap.min.js"></script>
<link rel="stylesheet" media="screen" href="<%=path%>/bootstrap/css/bootstrap.min.css">
<link rel="stylesheet" href="<%=path%>/bootstrap/css/bootstrap-responsive.min.css">
</head>
<body>
<div class="container">
  <div class="row" style="margin-top:10px;">
    <div class="offset2 span4">
      <form action="<%=path%>/web/userAction!searchJob.action" class="form-search">
        <div>
          <input type="text" style="height:30px" name="keyword" value="${param.keyword}" class="span2 search-query">
          <button type="submit" class="btn">Search</button>
        </div>
      </form>
    </div>
  </div>
  <div class="row">
    <div class="offset2 span12">
      <table class="table table-bordered table-striped">
        <tr>
          <td>Job title</td>
          <td>Company</td>
          <td>Location</td>
          <td>Updated</td>
        </tr>
        <c:forEach var="job" items="${requestScope.results}">
          <tr>
            <td><a target="_blank" href="<%=path%>/web/userAction!searchJobDetail.action?toid=${job.toid}">${job.jobname}</a></td>
            <td>${job.companyname}</td>
            <td>${job.place}</td>
            <td>${job.publishtime}</td>
          </tr>
        </c:forEach>
      </table>
    </div>
  </div>
</div>
</body>
</html>
Don't forget the second directive at the top of that page (the taglib line); the table uses the JSTL c tags.
The project above is built on the three classic frameworks (Struts2, Spring, and Hibernate); the remaining CRUD operations you will have to write yourself.
1. Build the index
2. Run the search
3. The search results