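Excerpted from: http://jnotnull.iteye.com/blog/275327
Goal: build a highly portable site search whose index is rebuilt on a schedule.
Approach: ship both the dictionary (dic) and the index inside the application itself.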
Preparation:
1 Lucene
Lucene Java (hereafter Lucene): the current release is 2.4.0. For details see http://lucene.apache.org/java/docs/index.html.
2 Paoding
An excellent Chinese word segmentation component for Lucene, written by Qieqie. The current version is paoding-analysis-2.0.4-beta, which targets Lucene 2.2. For details see http://code.google.com/p/paoding/.
3 Download the latest paoding-analysis-2.0.4-beta package (it bundles lucene-core-2.2.0.jar, lucene-analyzers-2.2.0.jar, lucene-highlighter-2.2.0.jar, junit.jar and commons-logging.jar).
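Getting started:
1 Trial run
Open the examples folder in the downloaded package and run the examples (pay attention to the file encoding).
2 Integrating into an SSH2 (Struts 2, Spring, Hibernate) system (layering: Action -> service -> dao)
1) Because an SSH2 system is a web application, configuring Paoding may differ somewhat from step 1.
Copy everything under the src folder of the paoding folder, together with the dic folder, into your project. Open paoding-dic-home.properties and set paoding.dic.home.config-fisrt=this so that the program uses this configuration file, then set paoding.dic.home=classpath:dic so that the dictionary is resolved inside the project. Save the file. I chose classpath:dic for portability. With an absolute path nothing more is needed, but if you specify classpath:dic you have to change the Paoding source. In PaodingMaker.java, find the setDicHomeProperties method and replace File dicHomeFile = getFile(dicHome); with: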
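// Requires java.net.URLDecoder and java.io.UnsupportedEncodingException.
File dicHomeFile2 = getFile(dicHome);
String path = "";
try {
    // Turn percent-encoded characters (spaces, Chinese characters) back into their literal form.
    path = URLDecoder.decode(dicHomeFile2.getPath(), "UTF-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
File dicHomeFile = new File(path);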
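The point is to decode the path. Otherwise, if the dictionary path contains spaces or Chinese characters, Paoding throws an exception saying the dictionary cannot be found.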
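To make the effect of the decoding step concrete, here is a minimal standalone sketch. The class name DicPathDecodeDemo and the sample path are made up for illustration and are not part of Paoding; only URLDecoder.decode is the same call used in the patch above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of Paoding.
class DicPathDecodeDemo {

    // Decodes a percent-encoded path such as one derived from a classpath resource URL.
    static String decodePath(String encodedPath) {
        try {
            return URLDecoder.decode(encodedPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up path with an encoded space (%20) and two encoded Chinese characters.
        String encoded = "/opt/my%20app/WEB-INF/classes/%E8%AF%8D%E5%85%B8/dic";
        System.out.println(decodePath(encoded));
    }
}
```

One caveat of this approach: URLDecoder also turns a literal + into a space, so a directory name that really contains a plus sign would be altered.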
2) Table structure
CREATE TABLE `news` (
  `id` int(11) NOT NULL auto_increment,
  `title` varchar(255) default NULL,
  `details` mediumtext,
  `author` varchar(255) default NULL,
  `publisher` varchar(100) default NULL,
  `clicks` int(11) default NULL,
  `source` varchar(255) default NULL,
  `addtime` datetime default NULL,
  `category` varchar(100) default NULL,
  `keywords` varchar(255) default NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=gbk;
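3 Writing the code
The site search is written in two steps, creating the index and searching it. Classes needed: SearchAction.java and TaskAction.java (in the same directory).
1) Creating the index
Main task: read from an existing txt file the id of the last news item indexed in the previous run, fetch from the business layer every news item whose id is greater than that, index them, and finally write the id of the last item indexed in this run back to the txt file. Paths need careful handling here. All txt files that record ids are placed in the action directory.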
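The read, index, write-back cycle described above can be sketched without Lucene as follows. This is only an illustration under stated assumptions: the class IndexCheckpoint, its method names and the use of a plain list of ids in place of the business layer are all hypothetical, and the actual indexing call is reduced to a counter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that illustrates the checkpoint idea; names are not from the original project.
class IndexCheckpoint {

    // Returns the last indexed id, or 0 if the file is missing, empty, or not a number.
    static int readLastId(Path file) throws IOException {
        if (!Files.exists(file)) {
            return 0;
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        if (text.isEmpty()) {
            return 0;
        }
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    // Overwrites the checkpoint file with the given id.
    static void writeLastId(Path file, int id) throws IOException {
        Files.write(file, String.valueOf(id).getBytes(StandardCharsets.UTF_8));
    }

    // One indexing pass: handles every id above the checkpoint, then advances the checkpoint.
    // Returns how many ids were handled.
    static int indexNewerThanCheckpoint(Path file, List<Integer> allIds) throws IOException {
        int lastId = readLastId(file);
        int maxSeen = lastId;
        int indexed = 0;
        for (int id : allIds) {
            if (id > lastId) {
                // A real implementation would add a Lucene document for this id here.
                indexed++;
                maxSeen = Math.max(maxSeen, id);
            }
        }
        if (indexed > 0) {
            writeLastId(file, maxSeen);
        }
        return indexed;
    }

    public static void main(String[] args) throws IOException {
        Path checkpoint = Files.createTempFile("newsid", ".txt");
        System.out.println(indexNewerThanCheckpoint(checkpoint, Arrays.asList(1, 2, 3)));
        System.out.println(indexNewerThanCheckpoint(checkpoint, Arrays.asList(1, 2, 3, 4, 5)));
        Files.delete(checkpoint);
    }
}
```

The checkpoint is written only after the pass has finished, so a run that fails halfway will simply retry the same ids next time, at the cost of possibly indexing some items twice.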
Create TaskAction and add the following methods:
// Imports assumed (Lucene 2.2, Paoding 2.0.4-beta): java.io.*, java.net.URLDecoder,
// java.util.Iterator, java.util.List, org.apache.lucene.analysis.Analyzer,
// org.apache.lucene.document.Document, org.apache.lucene.document.Field,
// org.apache.lucene.index.IndexWriter, org.apache.lucene.store.FSDirectory,
// net.paoding.analysis.analyzer.PaodingAnalyzer.
// newsManageService is a field of TaskAction injected by Spring (declaration not shown).

public void createIndex() {
    try {
        // Two arguments: the index location, and the file holding the id of the last indexed news item.
        createNewsIndex(getPath(TaskAction.class, "date/index/news"), "newsid.txt");
    } catch (Exception e) {
        e.printStackTrace();
    }
}

// Resolves a resource located next to clazz to a file system path.
// substring(6) strips the leading "file:/", which yields a usable path on Windows only.
public String getPath(Class clazz, String textName) throws IOException {
    String path = (URLDecoder.decode(clazz.getResource(textName).toString(), "UTF-8")).substring(6);
    return path;
}

public void createNewsIndex(String path, String textName) throws Exception {
    String newsId = readText(TaskAction.class, textName);
    if (null == newsId || "".equals(newsId.trim()))
        newsId = "0";
    newsId = newsId.trim();
    // Use the Paoding Chinese analyzer.
    Analyzer analyzer = new PaodingAnalyzer();
    FSDirectory directory = FSDirectory.getDirectory(path);
    System.out.println(directory.toString());
    // Create a fresh index only if nothing has been indexed yet; otherwise append.
    IndexWriter writer = new IndexWriter(directory, analyzer, isEmpty(TaskAction.class, textName));
    // Fetch from the business layer every news item with an id greater than the stored one.
    List list = newsManageService.getNewsBigId(Integer.parseInt(newsId));
    Iterator iterator = list.iterator();
    // Tracks the id of the last item indexed in this run; stays at the old value if there is nothing new.
    String lastId = newsId;
    while (iterator.hasNext()) {
        Document doc = new Document();
        News news = (News) iterator.next();
        doc.add(new Field("id", "" + news.getId(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("title", "" + news.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("author", "" + news.getAuthor(), Field.Store.YES, Field.Index.TOKENIZED));
        // Constants.splitAndFilterString is a project helper (not shown).
        doc.add(new Field("details", "" + Constants.splitAndFilterString(news.getDetails()),
                Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("addtime", "" + news.getAddtime(), Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("keywords", "" + news.getKeywords(), Field.Store.YES, Field.Index.TOKENIZED));
        System.out.println("Indexing news " + news.getTitle() + "...");
        lastId = String.valueOf(news.getId());
        try {
            writer.addDocument(doc);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    // Optimize and close.
    writer.optimize();
    writer.close();
    // Write the id of the last indexed item to the file (the original wrote the old id back,
    // so the checkpoint never advanced).
    WriteText(TaskAction.class, textName, lastId);
}

// Returns true if no id has been recorded yet, i.e. the index has to be created from scratch.
public boolean isEmpty(Class clazz, String textName) throws Exception {
    String articleId = readText(clazz, textName);
    if (null == articleId || "".equals(articleId.trim()))
        articleId = "0";
    boolean isEmpty = articleId.trim().equals("0");
    System.out.println(clazz.getName() + " " + isEmpty);
    return isEmpty;
}

// Adapted from a method in the Paoding examples.
public String readText(Class clazz, String textName) throws IOException {
    InputStream in = clazz.getResourceAsStream(textName);
    Reader re = new InputStreamReader(in, "UTF-8");
    try {
        char[] chs = new char[1024];
        int count;
        String content = "";
        while ((count = re.read(chs)) != -1) {
            content = content + new String(chs, 0, count);
        }
        return content;
    } finally {
        re.close();
    }
}

// Overwrites the resource file with the given text and returns that text.
public String WriteText(Class clazz, String textName, String text) throws IOException {
    String path = (URLDecoder.decode(clazz.getResource(textName).toString(), "UTF-8")).substring(6);
    System.out.println(path);
    File file = new File(path);
    BufferedWriter bw = new BufferedWriter(new FileWriter(file));
    bw.write(text);
    bw.close();
    return text;
}
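The getPath and WriteText methods above turn a resource URL into a file path by decoding it and cutting off the first six characters. That removes "file:/" and leaves "C:/..." on Windows, but on Unix-like systems it also removes the leading slash. A possible platform-independent alternative, shown here only as a sketch, is to go through java.net.URI. The class ResourcePaths and the method toFile are hypothetical names, and the approach assumes the resource is a plain file on disk (an unpacked web application) rather than an entry inside a jar.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, not part of the original project.
class ResourcePaths {

    // Converts a file: URL to a File, decoding %20 and similar escapes along the way.
    static File toFile(URL resourceUrl) {
        if (!"file".equals(resourceUrl.getProtocol())) {
            throw new IllegalArgumentException("Not a file URL: " + resourceUrl);
        }
        try {
            return new File(resourceUrl.toURI());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot convert to a file: " + resourceUrl, e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up location with a space in a directory name; the file does not need to exist.
        File original = new File(System.getProperty("java.io.tmpdir"), "my index/newsid.txt");
        URL url = original.toURI().toURL();
        System.out.println(url);          // the space shows up as %20
        System.out.println(toFile(url));  // the space is restored
    }
}
```

In TaskAction this would be called as toFile(clazz.getResource(textName)), under the assumption that the URL returned by getResource is properly escaped.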
2) Searching
// Imports assumed (Lucene 2.2): java.io.StringReader, org.apache.lucene.analysis.Analyzer,
// org.apache.lucene.analysis.TokenStream, org.apache.lucene.document.Document,
// org.apache.lucene.index.IndexReader, org.apache.lucene.queryParser.MultiFieldQueryParser,
// org.apache.lucene.search.*, org.apache.lucene.search.highlight.*,
// org.apache.lucene.store.FSDirectory, net.paoding.analysis.analyzer.PaodingAnalyzer.

public void searchIndex(String path, String keywords) throws Exception {
    String[] fields = { "title", "details" };
    Analyzer analyzer = new PaodingAnalyzer();
    FSDirectory directory = FSDirectory.getDirectory(path);
    IndexReader reader = IndexReader.open(directory);
    // A match in either field is enough (SHOULD on both).
    BooleanClause.Occur[] flags = new BooleanClause.Occur[] {
            BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD };
    Query query = MultiFieldQueryParser.parse(keywords, fields, flags, analyzer);
    Searcher searcher = new IndexSearcher(directory);
    // Rewrite into primitive queries so that the highlighter sees the actual terms.
    query = query.rewrite(reader);
    System.out.println("Searching for: " + query.toString());
    Hits hits = searcher.search(query);
    // The pre and post tags are empty strings in the source; pass the markup that should
    // surround each matched term.
    SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("", "");
    Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(200));
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        String contents = doc.get("details");
        // Reset for every hit so that a hit without details does not reuse the previous fragment.
        String highLightText = "";
        if (contents != null) {
            TokenStream tokenStream = analyzer.tokenStream("details", new StringReader(contents));
            highLightText = highlighter.getBestFragment(tokenStream, contents);
        }
        NewsDTO news = new NewsDTO();
        news.setId(Integer.parseInt(doc.get("id")));
        news.setName(doc.get("title"));
        news.setDetails(highLightText);
        news.setAddtime(doc.get("addtime"));
        news.setAuthor(doc.get("author"));
        // searchResultItem is a field of the action that collects the results (declaration not shown).
        searchResultItem.add(news);
    }
    searcher.close();
    reader.close();
}
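That completes the core code, including highlighting of the matched terms, which works quite nicely.
3) Finally, creating the index on a schedule:
Define the following beans: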
<bean id="myTask" class="edu.cumt.jnotnull.action.TaskAction">
    <property name="newsManageService">
        <ref bean="newsManageService" />
    </property>
</bean>
<bean id="entity" class="org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean">
    <property name="targetObject">
        <ref local="myTask" />
    </property>
    <property name="targetMethod">
        <value>createIndex</value>
    </property>
</bean>
<bean id="cron" class="org.springframework.scheduling.quartz.CronTriggerBean">
    <property name="jobDetail">
        <ref bean="entity" />
    </property>
    <property name="cronExpression">
        <value>0 0-5 2 * * ?</value>
    </property>
</bean>
<bean autowire="no" class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
    <property name="triggers">
        <list>
            <ref local="cron" />
        </list>
    </property>
</bean>
With this in place, indexing is triggered automatically at night. Note that the expression 0 0-5 2 * * ? fires once a minute from 02:00 to 02:05 every day, so createIndex runs six times per night; use 0 0 2 * * ? if a single run at 02:00 is enough.
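Related discussion:
Access control: creating the index should be something only an administrator can invoke. How can access control be enforced when the method is executed by the scheduler?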