Building site search for an SSH2 system with Lucene + Paoding

 

Source: http://jnotnull.iteye.com/blog/275327

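Goal: build a highly portable site search that updates its index on a schedule.
Approach: ship both the dictionary (dic) and the index inside the application.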

Preparation:
1 Lucene
Lucene Java (hereafter Lucene) is currently available as version 2.4.0. For details see http://lucene.apache.org/java/docs/index.html.


2 Paoding
An excellent Chinese word-segmentation component for Lucene, written by Qieqie. The current version is paoding-analysis-2.0.4-beta, which targets Lucene 2.2. For details see http://code.google.com/p/paoding/.


3 Download the latest paoding-analysis-2.0.4-beta package (it bundles lucene-core-2.2.0.jar, lucene-analyzers-2.2.0.jar, lucene-highlighter-2.2.0.jar, junit.jar, and commons-logging.jar).

 

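Getting started:
   1 Trial run
Open the examples folder in the download package and run the examples (watch the file encoding).


   2 Integrating into the SSH2 system (layering: Action -> service -> dao)
1) Because an SSH2 system is a web application, configuring Paoding differs somewhat from step 1.
Copy everything under Paoding's src folder, together with the dic folder, into your project. Open paoding-dic-home.properties and set paoding.dic.home.config-fisrt=this so that the program picks up this configuration file, then set paoding.dic.home=classpath:dic so that the dictionary lives inside the project, and save. I chose classpath:dic for portability. An absolute path needs nothing further, but if you specify classpath:dic you have to patch Paoding's code: in the setDicHomeProperties method of PaodingMaker.java, replace File dicHomeFile = getFile(dicHome); with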

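// Needs java.net.URLDecoder and java.io.UnsupportedEncodingException
File dicHomeFile2 = getFile(dicHome);
String path = "";
try {
    // Turn escapes such as %20 back into the real characters
    path = URLDecoder.decode(dicHomeFile2.getPath(), "UTF-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
File dicHomeFile = new File(path);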

 

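The purpose is to decode the path; otherwise a dictionary path that contains spaces or Chinese characters causes a dictionary-not-found exception.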
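As a minimal standalone sketch of that decoding step (not Paoding code): a path obtained from a classpath URL can carry percent-escapes such as %20 for a space, and URLDecoder turns them back into the real characters. The class name DicPathDecoder and the sample path are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of Paoding: shows what URL-decoding does to a path.
final class DicPathDecoder {

    // Decodes percent-escapes (for example %20 or %E8%AF%8D) using UTF-8.
    static String decode(String encodedPath) {
        try {
            return URLDecoder.decode(encodedPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every JVM is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Sample path (an assumption): a space and two Chinese characters, both escaped.
        String encoded = "/opt/my%20app/WEB-INF/classes/%E8%AF%8D%E5%85%B8/dic";
        System.out.println(decode(encoded));
    }
}
```

One caveat: URLDecoder follows form-encoding rules, so it also turns a literal + into a space; a directory name that really contains + would need different handling.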

2) Table structure

CREATE TABLE `news` (  
  `id` int(11) NOT NULL auto_increment,  
  `title` varchar(255) default NULL,  
  `details` mediumtext,  
  `author` varchar(255) default NULL,  
  `publisher` varchar(100) default NULL,  
  `clicks` int(11) default NULL,  
  `source` varchar(255) default NULL,  
  `addtime` datetime default NULL,  
  `category` varchar(100) default NULL,
  `keywords` varchar(255) default NULL,  
  PRIMARY KEY  (`id`)  
) ENGINE=InnoDB DEFAULT CHARSET=gbk;  

 

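  3 Writing the code
       Site search takes two steps: building the index and searching it. Classes needed: SearchAction.java and TaskAction.java (in the same directory).
1) Building the index
Main task: read from an existing txt file the id of the last news item indexed in the previous run, fetch from the business layer every news item with a larger id and index it, then write the id of the last item indexed in this run back to the txt file. Paths need care here: all the txt files that record ids live in the action directory.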
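The read, filter, and write-back cycle can be sketched without Lucene or the service layer. This is only an illustration under assumptions: the class IndexCheckpoint and its methods are hypothetical, plain integer ids stand in for news items, and newerThan stands in for the newsManageService.getNewsBigId query.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the "last indexed id" checkpoint; not the article's TaskAction.
final class IndexCheckpoint {

    // Returns the stored id, or 0 when the file is missing or blank.
    static int readLastId(Path file) {
        try {
            if (!Files.exists(file)) return 0;
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            return text.isEmpty() ? 0 : Integer.parseInt(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void writeLastId(Path file, int id) {
        try {
            Files.write(file, String.valueOf(id).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for the service query: keeps only ids greater than lastId.
    static List<Integer> newerThan(List<Integer> allIds, int lastId) {
        List<Integer> result = new ArrayList<Integer>();
        for (int id : allIds) {
            if (id > lastId) result.add(id);
        }
        return result;
    }

    // One pass: returns the ids that would be indexed and advances the checkpoint.
    static List<Integer> runOnce(Path file, List<Integer> allIds) {
        int lastId = readLastId(file);
        List<Integer> pending = newerThan(allIds, lastId);
        int max = lastId;
        for (int id : pending) max = Math.max(max, id);
        if (max != lastId) writeLastId(file, max);
        return pending;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("newsid", ".txt");
        List<Integer> ids = Arrays.asList(1, 2, 3);
        System.out.println(runOnce(file, ids)); // first pass: everything is new
        System.out.println(runOnce(file, ids)); // second pass: nothing left to index
    }
}
```

Taking the maximum id rather than the last one seen keeps the checkpoint correct even if the query does not return rows in ascending id order.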
Create TaskAction and add the following methods

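public void createIndex() {
    try {
        // Two arguments: where to build the index, and the file that holds
        // the id of the last news item indexed by the previous run
        createNewsIndex(getPath(TaskAction.class, "date/index/news"), "newsid.txt");
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public String getPath(Class clazz, String textName)
        throws IOException {
    // getResource(...).toString() looks like "file:/C:/..."; substring(6) strips the
    // leading "file:/", which suits Windows paths (on Unix it would also drop the root "/")
    String path = (URLDecoder.decode(
            clazz.getResource(textName).toString(), "UTF-8")).substring(6);
    return path;
}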
  
public void createNewsIndex(String path, String textName) throws Exception {
    // Id of the last news item indexed by the previous run ("0" if there is none)
    String newsId = readText(TaskAction.class, textName);
    if (null == newsId || "".equals(newsId))
        newsId = "0";

    // Use the Paoding Chinese analyzer
    Analyzer analyzer = new PaodingAnalyzer();
    FSDirectory directory = FSDirectory.getDirectory(path);
    System.out.println(directory.toString());
    // Third argument: true creates a fresh index, false appends to the existing one
    IndexWriter writer = new IndexWriter(directory, analyzer, isEmpty(TaskAction.class, textName));

    // Fetch from the business layer the news items whose id is greater than the stored id
    List list = newsManageService.getNewsBigId(Integer.parseInt(newsId));
    Iterator iterator = list.iterator();
    while (iterator.hasNext()) {
        Document doc = new Document();
        News news = (News) iterator.next();
        doc.add(new Field("id", "" + news.getId(), Field.Store.YES,
                Field.Index.UN_TOKENIZED));
        doc.add(new Field("title", "" + news.getTitle(), Field.Store.YES,
                Field.Index.TOKENIZED));
        doc.add(new Field("author", "" + news.getAuthor(), Field.Store.YES,
                Field.Index.TOKENIZED));
        doc.add(new Field("details", ""
                + Constants.splitAndFilterString(news.getDetails()),
                Field.Store.YES, Field.Index.TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("addtime", "" + news.getAddtime(),
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("keywords", "" + news.getKeywords(),
                Field.Store.YES, Field.Index.TOKENIZED));
        System.out.println("Indexing news " + news.getTitle() + "...");
        // Remember the id of the item just indexed so it can be written back below
        // (assumes getNewsBigId returns items in ascending id order)
        newsId = String.valueOf(news.getId());
        try {
            writer.addDocument(doc);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    // Optimize and close
    writer.optimize();
    writer.close();

    // Write the id of the last indexed news item back to the file
    WriteText(TaskAction.class, textName, newsId);
}
  
public boolean isEmpty(Class clazz, String textName) throws Exception {
    // The index counts as empty while the id file is blank or still holds "0"
    String articleId = readText(clazz, textName);
    if (null == articleId || "".equals(articleId))
        articleId = "0";
    boolean isEmpty = articleId.equals("0");
    System.out.println(clazz.getName() + " " + isEmpty);
    return isEmpty;
}
  
// This method is adapted from one in the Paoding examples.
public String readText(Class clazz, String textName)
        throws IOException {
    InputStream in = clazz.getResourceAsStream(textName);
    Reader re = new InputStreamReader(in, "UTF-8");
    try {
        char[] chs = new char[1024];
        int count;
        StringBuilder content = new StringBuilder();
        while ((count = re.read(chs)) != -1) {
            content.append(chs, 0, count);
        }
        return content.toString();
    } finally {
        // Release the stream even if reading fails
        re.close();
    }
}
  
public String WriteText(Class clazz, String textName, String text)
        throws IOException {
    // Same "file:/" stripping as in getPath (Windows-style paths)
    String path = (URLDecoder.decode(
            clazz.getResource(textName).toString(), "UTF-8")).substring(6);
    System.out.println(path);
    File file = new File(path);
    // Write as UTF-8 to match readText
    BufferedWriter bw = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
    try {
        bw.write(text);
    } finally {
        bw.close();
    }
    return text;
}
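
The substring(6) used in getPath and WriteText strips a leading "file:/" and therefore assumes Windows-style paths. One possible alternative, offered only as a sketch, is to let the JDK do the conversion through a URI, which also handles the percent-decoding. The class name ResourcePaths is hypothetical, and the approach only applies to resources that sit on disk, not inside a jar.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper: converts a file: URL (such as one returned by
// Class.getResource) into a File without hand-written prefix stripping.
final class ResourcePaths {

    static File toFile(URL url) {
        if (!"file".equals(url.getProtocol())) {
            // Resources inside a jar have no plain file path.
            throw new IllegalArgumentException("Not a file URL: " + url);
        }
        try {
            // File(URI) decodes escapes such as %20 and keeps the root of the path.
            return new File(url.toURI());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed URL: " + url, e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample location (an assumption): a directory name that contains a space.
        File original = new File(System.getProperty("java.io.tmpdir"), "my dir/newsid.txt");
        URL url = original.toURI().toURL();
        System.out.println(url);
        System.out.println(toFile(url).getPath());
    }
}
```

In TaskAction this would be called as ResourcePaths.toFile(clazz.getResource(textName)), assuming the resource exists so that getResource does not return null.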

 

2) Searching

public void searchIndex(String path, String keywords) throws Exception {
    // Fields to search
    String[] FIELD = { "title", "details" };

    Analyzer analyzer = new PaodingAnalyzer();
    FSDirectory directory = FSDirectory.getDirectory(path);
    IndexReader reader = IndexReader.open(directory);
    // SHOULD on both fields: a match in either title or details is enough
    BooleanClause.Occur[] flags = new BooleanClause.Occur[] {
            BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD };
    Query query = MultiFieldQueryParser.parse(keywords, FIELD, flags,
            analyzer);

    Searcher searcher = new IndexSearcher(directory);
    // Rewrite into primitive queries so the highlighter can extract the terms
    query = query.rewrite(reader);
    System.out.println("Searching for: " + query.toString());
    Hits hits = searcher.search(query);

    // Tags wrapped around each matched term; both are empty strings here, so set
    // them to the markup you want in order to get visible highlighting
    SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("", "");
    Highlighter highlighter = new Highlighter(simpleHTMLFormatter,
            new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(200));

    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        String contents1 = doc.get("details");

        // Reset per hit so a document without details does not reuse the previous fragment
        String highLightText = "";
        if (contents1 != null) {
            TokenStream tokenStream = analyzer.tokenStream("details",
                    new StringReader(contents1));
            highLightText = highlighter.getBestFragment(tokenStream,
                    contents1);
        }
        NewsDTO news = new NewsDTO();
        news.setId(Integer.parseInt(doc.get("id")));
        news.setName(doc.get("title"));
        news.setDetails(highLightText);
        news.setAddtime(doc.get("addtime"));
        news.setAuthor(doc.get("author"));
        // searchResultItem: the result list, declared elsewhere in the class (not shown)
        searchResultItem.add(news);
    }
    searcher.close();
    reader.close();
}

 

   The core code is now essentially complete, including hit highlighting, which works nicely.

3) Finally, build the index on a schedule:
   Define the beans

          
<bean id="myTask" class="edu.cumt.jnotnull.action.TaskAction">  
        <property name="newsManageService">  
            <ref bean="newsManageService" />  
        </property>  
    </bean>  
  
    <bean id="entity"  
        class="org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean">  
        <property name="targetObject">  
            <ref local="myTask" />  
        </property>  
        <property name="targetMethod">  
            <value>createIndex</value>  
        </property>  
    </bean>  
  
    <bean id="cron"  
        class="org.springframework.scheduling.quartz.CronTriggerBean">  
        <property name="jobDetail">  
            <ref bean="entity" />  
        </property>  
        <property name="cronExpression">  
            <value>0 0-5 2 * * ?</value>  
        </property>  
    </bean>  
  
    <bean autowire="no"  
        class="org.springframework.scheduling.quartz.SchedulerFactoryBean">  
        <property name="triggers">  
            <list>  
                <ref local="cron" />  
            </list>  
        </property>  
    </bean>  
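  With this configuration the indexing job is triggered automatically at night: the cron expression 0 0-5 2 * * ? fires createIndex once a minute from 02:00 to 02:05 every day.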



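Related discussion:
Permissions: building the index should be something only an administrator can invoke. How can access control be enforced when it runs on a schedule?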
