Building a Chinese Lucene Index from a Database with Solr

Solr is an enterprise search server based on the Lucene Java library that runs in a Servlet container.
1.  Download Solr: http://www.apache.org/dyn/closer.cgi/lucene/solr/
    The latest version at the time of writing is 1.4.
    Unzip it into a directory; call it solrpath.
2.  Add the handler
    Edit the solrconfig.xml file in the solrpath/example/solr/conf folder and add the following inside the config element:

   
    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
          <str name="config">data-config.xml</str>
        </lst>
    </requestHandler>
    



3.  Create a data-config.xml file in the same folder with the following content.
    The detail column is a CLOB, so ClobTransformer is needed; it is only available since version 1.4.
    The column name DETAIL in column="DETAIL" clob="true" must be uppercase (it has to match the column label the JDBC driver returns), otherwise the transformer won't take effect.

<dataConfig>
  <dataSource type="JdbcDataSource" 
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/dbname" 
              user="user-name" 
              password="password"/>
  <document>
    <entity name="myentyty" transformer="ClobTransformer"
        query="select id, title, detail from mytable">
        <field column="DETAIL" clob="true"/>
    </entity>
  </document>
</dataConfig>
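
For context, a DIH transformer receives each row from the data source as a map and may rewrite values before they are indexed; ClobTransformer essentially replaces the Clob object the JDBC driver returns with a plain String. Below is a simplified, hypothetical sketch of that idea, not the actual Solr source (SimpleClobTransformer is a made-up name and the error handling is illustrative). It also shows why the map key, and hence the column attribute, must match the uppercase label the driver reports:

import java.sql.Clob;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class SimpleClobTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        // the key comes from the JDBC result set, hence the uppercase "DETAIL"
        Object value = row.get("DETAIL");
        if (value instanceof Clob) {
            Clob clob = (Clob) value;
            try {
                // replace the Clob with its full string content so it can be indexed
                row.put("DETAIL", clob.getSubString(1, (int) clob.length()));
            } catch (java.sql.SQLException e) {
                // leave the value as-is if the CLOB cannot be read
            }
        }
        return row;
    }
}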



4.  Modify schema.xml: find <fieldType name="text" and change the analyzer to a Chinese one. A wrapped Paoding tokenizer is used here; that project no longer seems to be maintained, so IKAnalyzer may be worth a look later.

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">  
       <analyzer type="index">  
           <tokenizer class="net.paoding.analysis.analyzer.ChineseTokenizerFactory" mode="most-words"/>  
           <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>  
           <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>  
           <filter class="solr.LowerCaseFilterFactory"/>  
           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>  
       </analyzer>  
       <analyzer type="query">  
           <tokenizer class="net.paoding.analysis.analyzer.ChineseTokenizerFactory" mode="most-words"/>                  
           <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>  
           <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>  
           <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>  
           <filter class="solr.LowerCaseFilterFactory"/>  
           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>  
       </analyzer>  
   </fieldType>



   Add any fields that the stock schema.xml lacks. schema.xml is UTF-8 encoded by default, so Chinese comments must also be saved in UTF-8, otherwise an error is reported.

  
   <field name="detail" type="text" indexed="true" stored="true" />
   <!-- copy into the default search field; adjust as needed -->
   <copyField source="title" dest="text"/>
   <copyField source="detail" dest="text"/>



5.  Wrapping Paoding's tokenizer

package net.paoding.analysis.analyzer;

import java.io.Reader;
import java.util.Map;

import net.paoding.analysis.analyzer.impl.MaxWordLengthTokenCollector;
import net.paoding.analysis.analyzer.impl.MostWordsTokenCollector;
import net.paoding.analysis.knife.PaodingMaker;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * Created by IntelliJ IDEA. 
 * User: ronghao 
 * Date: 2007-11-3 
 * Time: 14:40:59
 * Chinese word segmentation: a wrapper around Paoding's tokenizer.
 */
public class ChineseTokenizerFactory extends BaseTokenizerFactory {
	/**
	 * Most-words segmentation (the default mode).
	 */
	public static final String MOST_WORDS_MODE = "most-words";
	/**
	 * Max-word-length segmentation.
	 */
	public static final String MAX_WORD_LENGTH_MODE = "max-word-length";

	private String mode = null;

	public void setMode(String mode) {
		if (mode == null || MOST_WORDS_MODE.equalsIgnoreCase(mode) || "default".equalsIgnoreCase(mode)) {
			this.mode = MOST_WORDS_MODE;
		} else if (MAX_WORD_LENGTH_MODE.equalsIgnoreCase(mode)) {
			this.mode = MAX_WORD_LENGTH_MODE;
		} else {
			throw new IllegalArgumentException("不合法的分析器Mode参数设置:" + mode);
		}
	}

	@Override
	public void init(Map<String, String> args) {
		super.init(args);
		setMode(args.get("mode"));
	}

	public Tokenizer create(Reader input) {
		return new PaodingTokenizer2(input, PaodingMaker.make(), createTokenCollector());
	}

	private TokenCollector createTokenCollector() {
		if (MOST_WORDS_MODE.equals(mode))
			return new MostWordsTokenCollector();
		if (MAX_WORD_LENGTH_MODE.equals(mode))
			return new MaxWordLengthTokenCollector();
		throw new Error("never happened");
	}
}
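
To sanity-check the wrapped factory outside of Solr, a small driver like the one below can be used. This is a hedged sketch against the Lucene 2.9 API bundled with Solr 1.4 (TokenizerSmokeTest and the sample sentence are made up; it assumes the Paoding jars and dictionaries are on the classpath):

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

import net.paoding.analysis.analyzer.ChineseTokenizerFactory;

public class TokenizerSmokeTest {
    public static void main(String[] args) throws Exception {
        // configure the factory the same way the schema.xml fieldType does
        ChineseTokenizerFactory factory = new ChineseTokenizerFactory();
        Map<String, String> params = new HashMap<String, String>();
        params.put("mode", "most-words");
        factory.init(params);

        // tokenize a sample sentence and print each token
        Tokenizer tokenizer = factory.create(new StringReader("中文分词测试"));
        TermAttribute term = tokenizer.addAttribute(TermAttribute.class);
        while (tokenizer.incrementToken()) {
            System.out.println(term.term());
        }
        tokenizer.close();
    }
}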


Because the interface has apparently changed in the newer version, the PaodingTokenizer class also needs a small change: it originally extended TokenStream and must now extend Tokenizer (whose constructor stores the Reader in the protected input field, so the subclass constructor should call super(input)):

public final class PaodingTokenizer2 extends Tokenizer implements Collector {



These two modified classes go into solr.war: create a classes folder under WEB-INF inside the archive and copy the two compiled classes, with their package hierarchy, into it.

6.  Put the Paoding jar and the JDBC driver into lib, then go to the solrpath/example directory and run
    java -jar start.jar
    In a browser, open
    http://localhost:8983/solr/dataimport?command=full-import
    Once the import succeeds, open
    http://localhost:8983/solr/admin/
    and enter a search term in the search box to query.
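
Documents can also be queried programmatically with SolrJ, which ships in Solr 1.4's dist directory. A minimal sketch, assuming the example server above is running and the schema from step 4 (the query string is a placeholder; it searches the default field that title and detail are copied into):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        // points at the example server started above
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        QueryResponse rsp = server.query(new SolrQuery("搜索关键词"));
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id") + " : " + doc.getFieldValue("title"));
        }
    }
}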

One remaining question is Paoding's dictionary files: a PAODING_DIC_HOME environment variable was set at first, but everything kept working after it was deleted. The reason is unclear; Paoding can also locate its dictionaries through a paoding-dic-home.properties file on the classpath, which may be what happened here.
