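Continuing from the previous post on integrating the ansj tokenizer, the next step is to build the index. Indexing comes in two flavors, full and incremental (delta). You can either let Solr connect to the database directly or push documents from your own program. This post takes the direct database route, which is the simpler of the two.
1. Edit multicore/new_core/conf/solrconfig.xml (the file mentioned in the previous post) and add: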
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
<requestHandler name="/deltaimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">delta-data-config.xml</str>
  </lst>
</requestHandler>
The first handler is dedicated to full indexing and the second to incremental indexing. Both are implemented by the DataImportHandler class.
2. Create multicore/new_core/conf/data-config.xml:
<dataConfig>
  <dataSource name="jdbc" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://192.168.0.81:3306/new_mall?zeroDateTimeBehavior=convertToNull&amp;characterEncoding=utf8&amp;useUnicode=true"
              user="root" password="HyS_Db@2014"/>
  <document name="mall_goods">
    <entity name="MallGoods" pk="id"
            query="select * from mall_goods limit ${dataimporter.request.length} offset ${dataimporter.request.offset}"
            transformer="RegexTransformer">
      <field column="goods_id" name="id" />
      <field column="title" name="title" />
      <field column="subtitle" name="subtitle" />
      <field column="cover_img_path" name="coverImgPath" />
      <field column="description" name="description" />
      <field column="update_date" name="updateDate" />
    </entity>
  </document>
</dataConfig>
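dataSource: the data source (JDBC connection) configuration; nothing special here.
entity: an entity within the document. Note that pk="id" cannot be changed arbitrarily: it has to match <uniqueKey>id</uniqueKey> in schema.xml, otherwise you get "org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id".
query: the SQL query. It can be paged, here through the length and offset request parameters.
transformer: I have not figured out what this is for yet. As far as I know, RegexTransformer only acts on fields that carry attributes such as regex or splitBy, and none of the fields here use them.
field: maps a database column (column) to a Solr field name (name).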
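To make the paged query more concrete, here is a minimal Python sketch of how the ${dataimporter.request.*} placeholders end up being filled from the request URL parameters. It only imitates the substitution for illustration and is not DataImportHandler's actual code; the function name resolve_query is made up for this example.

```python
import re

# Matches ${dataimporter.request.<name>} placeholders as written in data-config.xml.
PLACEHOLDER = re.compile(r"\$\{dataimporter\.request\.(\w+)\}")


def resolve_query(template, request_params):
    """Replace each request placeholder with the matching URL parameter value.

    Hypothetical helper: it mimics the substitution, nothing more.
    """
    def substitute(match):
        name = match.group(1)
        if name not in request_params:
            raise KeyError(f"missing request parameter: {name}")
        # Force an integer so that only numbers can reach LIMIT/OFFSET.
        return str(int(request_params[name]))

    return PLACEHOLDER.sub(substitute, template)


template = ("select * from mall_goods "
            "limit ${dataimporter.request.length} "
            "offset ${dataimporter.request.offset}")

# Parameters as they would arrive from ...?offset=0&length=100000
print(resolve_query(template, {"offset": "0", "length": "100000"}))
# prints: select * from mall_goods limit 100000 offset 0
```

If either parameter is left out of the URL, the sketch raises a KeyError, which is a reminder to always pass both offset and length with this config.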
3. Create multicore/new_core/conf/delta-data-config.xml:
<dataConfig>
  <dataSource name="jdbc" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://192.168.0.81:3306/new_mall?zeroDateTimeBehavior=convertToNull&amp;characterEncoding=utf8&amp;useUnicode=true"
              user="root" password="HyS_Db@2014"/>
  <document name="mall_goods">
    <entity name="MallGoods" pk="id"
            query="select * from mall_goods"
            deltaImportQuery="select * from mall_goods where goods_id='${dih.delta.id}'"
            deltaQuery="select goods_id as id from mall_goods where update_date > '${dih.last_index_time}'"
            transformer="RegexTransformer">
      <field column="goods_id" name="id" />
      <field column="title" name="title" />
      <field column="subtitle" name="subtitle" />
      <field column="cover_img_path" name="coverImgPath" />
      <field column="description" name="description" />
      <field column="update_date" name="updateDate" />
    </entity>
  </document>
</dataConfig>
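deltaQuery: returns the ids of the rows that have changed since the last import (${dih.last_index_time}).
deltaImportQuery: loads the full row for each of those ids (${dih.delta.id}).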
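The two-step logic behind deltaQuery and deltaImportQuery can be sketched in a few lines of Python. This is only a simulation under stated assumptions: it uses an in-memory SQLite table as a stand-in for the MySQL mall_goods table, the sample rows are invented, and the function delta_import is hypothetical rather than anything Solr provides.

```python
import sqlite3

# In-memory stand-in for the MySQL table; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mall_goods (goods_id text primary key, title text, update_date text)"
)
conn.executemany(
    "insert into mall_goods values (?, ?, ?)",
    [
        ("1", "old item", "2014-01-01 00:00:00"),
        ("2", "edited item", "2014-06-02 09:30:00"),
        ("3", "new item", "2014-06-03 12:00:00"),
    ],
)


def delta_import(conn, last_index_time):
    """Mimic a delta import: find changed ids, then load each full row."""
    # Step 1 (deltaQuery): ids of rows changed since the last import,
    # aliased to "id" so the column name matches pk="id".
    changed = conn.execute(
        "select goods_id as id from mall_goods where update_date > ?",
        (last_index_time,),
    ).fetchall()

    docs = []
    for (delta_id,) in changed:
        # Step 2 (deltaImportQuery): fetch the full row for one changed id.
        row = conn.execute(
            "select goods_id, title, update_date from mall_goods where goods_id = ?",
            (delta_id,),
        ).fetchone()
        # Map database columns to Solr field names, as the <field> tags do.
        docs.append({"id": row[0], "title": row[1], "updateDate": row[2]})
    return docs


docs = delta_import(conn, "2014-06-01 00:00:00")
print(sorted(d["id"] for d in docs))
```

Only the rows updated after the given timestamp come back, which is why the update_date column has to be maintained reliably on every write for incremental indexing to pick up changes.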
4. Edit multicore/new_core/conf/schema.xml and define the indexed fields:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="title" type="text_ansj" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="subtitle" type="text_ansj" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="coverImgPath" type="string" indexed="false" stored="true" required="true" multiValued="false" />
<field name="description" type="text_ansj" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="updateDate" type="text_ansj" indexed="true" stored="true" required="false" multiValued="false"/>
Note that the fields to be tokenized use the text_ansj type set up in the previous post.
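5. The Solr war may be missing some jars. Copy the MySQL JDBC driver jar and the jars under the dist directory of the Solr distribution into the Solr web application.
6. Run the import
Full: http://solr.xxxx.com:8082/new_core/dataimport?command=full-import&commit=true&clean=false&offset=0&length=100000 (indexes the first 100000 rows: offset 0, length 100000)
Incremental: http://solr.ehaoyao.com:8082/new_core/deltaimport?command=delta-import&entity=MallGoods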
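entity: the name of an entity tag under document in data-config.xml. Use this parameter to run one or more entities selectively; passing several entity parameters runs those entities together. If it is omitted, all entities are run.
clean: whether to delete the existing index before indexing starts. Default: true.
commit: whether to commit after indexing finishes. Default: true.
optimize: whether to optimize the index after indexing finishes. Default: true.
debug: whether to run in debug mode, which is intended for interactive development.
Note that debug mode does not commit automatically by default, so add the parameter commit=true if you want the changes committed.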
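Since the full import is paged through offset and length, a large table can be indexed in several batches. The sketch below only builds the request URLs and sends nothing. The host is the placeholder from the example above, while the function name full_import_urls and the row counts are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Placeholder host taken from the full-import example in this post.
BASE_URL = "http://solr.xxxx.com:8082/new_core/dataimport"


def full_import_urls(total_rows, batch_size):
    """Build one full-import URL per batch of rows (hypothetical helper)."""
    urls = []
    for offset in range(0, total_rows, batch_size):
        params = {
            "command": "full-import",
            "commit": "true",
            # clean=false keeps the documents indexed by earlier batches;
            # with the default (true) each batch would wipe the previous one.
            "clean": "false",
            "offset": offset,
            "length": batch_size,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


for url in full_import_urls(total_rows=250000, batch_size=100000):
    print(url)
```

In practice each batch should finish before the next request is sent. DataImportHandler offers command=status for checking whether an import is still running, so a real script would likely poll that between batches.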
Note: when running an incremental import, it is easy to hit the exception "deltaQuery has no column to resolve to declared primary key pk='id'".
The explanation I found for it (an English answer, quoted here with light cleanup):
"ID" must be used as it is in the 'deltaQuery' select statement, as in "select ID from ...". If the ID column has a different name in your database, use the 'as' keyword in the select statement. In my case I had 'studentID' as the primary key of the student table, so I used "select studentID as ID from ...".
The same applies to 'deletedPkQuery'.
At present it is working fine for me. Any update in the database is reflected in Solr as well.
So in delta-data-config.xml, make sure deltaQuery returns a column named exactly like the pk value (here: select goods_id as id).
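A quick way to catch this mistake before running the import is to check the config file itself. The following Python sketch parses a data-config document and reports, per entity, whether the deltaQuery select list exposes a column (or alias) named like pk. It is a rough heuristic under several assumptions: the select list is split naively on commas, the comparison is exact and case-sensitive, and the function name delta_query_exposes_pk is invented for this example.

```python
import re
import xml.etree.ElementTree as ET


def delta_query_exposes_pk(config_xml):
    """Return {entity name: bool}; True if deltaQuery selects a column named like pk."""
    root = ET.fromstring(config_xml)
    result = {}
    for entity in root.iter("entity"):
        pk = entity.get("pk")
        delta_query = entity.get("deltaQuery")
        if not pk or not delta_query:
            continue  # nothing to check for this entity

        # Take the text between "select" and the first "from".
        match = re.search(r"select\s+(.*?)\s+from\s", delta_query,
                          re.IGNORECASE | re.DOTALL)
        select_list = match.group(1) if match else ""

        exposed = set()
        for column in select_list.split(","):
            tokens = column.split()
            if tokens:
                # The last token is the alias ("x as id") or the column itself;
                # drop a table prefix such as "g." and MySQL backticks.
                exposed.add(tokens[-1].split(".")[-1].strip("`"))

        result[entity.get("name")] = pk in exposed
    return result


good = """<dataConfig><document>
  <entity name="MallGoods" pk="id"
          deltaQuery="select goods_id as id from mall_goods where update_date > '${dih.last_index_time}'"/>
</document></dataConfig>"""

# Same config without the alias, which is the setup that triggers the exception.
bad = good.replace("select goods_id as id", "select goods_id")

print(delta_query_exposes_pk(good))
print(delta_query_exposes_pk(bad))
```

A query using select * is reported as False here because the sketch cannot see the real column names, so treat a False result as a hint to look closer rather than as proof of an error.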
------------------------------------------------------------------------------------------------------------------------------
Addendum:
Sometimes you need to delete indexed data. You can do it like this:
http://xxxx/new_core/update/?stream.body=<delete><query>*:*</query></delete>&stream.contentType=text/xml;charset=utf-8&commit=true
new_core is the core whose index you want to delete.
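When this delete request is sent from a script rather than pasted into a browser, the XML body has to be URL-encoded. Here is a minimal Python sketch that builds such a URL. The host stays the same placeholder as above, and the function name delete_by_query_url is an assumption for illustration; the sketch builds the URL only and sends nothing.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape


def delete_by_query_url(base_url, core, query="*:*"):
    """Build a delete-by-query URL for one core (hypothetical helper)."""
    # Escape &, < and > so the query cannot break the XML message.
    body = f"<delete><query>{escape(query)}</query></delete>"
    params = {
        "stream.body": body,
        "stream.contentType": "text/xml;charset=utf-8",
        "commit": "true",
    }
    # urlencode percent-encodes the angle brackets, colon and asterisks.
    return f"{base_url}/{core}/update?{urlencode(params)}"


# "*:*" matches every document, so this URL would empty the whole core.
print(delete_by_query_url("http://xxxx", "new_core"))

# A narrower query deletes only the matching documents, e.g. a single id.
print(delete_by_query_url("http://xxxx", "new_core", query="id:123"))
```

Passing a narrower query instead of the default *:* is the safer habit, since the match-all form removes everything in the core in one request.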