ElasticSearch Tutorial: Installing the IK Analyzer Plugin

For the full ElasticSearch series, see: ElasticSearch Tutorial: Series Overview

Introduction

IK Analyzer is an open-source, lightweight Chinese word-segmentation toolkit written in Java. It started out as a segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with grammar-analysis algorithms. Since version 3.0, IK has evolved into a general-purpose Java segmentation component that is independent of Lucene, while still providing a default optimized implementation for Lucene. IK also implements a simple disambiguation algorithm for segmentation, marking its evolution from pure dictionary-based segmentation toward semantics-aware segmentation.

 

Prerequisites

1. This builds on the environment set up in the previous two posts. The IK plugin version must match your Elasticsearch version exactly, otherwise it will fail with an error; you can verify the running version with the command shown after this list.

2. Install Maven.
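
A quick way to check the running Elasticsearch version (assuming it listens on the default port 9200 on localhost):

curl http://localhost:9200
# the response JSON contains a "version" object whose "number" field
# must match the IK plugin version, e.g. "6.4.0"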

 

Download

git clone https://github.com/medcl/elasticsearch-analysis-ik.git

There are two ways to get the source: clone it directly with the git command above, or download it on a Windows machine, upload it to the server, and extract it there. Prebuilt packages are also available, as shown below.
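
If you want to skip building from source entirely, the project also publishes prebuilt zips on its GitHub releases page. A sketch, assuming the 6.4.0 release follows the usual URL pattern (verify the exact URL on the releases page first):

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.0/elasticsearch-analysis-ik-6.4.0.zip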

 

Build and Package

Once the source is downloaded as described above, package the project.

Run the following command (Maven must be installed first; that is not covered here, since plenty of guides are available online):

mvn package

When the build finishes, change into target/releases under the project directory and locate the zip package; in my case it is elasticsearch-analysis-ik-6.4.0.zip.

Copy that zip to /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik (the ik directory is one I created myself) and extract it:

unzip elasticsearch-analysis-ik-6.4.0.zip
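
Putting the copy-and-extract steps together, a minimal sequence assuming the paths used above:

# create the plugin directory and copy the freshly built zip into it
mkdir -p /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik
cp target/releases/elasticsearch-analysis-ik-6.4.0.zip /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik/
cd /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik
unzip elasticsearch-analysis-ik-6.4.0.zip
rm elasticsearch-analysis-ik-6.4.0.zip    # the zip is no longer needed once extracted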

Restart ElasticSearch

systemctl restart elasticsearch.service
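
After restarting, you can confirm that the plugin was loaded via the _cat plugins API:

curl http://localhost:9200/_cat/plugins
# the output should include a line listing analysis-ik and its version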

 

Testing the IK Analyzer

IK ships with two analyzers:
ik_max_word: splits the text at the finest granularity, producing as many terms as possible.
ik_smart: splits the text at the coarsest granularity; text already claimed by one term will not be reused by another.

 

More examples are available in the official test cases in the project's README (linked above).

 

Note: newer versions require the Content-Type request header to be set, otherwise the request fails with:

"error" : "Content-Type header [application/x-www-form-urlencoded] is not supported"

Also, newer versions no longer support the string mapping type; use text instead. Declaring a field as string produces the following error (the fix is shown after the stack trace):

org.elasticsearch.index.mapper.MapperParsingException: No handler for type [string] declared on field [content]
	at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:274) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:199) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.index.mapper.RootObjectMapper$TypeParser.parse(RootObjectMapper.java:131) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.index.mapper.DocumentMapperParser.parse(DocumentMapperParser.java:112) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.index.mapper.DocumentMapperParser.parse(DocumentMapperParser.java:92) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.index.mapper.MapperService.parse(MapperService.java:626) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:263) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:229) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0.jar:6.4.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
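
The fix is to map the field as text instead. A sketch following the plugin's README (localhost, the index name index, and the type fulltext are placeholders; adjust to your setup):

# create the test index
curl -XPUT http://localhost:9200/index
# map the content field as text, analyzed by IK
curl -XPOST http://localhost:9200/index/fulltext/_mapping -H 'Content-Type:application/json' -d'
{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"
        }
    }
}'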

 

The correct ik_smart request looks like this (paste it straight into your terminal, e.g. Xshell, and press Enter):

curl -H "Content-Type: application/json" 'http://XXX.xx.xx.xx:9200/index/_analyze?pretty=true' -d '
{
 "analyzer": "ik_smart",
 "text": "中华人民共和国万岁万岁万万岁"
}'

Response:

{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "万岁",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "万岁",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "万万岁",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

 


The correct ik_max_word request:

curl -H "Content-Type: application/json" 'http://XXX.XXX.xxx:9200/index/_analyze?pretty=true' -d '
{
 "analyzer": "ik_max_word",
 "text": "中华人民共和国万岁万岁万万岁"
}'

Response:

{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中华人民",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中华",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "华人",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "人民共和国",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "人民",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "共和国",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "共和",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "国",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "万岁",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "万",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "TYPE_CNUM",
      "position" : 10
    },
    {
      "token" : "岁",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "COUNT",
      "position" : 11
    },
    {
      "token" : "万岁",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "万",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "TYPE_CNUM",
      "position" : 13
    },
    {
      "token" : "岁",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "COUNT",
      "position" : 14
    },
    {
      "token" : "万万岁",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 15
    },
    {
      "token" : "万万",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "TYPE_CNUM",
      "position" : 16
    },
    {
      "token" : "万岁",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 17
    },
    {
      "token" : "岁",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "COUNT",
      "position" : 18
    }
  ]
}

Result Highlighting

The official GitHub repo also includes a highlighting example; see the link above.

curl -XPOST http://localhost:9200/index/fulltext/4 -H 'Content-Type:application/json' -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}'
curl -XPOST http://localhost:9200/index/fulltext/_search  -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}'

Response:

{
	"took": 8,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": 1,
		"max_score": 0.5753642,
		"hits": [{
			"_index": "index",
			"_type": "fulltext",
			"_id": "1",
			"_score": 0.5753642,
			"_source": {
				"content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
			},
			"highlight": {
				"content": ["驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"]
			}
		}]
	}
}

Extension Configuration File

IKAnalyzer.cfg.xml can be located at {conf}/analysis-ik/config/IKAnalyzer.cfg.xml or {plugins}/elasticsearch-analysis-ik-*/config/IKAnalyzer.cfg.xml; in other words, the file may be placed in either of those locations. (In my case it ships in /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik/config, together with the bundled extension dictionaries.) Its contents are as follows:




<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- users can configure their own extension dictionaries here -->
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<!-- users can configure their own extension stopword dictionaries here -->
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
	<!-- users can configure a remote extension dictionary here -->
	<entry key="remote_ext_dict">location</entry>
	<!-- users can configure a remote extension stopword dictionary here -->
	<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
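
The dictionary files referenced above are plain UTF-8 text with one term per line. A hypothetical custom/mydict.dic might look like this (the terms are only illustrative):

蓝瘦香菇
洪荒之力
小目标

Note that changes to the local ext_dict and ext_stopwords files only take effect after an Elasticsearch restart; only the remote dictionaries described next are hot-reloaded.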

 

Hot-Reloading the IK Dictionary

The plugin currently supports hot-reloading of the IK dictionary, enabled through the two remote entries in the configuration file shown above:

<!-- users can configure a remote extension dictionary here -->
<entry key="remote_ext_dict">location</entry>
<!-- users can configure a remote extension stopword dictionary here -->
<entry key="remote_ext_stopwords">location</entry>

Here location refers to a URL, e.g. http://yoursite.com/getCustomDict. The HTTP request only needs to satisfy the following two conditions for hot dictionary updates to work.

  1. The response must include two headers: Last-Modified and ETag. Both are strings; whenever either one changes, the plugin fetches the URL again and updates the dictionary.

  2. The response body must contain one term per line, using \n as the line separator.

Once both conditions are met, hot dictionary updates work without restarting the ES instance.

You can put the terms that need automatic updating in a UTF-8 encoded .txt file served by nginx or another simple HTTP server; when the .txt file changes, the server automatically returns updated Last-Modified and ETag headers on the next client request. A separate tool can then extract relevant terms from your business system and rewrite this .txt file. A sketch of this setup follows.
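
A minimal sketch, assuming nginx serves files from /usr/share/nginx/html and remote_ext_dict points at http://yoursite.com/hotwords.txt (both the path and the URL are hypothetical):

# append a new hot term to the file nginx serves (hypothetical path)
echo "新热词" >> /usr/share/nginx/html/hotwords.txt

# verify that the server returns the Last-Modified / ETag headers IK polls
curl -I http://yoursite.com/hotwords.txt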

 

 

 
