Elasticsearch 7.2.0
1. Define the ik + english analyzer
2. Define the ik + english + synonym analyzer
3. Define the english + pinyin analyzer
4. Use the analyzers
5. Test the analysis results
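One point to stress here: "index_patterns": ["*"] matches every index, so every index can use the analyzers defined in this template. Being available is not the same as being used by default, however (unless you give the template a higher priority). Only the contents of the template_default template are used by default. The same applies to the templates below.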
POST _template/ik_en_analyzer
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 0,
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik_en_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["en_stemmer"]
        }
      },
      "filter": {
        "en_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}
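To make the note about index_patterns and order more concrete, here is a rough Python model of how legacy templates are combined for a new index: every template whose pattern matches contributes its settings, and templates with a higher order override those with a lower one. This is only a sketch, not Elasticsearch's own code. The helper names deep_merge and effective_settings, the second template and the index name log-2019 are all made up for illustration, and Python's fnmatch only approximates Elasticsearch's wildcard matching.

```python
from fnmatch import fnmatch


def deep_merge(base, override):
    """Return a new dict; values from `override` win, nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def effective_settings(index_name, templates):
    """Merge the settings of every template whose pattern matches `index_name`.

    Templates are applied from the lowest `order` to the highest,
    so a higher order overrides a lower one.
    """
    matching = [
        t for t in templates
        if any(fnmatch(index_name, pattern) for pattern in t["index_patterns"])
    ]
    result = {}
    for template in sorted(matching, key=lambda t: t["order"]):
        result = deep_merge(result, template["settings"])
    return result


# Hypothetical, trimmed-down templates (not the full bodies from this post).
templates = [
    {
        "index_patterns": ["*"],
        "order": 0,
        "settings": {
            "number_of_shards": 2,
            "analysis": {"analyzer": {"ik_en_analyzer": {"tokenizer": "ik_max_word"}}},
        },
    },
    {
        "index_patterns": ["log-*"],
        "order": 1,
        "settings": {"number_of_shards": 1},
    },
]

settings = effective_settings("log-2019", templates)
print(settings["number_of_shards"])
print(sorted(settings["analysis"]["analyzer"]))
```

In this model the index log-2019 takes its shard count from the order 1 template but still sees ik_en_analyzer from the "*" template, which is the sense in which the analyzer is available to every index.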
Prepare the file in advance
Create synonym.txt in the elasticsearch/config/analysis directory (you need to create the analysis directory yourself).
synonym.txt:
马铃薯,土豆
番茄,西红柿
i,me,我
you,你
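As a rough illustration of what comma-separated lines like these mean, the Python sketch below treats every line as a group of equivalent terms and expands a token list so that each synonym lands at the same position as the original token. It is a simplified model, not the synonym filter itself: the function names parse_synonyms and expand are hypothetical, and explicit "=>" mappings and multi-word synonyms are not handled.

```python
def parse_synonyms(lines):
    """Map each term to the full group of equivalent terms it belongs to."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and comments
            continue
        terms = [term.strip() for term in line.split(",") if term.strip()]
        for term in terms:
            groups[term] = terms
    return groups


def expand(tokens, groups):
    """Return (position, term) pairs; synonyms share the position of the original token."""
    expanded = []
    for position, token in enumerate(tokens):
        # Emit the original token first, then the other members of its group.
        others = [term for term in groups.get(token, []) if term != token]
        for term in [token] + others:
            expanded.append((position, term))
    return expanded


synonym_lines = ["马铃薯,土豆", "番茄,西红柿", "i,me,我", "you,你"]
groups = parse_synonyms(synonym_lines)
print(expand(["土豆", "的", "果实"], groups))
```

With these lines, 土豆 expands to both 土豆 and 马铃薯 at position 0, while tokens without a synonym group pass through unchanged.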
Template definition
POST _template/template_synonym
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 0,
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik_en_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["en_stemmer", "interim_synonym"]
        }
      },
      "filter": {
        "en_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "interim_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}
In my opinion, the pinyin analyzer should not be combined with ik unless you have a specific need for it.
POST _template/template_pinyin
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 0,
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "en_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "pinyin",
          "filter": ["en_stemmer"]
        }
      },
      "filter": {
        "en_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}
The analyzers can now be used directly when defining an index.
Three fields are defined here, each using one of the analyzers defined above.
PUT test_analyzer
{
  "settings": {
    "number_of_shards": "1",
    "number_of_replicas": "0"
  },
  "mappings": {
    "properties": {
      "pinyin": {
        "type": "text",
        "analyzer": "en_pinyin_analyzer"
      },
      "ik": {
        "type": "text",
        "analyzer": "ik_en_analyzer"
      },
      "synonym": {
        "type": "text",
        "analyzer": "ik_en_synonym_analyzer"
      }
    }
  }
}
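Because the mapping refers to the analyzers only by name, a typo in a name is easy to make. The Python sketch below shows one possible way to cross-check a mapping against the analyzer names declared in the templates before sending anything to the cluster. The helpers defined_analyzers and unknown_analyzers are hypothetical, the template bodies are trimmed to the analyzer names, only top-level fields are inspected, and built-in analyzers such as standard would be reported unless you add them to the known set yourself.

```python
def defined_analyzers(templates):
    """Collect every analyzer name declared under settings.analysis.analyzer."""
    names = set()
    for template in templates:
        analysis = template.get("settings", {}).get("analysis", {})
        names.update(analysis.get("analyzer", {}))
    return names


def unknown_analyzers(mappings, known):
    """Return {field: analyzer} for top-level fields whose analyzer is not in `known`."""
    missing = {}
    for field, spec in mappings.get("properties", {}).items():
        analyzer = spec.get("analyzer")
        if analyzer is not None and analyzer not in known:
            missing[field] = analyzer
    return missing


# Trimmed copies of the three templates: only the analyzer names matter here.
templates = [
    {"settings": {"analysis": {"analyzer": {"ik_en_analyzer": {}}}}},
    {"settings": {"analysis": {"analyzer": {"ik_en_synonym_analyzer": {}}}}},
    {"settings": {"analysis": {"analyzer": {"en_pinyin_analyzer": {}}}}},
]
mappings = {
    "properties": {
        "pinyin": {"type": "text", "analyzer": "en_pinyin_analyzer"},
        "ik": {"type": "text", "analyzer": "ik_en_analyzer"},
        "synonym": {"type": "text", "analyzer": "ik_en_synonym_analyzer"},
    }
}

# An empty dict means every field refers to a declared analyzer.
print(unknown_analyzers(mappings, defined_analyzers(templates)))
```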
Test pinyin + english
GET test_analyzer/_analyze
{
  "field": "pinyin",
  "text": "赵本山"
}
GET test_analyzer/_analyze
{
  "field": "pinyin",
  "text": "zhaobenshan"
}
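The same requests can also be issued outside Kibana. Below is a minimal sketch using only Python's standard library. The address http://localhost:9200, the absence of authentication, and the helper names build_analyze_request and analyze are assumptions made for illustration, so adjust them to your own cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local node without authentication


def build_analyze_request(index, field, text, base_url=ES_URL):
    """Build (but do not send) the HTTP request for `<index>/_analyze`."""
    body = json.dumps({"field": field, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # _analyze accepts GET or POST; POST is used because there is a body
    )


def analyze(index, field, text):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(index, field, text)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


# Requires a running node with the test_analyzer index, so it is left commented out:
# print(analyze("test_analyzer", "pinyin", "赵本山"))
```

Keeping request construction separate from sending makes it possible to inspect the URL and body without a running cluster.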
Result for the first request (赵本山)
{
  "tokens" : [
    {
      "token" : "zhao",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zb",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ben",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "shan",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}
Result for the second request (zhaobenshan)
{
  "tokens" : [
    {
      "token" : "zhao",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zhaobenshan",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ben",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "shan",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}
Test ik + english
GET test_analyzer/_analyze
{
  "field": "ik",
  "text": "我有100个梨 I have 100 pears"
}
Result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "有",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "100",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "ARABIC",
      "position" : 2
    },
    {
      "token" : "个",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "COUNT",
      "position" : 3
    },
    {
      "token" : "梨",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "i",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "ENGLISH",
      "position" : 5
    },
    {
      "token" : "have",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "ENGLISH",
      "position" : 6
    },
    {
      "token" : "100",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "ARABIC",
      "position" : 7
    },
    {
      "token" : "pear", // the plural "s" has been removed by the stemmer
      "start_offset" : 19,
      "end_offset" : 24,
      "type" : "ENGLISH",
      "position" : 8
    }
  ]
}
Test synonyms
GET test_analyzer/_analyze
{
  "field": "synonym",
  "text": "土豆的果实是长在底下的"
}
Result: both 土豆 and 马铃薯 appear
{
  "tokens" : [
    {
      "token" : "土豆",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "马铃薯",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "的",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "果实",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "实是",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "长在",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "底下",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "的",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 6
    }
  ]
}
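Long _analyze responses are easier to read when the tokens are grouped by position, because synonyms share the position of the token they were derived from. The following Python sketch does that for a response that has already been parsed into a dict. The function name tokens_by_position is hypothetical, and the sample data is the first four tokens of the synonym result, copied by hand.

```python
from collections import defaultdict


def tokens_by_position(response):
    """Group the token strings of an _analyze response by their `position` value."""
    grouped = defaultdict(list)
    for token in response["tokens"]:
        grouped[token["position"]].append(token["token"])
    return dict(sorted(grouped.items()))


# First four tokens of the synonym result, copied by hand.
response = {
    "tokens": [
        {"token": "土豆", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
        {"token": "马铃薯", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 0},
        {"token": "的", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 1},
        {"token": "果实", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 2},
    ]
}

for position, terms in tokens_by_position(response).items():
    print(position, " | ".join(terms))
```

Both 土豆 and 马铃薯 end up under position 0, which is why a query for either term can match the same document.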