Text analysis is what lets Elasticsearch perform full-text search: the search returns all relevant results, not just exact matches. This mirrors everyday search-engine behavior; for example, searching for "蓉城" (Rongcheng, a nickname for Chengdu) also surfaces results about Chengdu, with "成都" (Chengdu) treated as an equivalent keyword. This effect is achieved through analyzers.
An analyzer is made up of three kinds of building blocks:
- character filters, which preprocess the raw text (for example, stripping HTML markup);
- a tokenizer, which splits the text stream into individual tokens;
- token filters, which add, modify, or remove tokens (for example, lowercasing or stop word removal).
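As a quick illustration of how the three parts combine, here is a minimal sketch (the index and analyzer names my_custom_index and my_custom_analyzer are made up for illustration) that chains the built-in html_strip character filter, the standard tokenizer, and the lowercase token filter:

PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}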
Elasticsearch ships with a default analyzer, the standard analyzer, which works out of the box and suits most use cases. You can also go beyond the default: choose a different built-in analyzer, or configure a custom one.
As a rule, the same field should go through the same analyzer at index time and at search time, so that query terms line up with the terms stored in the index.
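This index/search pairing is the default behavior: the analyzer mapping parameter applies to both phases unless you override the search side with search_analyzer. A minimal sketch of the mapping (the index name my_index and field name title are made up for illustration):

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard"
      }
    }
  }
}

To inspect what an analyzer actually produces for a given piece of text, Elasticsearch provides the _analyze API in the following forms: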
GET /_analyze
POST /_analyze
GET /<index>/_analyze
POST /<index>/_analyze
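The index-scoped forms can also resolve the analyzer from a field's mapping: pass field instead of analyzer. A sketch reusing the hypothetical my_index and title from above:

POST /my_index/_analyze
{
  "field": "title",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}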
The standard analyzer divides text into terms at the word boundaries defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, supports stop word removal, and works well for most languages.
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}
As the output shows, the standard analyzer does not remove stop words such as "the" by default, but it does lowercase QUICK. The analyzer accepts the following configuration parameters:
max_token_length
The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.
stopwords
A pre-defined stop word list such as _english_, or an array containing a list of stop words. Defaults to _none_.
stopwords_path
The path to a file containing stop words.
PUT person1
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
}
}
Explanation: this example creates an index person1 and configures an analyzer named my_english_analyzer. Its base type is standard (that is, it extends the standard analyzer), with the maximum token length set to 5 and the pre-defined _english_ stop word list enabled.
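Note that defining the analyzer under settings does not by itself apply it to any field; to use it at index time, reference it from a field mapping. A minimal sketch using the mapping API (the field name description is made up for illustration):

PUT /person1/_mapping
{
  "properties": {
    "description": {
      "type": "text",
      "analyzer": "my_english_analyzer"
    }
  }
}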
If you need to modify an analyzer on an existing index, use the update index settings API:
PUT /person1/_settings
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
}
}
Note: close the index with the _close API before making this change; otherwise the request fails with an error like:
Can’t update non dynamic settings [[index.analysis.analyzer.my_english_analyzer.stopwords, index.analysis.analyzer.my_english_analyzer.max_token_length, index.analysis.analyzer.my_english_analyzer.type]] for open indices [[person1/DQ3_0a2iRPGYrN7SLlxh3g]]
Once the settings are updated, reopen the index.
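Putting the whole workflow together, the sequence of calls looks like this:

POST /person1/_close

PUT /person1/_settings
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST /person1/_open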
Now test the customized analyzer again:
POST /person1/_analyze
{
"analyzer": "my_english_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The response, using our custom my_english_analyzer:
{
  "tokens": [
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumpe",
      "start_offset": 24,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "d",
      "start_offset": 29,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 11
    }
  ]
}
The result clearly differs from the plain standard analyzer: because max_token_length is 5, the six-character jumped is split into "jumpe" and "d", and the stop word "the" has been removed by the _english_ stop word list.
The simple analyzer breaks text into tokens at any non-letter character (digits, whitespace, hyphens, apostrophes, and so on), discards the non-letter characters, and lowercases the terms.
POST /_analyze
{
"analyzer": "simple",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The resulting terms:
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Note: this is the simplified form; the actual response contains full token objects like the earlier examples.
The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.
POST _analyze
{
"analyzer": "whitespace",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The resulting terms:
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
The stop analyzer is the same as the simple analyzer but adds support for removing stop words. It uses the _english_ stop word list by default.
POST _analyze
{
"analyzer": "stop",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The resulting terms:
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
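Like the standard analyzer, the stop analyzer accepts stopwords and stopwords_path parameters to customize the stop list. A minimal sketch (the index name stop_test and analyzer name my_stop_analyzer are made up for illustration):

PUT stop_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": [ "the", "over" ]
        }
      }
    }
  }
}

With this configuration, the sample sentence would keep words like "jumped" but drop "the" and "over".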
The keyword analyzer returns the entire input string as a single token.
POST _analyze
{
"analyzer": "keyword",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The resulting token:
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
The pattern analyzer splits text according to a supplied regular expression. The pattern defaults to \W+, which matches any run of non-word characters.
POST _analyze
{
"analyzer": "pattern",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The resulting terms:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
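The expression can be customized with the pattern parameter (along with flags and lowercase). A minimal sketch that splits comma-separated values (the index name pattern_test and analyzer name comma_analyzer are made up for illustration):

PUT pattern_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

With this analyzer, a value such as "a,b,c" is tokenized into [ a, b, c ].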
Elasticsearch also ships with analyzers aimed at specific languages:
arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch,
english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian,
irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian,
sorani, spanish, swedish, turkish, thai.
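These analyzers add language-specific handling such as stemming and stop word removal. For example, running the sample sentence through the english analyzer:

POST _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

should yield stemmed, stop-word-filtered terms roughly like [ 2, quick, brown, fox, jump, over, lazi, dog, bone ]; the exact output may vary by Elasticsearch version.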
Many languages are covered, but Chinese is not, which is why the examples so far have all been in English; Chinese analysis is covered later.