1. analysis and analyzer
Analysis is the process of converting full text into a stream of individual words (terms/tokens), also known as tokenization.
Analysis is carried out by an analyzer.
2. Built-in analyzers in ES
Standard Analyzer – the default; splits on word boundaries, lowercases terms
Simple Analyzer – splits on anything that is not a letter (symbols are discarded), lowercases terms
Stop Analyzer – lowercases and removes stop words (the, a, is, ...)
Whitespace Analyzer – splits on whitespace, does not lowercase
Keyword Analyzer – no tokenization; the whole input is emitted as a single term
Pattern Analyzer – splits with a regular expression, \W+ by default (non-word characters act as separators)
Language analyzers – analyzers for more than 30 common languages
Custom Analyzer – an analyzer you define yourself
The differences are easiest to see by running the same text through several of these analyzers, as sketched below.
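A minimal comparison sketch, using the _analyze API introduced in the next section (the sample sentence is arbitrary): standard lowercases and splits on punctuation and hyphens, while whitespace splits only on spaces and keeps the original case.
# Standard: splits on word boundaries and lowercases
GET /_analyze
{
  "analyzer":"standard",
  "text":"Quick-Brown Foxes, 2 QUICK foxes!"
}
# Whitespace: splits only on spaces and keeps case and punctuation
GET /_analyze
{
  "analyzer":"whitespace",
  "text":"Quick-Brown Foxes, 2 QUICK foxes!"
}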
3. The _analyze API
# Test by specifying an analyzer directly
GET /_analyze
{
"analyzer":"standard",
"text":"master elasticsearch!"
}
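For the request above, the response lists each emitted term with its offsets and position; abbreviated, it looks roughly like this:
{
  "tokens":[
    {"token":"master","start_offset":0,"end_offset":6,"type":"<ALPHANUM>","position":0},
    {"token":"elasticsearch","start_offset":7,"end_offset":20,"type":"<ALPHANUM>","position":1}
  ]
}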
# Test via a field of an index: the analyzer configured for that field is used (assumes books has a text field named title)
POST books/_analyze
{
"field":"standard",
"text":"master elasticsearch!"
}
# Test an ad-hoc combination of tokenizer and token filters
POST /_analyze
{
"tokenizer":"standard",
"filter":["lowercase"],
"text":"master elasticsearch!"
}
Chinese analyzers
# IK Chinese analyzer plugin
https://github.com/medcl/elasticsearch-analysis-ik
# THULAC, a Chinese lexical analyzer developed by Tsinghua University
https://github.com/thunlp/THULAC
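Assuming the IK plugin is installed, it registers two analyzers, ik_smart (coarse-grained) and ik_max_word (fine-grained), which can be tested with the same _analyze API:
# Requires the elasticsearch-analysis-ik plugin to be installed
POST /_analyze
{
  "analyzer":"ik_max_word",
  "text":"中华人民共和国国歌"
}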
Configuring a custom analyzer in the mapping
When the built-in analyzers do not meet your needs, you can define a custom analyzer by combining different components.
An analyzer is made up of three kinds of components (a combined example follows this list):
1. character filter
Pre-processes the raw text by adding, removing, or replacing characters; several can be configured as an array.
Built-in ones: html_strip (removes HTML tags), mapping (string replacement), pattern_replace (regex-based replacement)
2. tokenizer
Splits the text into individual terms.
3. token filter
Adds, modifies, or removes the terms produced by the tokenizer.
Built-in ones: lowercase, stop, synonym
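The three component types run in a fixed order: character filters first, then the tokenizer, then token filters. A minimal sketch chaining one of each in a single ad-hoc _analyze call:
# char_filter -> tokenizer -> token filter, applied in that order
POST _analyze
{
  "char_filter":["html_strip"],
  "tokenizer":"standard",
  "filter":["lowercase","stop"],
  "text":"<b>The QUICK Brown Foxes</b>"
}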
# Strip HTML tags
POST _analyze
{
"tokenizer":"keyword",
"char_filter":["html_strip"],
"text":"hello world"
}
# Character mapping: replace - with _
POST _analyze
{
"tokenizer":"standard",
"char_filter":[
{
"type":"mapping",
"mapping":["- => _"]
}
],
"text":"123-456-789,i-love-u"
}
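Without the char filter, the standard tokenizer would split 123-456-789 into three separate terms; with the dashes rewritten to underscores first, the output should be roughly 123_456_789 and i_love_u.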
# Replace emoticons with words
POST _analyze
{
"tokenizer":"standard",
"char_filter":[
{
"type":"mapping",
"mapping":[":) => happy"]
}
],
"text":"i am felling :),i-love-u"
}
# Regex replace: strip the URL scheme
POST _analyze
{
"tokenizer":"standard",
"char_filter":[
{
"type":"pattern_replace",
"pattern":"http://(.*)",
"replacement":"$1"
}
],
"text":"http://www.elastic.co"
}
# Path hierarchy tokenizer: emits one term per directory level
POST _analyze
{
"tokenizer":"path_hierarchy",
"text":"/user/ymruan/a/b/c/d/e"
}
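Each directory level becomes its own term, so the output should be roughly /user, /user/ymruan, /user/ymruan/a, and so on up to the full path /user/ymruan/a/b/c/d/e.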
# Stop-word filtering
POST _analyze
{
"tokenizer":"whitespace",
"filter":["lowercase","stop"],
"text":["The rain in spain falls mainly on the plain."]
}
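Note that the lowercase filter runs before stop, so the leading The is lowercased and then removed along with in and on; and because the whitespace tokenizer does not strip punctuation, the final term keeps its trailing period (plain.).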
Define the custom analyzer components in the index settings
PUT my_index
{
  "settings":{
    "analysis":{
      "analyzer":{
        "my_custom_analyzer":{
          "type":"custom",
          "char_filter":["emoticons"],
          "tokenizer":"punctuation",
          "filter":["lowercase","english_stop"]
        }
      },
      "tokenizer":{
        "punctuation":{
          "type":"pattern",
          "pattern":"[ .,!?]"
        }
      },
      "char_filter":{
        "emoticons":{
          "type":"mapping",
          "mappings":[
            ":) => happy"
          ]
        }
      },
      "filter":{
        "english_stop":{
          "type":"stop",
          "stopwords":"_english_"
        }
      }
    }
  }
}
# Test the custom analyzer defined above
POST my_index/_analyze
{
"analyzer":"my_custom_analyzer",
"text":"i am a :) person,and you ?"
}
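The heading above mentions the mapping: to apply the custom analyzer at index time, reference it from a text field in the index mappings. A minimal sketch, assuming a hypothetical field named content:
# Attach the custom analyzer to a (hypothetical) content field
PUT my_index/_mapping
{
  "properties":{
    "content":{
      "type":"text",
      "analyzer":"my_custom_analyzer"
    }
  }
}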