Analysis - 文本分析是把全文本转换成一系列单词(term/token)的过程,也叫分词
Analysis 是通过 Analyzer 来实现的
除了在数据写入时转换词条,匹配 Query 语句时候也需要用相同的分析器对查询语句进行分析
分词器是专门处理分词的组件,由三部分组成
例子:
将 Mastering Elasticsearch & Elasticsearch in Action
经过上面的步骤就会产生
直接指定 Analyzer 进行测试
指令
GET /_analyze
{
"analyzer":"standard",
"text":"Mastering Elasticsearch & Elasticsearch in Action"
}
结果
{
"tokens" : [
{
"token" : "mastering",
"start_offset" : 0,
"end_offset" : 9,
"type" : "",
"position" : 0
},
{
"token" : "elasticsearch",
"start_offset" : 10,
"end_offset" : 23,
"type" : "",
"position" : 1
},
{
"token" : "elasticsearch",
"start_offset" : 25,
"end_offset" : 38,
"type" : "",
"position" : 2
},
{
"token" : "in",
"start_offset" : 39,
"end_offset" : 41,
"type" : "",
"position" : 3
},
{
"token" : "action",
"start_offset" : 42,
"end_offset" : 48,
"type" : "",
"position" : 4
}
]
}
指定索引的字段进行测试
指令
POST test_home/_analyze
{
"field": "job_name",
"text":"Mastering Elasticsearch, Elasticsearch in Action"
}
结果
{
"tokens" : [
{
"token" : "mastering",
"start_offset" : 0,
"end_offset" : 9,
"type" : "",
"position" : 0
},
{
"token" : "elasticsearch",
"start_offset" : 10,
"end_offset" : 23,
"type" : "",
"position" : 1
},
{
"token" : "elasticsearch",
"start_offset" : 25,
"end_offset" : 38,
"type" : "",
"position" : 2
},
{
"token" : "in",
"start_offset" : 39,
"end_offset" : 41,
"type" : "",
"position" : 3
},
{
"token" : "action",
"start_offset" : 42,
"end_offset" : 48,
"type" : "",
"position" : 4
}
]
}
自定义分词进行测试
指令
POST _analyze
{
"tokenizer": "standard",
"filter": ["lowercase"],
"text":"Mastering Elasticsearch, Elasticsearch in Action"
}
结果
{
"tokens" : [
{
"token" : "mastering",
"start_offset" : 0,
"end_offset" : 9,
"type" : "",
"position" : 0
},
{
"token" : "elasticsearch",
"start_offset" : 10,
"end_offset" : 23,
"type" : "",
"position" : 1
},
{
"token" : "elasticsearch",
"start_offset" : 25,
"end_offset" : 38,
"type" : "",
"position" : 2
},
{
"token" : "in",
"start_offset" : 39,
"end_offset" : 41,
"type" : "",
"position" : 3
},
{
"token" : "action",
"start_offset" : 42,
"end_offset" : 48,
"type" : "",
"position" : 4
}
]
}
中文句子,切分成一个个词(不是一个个字)
英文字,单词有自然地空格进行切分
一句中文,在不同的上下文,有不同的理解
推荐的一些中文分词器:
极客时间 ES 学习笔记