An analyzer is made up of three parts, applied in order (a combined example follows this list):
1. Character Filter: pre-processes the raw text, e.g. stripping HTML markup.
2. Tokenizer: splits the text into individual terms according to a set of rules.
3. Token Filter: further processes the terms produced in step 2, e.g. lowercasing or removing stop words.
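A minimal sketch combining all three stages in a single _analyze request; html_strip, standard, and lowercase are built-in components, and the sample text here is just for illustration:
POST _analyze
{
  "char_filter": ["html_strip"], // 1. strip HTML markup
  "tokenizer": "standard",       // 2. split into terms
  "filter": ["lowercase"],       // 3. lowercase each term
  "text": "<b>Hello</b> World"
}
Expected tokens: hello, world.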
Request:
POST _analyze
{
  "analyzer": "standard", // the analyzer to use; standard is the default
  "text": "hello world"   // the text to analyze
}
Response (the text is split into two tokens):
{
  "tokens": [ // analysis result
    {
      "token": "hello",    // the term
      "start_offset": 0,   // start offset in the original text
      "end_offset": 5,     // end offset in the original text
      "type": "<ALPHANUM>",
      "position": 0        // position of the term in the token stream
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Request (analyze using a field's analyzer):
POST /test_index/_analyze // test_index is the index to test against; if it does not exist, an error is returned
{
  "field": "name", // the field whose analyzer should be used
  "text": "hello world"
}
Response:
{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"], // ad-hoc analysis chain: the lowercase token filter converts every term to lowercase
  "text": "heLLo World"
}
POST _analyze
{
  "analyzer": "standard", // the standard analyzer: splits on word boundaries and lowercases
  "text": "Hello World"   // the text to analyze
}
Returns: hello, world
POST _analyze
{
  "analyzer": "simple",
  "text": "heLLo World 2018"
}
Returns: hello, world. The simple analyzer splits on anything that is not a letter and lowercases, so digits such as 2018 (and underscores, etc.) are dropped.
POST _analyze
{
  "analyzer": "whitespace",
  "text": "heLLo World 2018"
}
Returns: heLLo, World, 2018. The whitespace analyzer only splits on whitespace; it keeps digits and does not change case.
POST _analyze
{
  "analyzer": "stop",
  "text": "the heLLo World 2018"
}
Returns: hello, world. The stop analyzer behaves like simple but also removes stop words such as "the".
POST _analyze
{
  "analyzer": "keyword",
  "text": "the heLLo World 2018"
}
Returns (the keyword analyzer emits the whole input as a single token):
{
  "tokens": [
    {
      "token": "the heLLo World 2018",
      "start_offset": 0,
      "end_offset": 21,
      "type": "word",
      "position": 0
    }
  ]
}
POST _analyze
{
  "analyzer": "pattern", // splits on a regular expression (\W+, i.e. non-word characters, by default) and lowercases
  "text": "the heLLo 'World -2018"
}
Returns:
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "hello",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "world",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 2
    },
    {
      "token": "2018",
      "start_offset": 19,
      "end_offset": 23,
      "type": "word",
      "position": 3
    }
  ]
}
Chinese text has no spaces between words, so the same sentence can be segmented in different ways; for example, 今天民政局发放女朋友 may come out as:
今天/民政局/发/放女朋友
今天/民政/局/发放/女/朋友
The IK analysis plugin is a common choice for Chinese (usage sketch below):
1. Segments both Chinese and English text, with ik_smart and ik_max_word modes.
2. Supports custom dictionaries and hot reloading of the segmentation dictionary.
https://github.com/medcl/elasticsearch-analysis-ik
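Assuming the IK plugin is installed, the two modes can be compared directly with _analyze; this is only a sketch, and the exact tokens depend on the plugin version and its dictionary:
POST _analyze
{
  "analyzer": "ik_smart",    // coarse-grained: fewer, longer tokens
  "text": "今天民政局发放女朋友"
}

POST _analyze
{
  "analyzer": "ik_max_word", // fine-grained: emits every dictionary word it can find
  "text": "今天民政局发放女朋友"
}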
Example:
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip" // strips HTML tags from the text before tokenization
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
Example:
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "www.baidu.com [email protected] hello world"
}
Result:
{
  "tokens": [
    {
      "token": "www.baidu.com",
      "start_offset": 0,
      "end_offset": 13,
      "type": "<URL>",
      "position": 0
    },
    {
      "token": "[email protected]",
      "start_offset": 14,
      "end_offset": 25,
      "type": "<EMAIL>",
      "position": 1
    },
    {
      "token": "hello",
      "start_offset": 26,
      "end_offset": 31,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "world",
      "start_offset": 32,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
POST _analyze
{
  "tokenizer": "ngram", // ngram emits grams starting at every position; edge_ngram only emits grams anchored at the start of each token (see the sketch after this result)
  "text": "你好"
}
Result:
{
  "tokens": [
    {
      "token": "你",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "你好",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "好",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 2
    }
  ]
}
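For comparison, a minimal edge_ngram sketch using the built-in edge_ngram tokenizer on the same text:
POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "你好"
}
With its default gram lengths (1 to 2) this should yield 你 and 你好, but not 好, since every gram must start at the beginning of the token.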
Example:
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/baidu/com"
}
Result:
{
  "tokens": [
    {
      "token": "/baidu",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "/baidu/com",
      "start_offset": 0,
      "end_offset": 10,
      "type": "word",
      "position": 0
    }
  ]
}
Elasticsearch ships with many built-in token filters; several can be specified at once and they run in the order listed.
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "stop",      // remove stop words
    "lowercase", // lowercase the remaining terms
    {
      "type": "ngram",
      "min_gram": 4, // minimum gram length: 4
      "max_gram": 4  // maximum gram length: 4
    }
  ],
  "text": "a hello world"
}
Custom analyzers must be defined in the index settings.
Structure:
PUT test_index
{
  "settings": {
    "analysis": {
      "char_filter": {},
      "tokenizer": {},
      "filter": {},
      "analyzer": {}
    }
  }
}
Create your own analyzer:
PUT test_index1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_first_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
Verify the custom analyzer:
POST /test_index1/_analyze
{
  "analyzer": "my_first_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
Result:
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 3,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "so",
      "start_offset": 12,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "happy",
      "start_offset": 18,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
The index-time analyzer is set per field via the analyzer property in the mapping; if it is not specified, standard is used by default. The demo below sets a whitespace analyzer on the title field, and the field-level _analyze request after it verifies the setting.
Demo:
PUT test_index2
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "whitespace" // analyzer used when indexing this field
        }
      }
    }
  }
}
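A small sketch against the test_index2 mapping above, reusing the field-based _analyze request shown earlier to confirm which analyzer the field actually uses:
POST /test_index2/_analyze
{
  "field": "title", // uses the analyzer configured for title (whitespace)
  "text": "heLLo World 2018"
}
Expected tokens: heLLo, World, 2018, since the whitespace analyzer does not lowercase.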
1. Specify the search analyzer via the analyzer parameter in the query:
POST /test_index/_search
{
  "query": {
    "match": {
      "message": {
        "query": "hello",
        "analyzer": "standard" // analyzer applied to the query string
      }
    }
  }
}
2. Set search_analyzer in the index mapping:
PUT /test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "whitespace",     // analyzer used at index time
          "search_analyzer": "standard" // analyzer used at search time
        }
      }
    }
  }
}
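A quick way to see the effect of the two settings above is to analyze the same text through both paths; this is only a sketch against the test_index mapping just defined:
// index-time view: the field-based _analyze uses the field's analyzer (whitespace)
POST /test_index/_analyze
{
  "field": "title",
  "text": "heLLo World"
}
// expected tokens: heLLo, World

// search-time view: match queries on title are analyzed with the standard analyzer
POST /test_index/_analyze
{
  "analyzer": "standard",
  "text": "heLLo World"
}
// expected tokens: hello, world
Because whitespace preserves case while standard lowercases, a search for "hello" would not find the term "heLLo" indexed from this text; in practice the index analyzer and search_analyzer are usually chosen to produce compatible terms.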
https://www.elastic.co/guide/index.html