Elasticsearch Analysis (分词)

Analysis

The process of converting text into a sequence of words; each resulting word is called a term or token.

Underlying principle: the inverted index.
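A minimal illustration (the documents and terms below are invented for this example): the inverted index maps each term to the documents containing it, which is what makes term lookups fast.

doc1: "hello world"
doc2: "hello es"

term  -> documents
hello -> [doc1, doc2]
world -> [doc1]
es    -> [doc2]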

Components of an analyzer, in invocation order:

1. Character Filter: preprocesses the raw text, e.g. stripping HTML markup
2. Tokenizer: splits the filtered text into individual terms according to certain rules
3. Token Filter: post-processes the terms produced in step 2, e.g. lowercasing (see the combined example below)
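A minimal sketch of the whole pipeline in a single _analyze call; html_strip, standard and lowercase are all built-in components, and the three parameters are applied in exactly the order described above:

POST _analyze
{
  "char_filter": ["html_strip"],   // 1. strip HTML tags from the raw text
  "tokenizer": "standard",         // 2. split the remaining text into terms
  "filter": ["lowercase"],         // 3. lowercase each term
  "text": "<b>Hello</b> World"
}

This should return the terms hello and world.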

1. Testing with a named analyzer

Request:

POST _analyze
{
  "analyzer": "standard",    //指定默认的分词器
  "text": "hello world"      //分词文本
}

Response:

The text is split into two terms.

{
  "tokens": [                //分词结果
    {
      "token": "hello",     //
      "start_offset": 0,     //起始偏移量
      "end_offset": 5,      //结束偏移量
      "type": "",
      "position": 0         //位置
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "",
      "position": 1
    }
  ]
}

2. Testing against a field

Request

POST /test_index/_analyze        // test_index is the index to test against; the request fails if the index does not exist
{
  "field": "name",              // field whose analyzer is tested
  "text": "hello world"
}

Response

{
  "tokens": [                //分词结果
    {
      "token": "hello",     //
      "start_offset": 0,     //起始偏移量
      "end_offset": 5,      //结束偏移量
      "type": "",
      "position": 0         //位置
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "",
      "position": 1
    }
  ]
}

3. Testing a custom combination of components

POST _analyze
{
  "tokenizer": "standard",     
  "filter": ["lowercase"],    //自定义的分词器,全部转化成小写
  "text": "heLLo World"
}
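Run as-is, this should come back with the two lowercase terms hello and world.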

Built-in Elasticsearch analyzers

  • Standard

    • Splits on word boundaries; supports most languages. Tokenizer: standard
    • Lowercases terms. Token Filters: standard, lowercase, stop (disabled by default)
POST _analyze
{
  "analyzer": "standard",    //指定默认的分词器
  "text": "Hello World"      //分词文本
}
Returns the terms hello and world.
  • Simple

    • Splits on any non-letter character
    • Lowercases terms. Tokenizer: lowercase
POST _analyze
{
  "analyzer": "simple",
  "text": "heLLo World  2018"
}
The result contains only letter-based tokens (hello, world); the digits and everything else that is not a letter are dropped.
  • Whitespace

    • Splits on whitespace only. Tokenizer: whitespace
POST _analyze
{
  "analyzer": "whitespace",
  "text": "heLLo World  2018"
}
Result: heLLo, World, 2018; the tokens keep their original case (no lowercasing).
  • Stop

    • Stop words are function words and other filler, e.g. the, an, 的, 这
    • Same as simple, with stop-word removal added. Token Filter: stop
POST _analyze
{
  "analyzer": "stop",
  "text": "the heLLo World  2018"
}
Returns hello and world: the stop word the is removed and, as with simple, the digits are dropped.
  • Keyword

    • Does not tokenize at all; use it when you do not want the text analyzed
POST _analyze
{
  "analyzer": "keyword",
  "text": "the heLLo World  2018"
}
Returns
{
  "tokens": [
    {
      "token": "the heLLo World  2018",
      "start_offset": 0,
      "end_offset": 21,
      "type": "word",
      "position": 0
    }
  ]
}
  • Pattern

    • Splits using a regular expression. Tokenizer: pattern. Token Filters: lowercase, stop (disabled by default)
    • The default pattern is \W+, i.e. any run of non-word characters acts as a separator
POST _analyze
{
  "analyzer": "pattern",
  "text": "the heLLo 'World  -2018"
}
Returns
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "hello",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "world",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 2
    },
    {
      "token": "2018",
      "start_offset": 19,
      "end_offset": 23,
      "type": "word",
      "position": 3
    }
  ]
}
  • Language

    • Elasticsearch ships analyzers for 30+ common languages, such as english and french (see the example below)
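As a quick illustration, the english analyzer removes English stop words and stems the remaining terms; the exact output depends on the bundled stemmer, but terms roughly like quick, fox, jump are expected:

POST _analyze
{
  "analyzer": "english",
  "text": "The quick foxes are jumping"
}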

4. Chinese word segmentation

  • Difficulties
    • Chinese segmentation splits a sentence into individual words. English words are naturally delimited by spaces, but written Chinese has no explicit word separator.
    • The same text can be segmented differently depending on context (cross ambiguity). For example, both of the following segmentations of 今天民政局发放女朋友 are plausible:
      今天/民政局/发放/女朋友
      今天/民政局/发/放女朋友

4.1 The IK analyzer

1. Segments both Chinese and English text; supports the ik_smart and ik_max_word modes
2. Supports custom dictionaries and hot reloading of the word list

https://github.com/medcl/elasticsearch-analysis-ik
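A sketch of testing IK once the plugin above is installed (without the plugin Elasticsearch rejects the unknown analyzer name, and the exact terms depend on the bundled dictionary):

POST _analyze
{
  "analyzer": "ik_max_word",   // finest-grained segmentation; ik_smart produces a coarser split
  "text": "今天民政局发放女朋友"
}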

5. Custom analysis

  • When the built-in analyzers cannot meet your needs, you can define your own.
  • A custom analyzer is assembled from Character Filters, a Tokenizer and Token Filters.

5.1 Character Filters

example

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"      //去掉html中的符号
  ],
  "text": "

I'm so happy!

"
}
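Because the tokenizer here is keyword, the whole stripped text comes back as a single token; html_strip replaces the block-level <p> tags with newlines, so the result should look roughly like:

{
  "tokens": [
    {
      "token": "\nI'm so happy!\n",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}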
  • Built-in tokenizers:
  1. standard: splits on standard word boundaries
  2. letter: splits on any non-letter character
  3. whitespace: splits on whitespace
  4. uax_url_email: like standard, but keeps URLs and e-mail addresses as single tokens
example:
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "www.baidu.com [email protected] hello world"
}

result:
{
  "tokens": [
    {
      "token": "www.baidu.com",
      "start_offset": 0,
      "end_offset": 13,
      "type": "",
      "position": 0
    },
    {
      "token": "[email protected]",
      "start_offset": 14,
      "end_offset": 25,
      "type": "",
      "position": 1
    },
    {
      "token": "hello",
      "start_offset": 26,
      "end_offset": 31,
      "type": "",
      "position": 2
    },
    {
      "token": "world",
      "start_offset": 32,
      "end_offset": 37,
      "type": "",
      "position": 3
    }
  ]
}
  5. ngram and edge_ngram: n-gram splitting, as used for search-as-you-type on Baidu, Google, etc. (an edge_ngram comparison follows the result below)
POST _analyze
{
  "tokenizer": "ngram",   //会一次查询出来你每个字后面的词,edge_ngram只会查询出第一个词后面的词
  "text": "你好"
}

{
  "tokens": [
    {
      "token": "",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "你好",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 2
    }
  ]
}
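For comparison, the same text through the edge_ngram tokenizer (with its default min_gram of 1 and max_gram of 2) should emit only the grams anchored at the start of the text, i.e. 你 and 你好:

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "你好"
}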

  6. path_hierarchy: splits a file-system-style path into its hierarchy levels
example:
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/baidu/com"
}

result:
{
  "tokens": [
    {
      "token": "/baidu",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "/baidu/com",
      "start_offset": 0,
      "end_offset": 10,
      "type": "word",
      "position": 0
    }
  ]
}

5.2 Token Filter

  • Operates on the terms emitted by the tokenizer: adding, removing or modifying them

Built-in token filters; several can be chained and they run in the order listed:

  1. lowercase: lowercases every term
  2. stop: removes stop words
  3. ngram and edge_ngram: split terms into n-grams
  4. synonym: adds synonym terms
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "stop",
    "lowercase",
    {
      "type":"ngram",
      "min_gram":4,    最小四个
      "max_gram":4      最大四个
    }
    ], 
  "text": "a hello world"
}
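Tracing this chain by hand: standard splits the text into a, hello and world; stop removes a; lowercase changes nothing here; and the 4-gram filter then cuts each remaining term into 4-character slices, so the expected terms are roughly:

hell, ello, worl, orld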

5.3 Custom analyzers

A custom analyzer has to be defined in the settings of an index.

Structure:
PUT test_index
{
  "settings": {
    "analysis": {
      "char_filter": {}, 
      "tokenizer": {},
      "filter": {},
      "analyzer": {}
    }
  }
}

Creating your own analyzer:

PUT test_index1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_first_analyzer":{
          "type":"custom",
          "tokenizer":"standard",
          "char_filter":[
              "html_strip"
            ],
          "filter":[
              "lowercase",
              "asciifolding"
            ]
        }
      }
    }
  }
}

Verifying the custom analyzer:

POST /test_index1/_analyze
{
  "analyzer": "my_first_analyzer",
  "text": "

I'm so happy!

"
} 结果 { "tokens": [ { "token": "i'm", "start_offset": 3, "end_offset": 11, "type": "", "position": 0 }, { "token": "so", "start_offset": 12, "end_offset": 14, "type": "", "position": 1 }, { "token": "happy", "start_offset": 18, "end_offset": 27, "type": "", "position": 2 } ] }

6. How analysis is used

  • At index time, when a document is created or updated, its text fields are analyzed
  • At search time, the query string is analyzed

6.1 Index-time analysis

Index-time analysis is configured through the analyzer property of each field in the index mapping; when none is specified, the standard analyzer is used.

demo

PUT test_index2
{
  "mappings": {
    "doc":{
      "properties": {
        "title":{
          "type":"text",
          "analyzer":"whitespace"   //指定分词器
        }
      }
    }
  }
}
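To confirm which analyzer a field actually uses, the field-based _analyze call from section 2 can be pointed at this index (test_index2 and title come from the mapping above):

POST /test_index2/_analyze
{
  "field": "title",
  "text": "heLLo World"
}

Since title is mapped to the whitespace analyzer, this should return heLLo and World with their original case preserved.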

6.2 Specifying the analyzer at search time

1. Via the analyzer parameter of the query

POST /test_index/_search
{
  "query": {
    "match": {
      "message": {
        "query": "hello",
        "analyzer": "standard"    //指定分词器
      }
    }
  }
}

2. Via the search_analyzer property in the index mapping

PUT /test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title":{
          "type": "text",
          "analyzer": "whitespace", 
          "search_analyzer": "standard"
        }
      }
    }
  }
}

7. Official documentation

The official documentation is the best reference:

https://www.elastic.co/guide/index.html


Java sample code: https://gitee.com/mengcan/es-demo/tree/master
