Elasticsearch Analyzers

I. ES built-in analyzers

1. Overview

The built-in analyzers only support English tokenization; they do not support Chinese word segmentation.

2. The built-in analyzers

  • standard: the default analyzer; text is split into words and uppercase letters are converted to lowercase.

  • simple: splits on any non-letter character; uppercase is converted to lowercase.

  • whitespace: splits on whitespace only; case is left unchanged.

  • stop: like simple, but additionally removes common English stop words such as the/a/an/is.

  • keyword: performs no tokenization; the entire text is kept as a single token (a side-by-side comparison of all five follows the example below).

    # example JSON body for the _analyze API
    {
        "analyzer": "standard",
        "text": "My name is Peter Parker,I am a Super Hero. I don't like the Criminals."
    }
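
    To see the differences side by side, run the same text through each analyzer. A hand-worked sketch of the expected tokens (abridged to the token strings only):

    # POST 192.168.56.101:9200/_analyze with "text": "The QUICK brown-foxes"
    # "analyzer": "standard"   -> [the, quick, brown, foxes]
    # "analyzer": "simple"     -> [the, quick, brown, foxes]
    # "analyzer": "whitespace" -> [The, QUICK, brown-foxes]
    # "analyzer": "stop"       -> [quick, brown, foxes]
    # "analyzer": "keyword"    -> [The QUICK brown-foxes]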
    

3. Built-in analyzer example

  • Request (POST)

    192.168.56.101:9200/_analyze
    

    Endpoint: _analyze

  • JSON body

    {
        "analyzer": "standard",
        "text": "This is a good job"
    }
    

    Key parameter: "analyzer"
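
    With the standard analyzer, this request should return something like the following (offsets worked out by hand for "This is a good job"; verify against your own cluster):

    {
        "tokens": [
            { "token": "this", "start_offset": 0,  "end_offset": 4,  "type": "<ALPHANUM>", "position": 0 },
            { "token": "is",   "start_offset": 5,  "end_offset": 7,  "type": "<ALPHANUM>", "position": 1 },
            { "token": "a",    "start_offset": 8,  "end_offset": 9,  "type": "<ALPHANUM>", "position": 2 },
            { "token": "good", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 },
            { "token": "job",  "start_offset": 15, "end_offset": 18, "type": "<ALPHANUM>", "position": 4 }
        ]
    }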

II. The IK analyzer

1. Installing the IK analyzer

IK is mainly used for Chinese word segmentation; English is supported as well.

https://github.com/medcl/elasticsearch-analysis-ik

  • Download the release that matches your ES version

  • Upload it to the server where ES runs

  • Unzip it into the plugins directory of the ES installation:

    /usr/local/es/elasticsearch-8.4.3/plugins/ik/

  • Restart ES (a quick verification request is sketched below)
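
    To confirm the plugin was picked up after the restart, list the installed plugins. The node name below is a placeholder; analysis-ik is the component name the plugin registers under:

    GET 192.168.56.101:9200/_cat/plugins

    # the output should include a line like:
    # node-1  analysis-ik  8.4.3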

2. Analyzers provided by IK

  • ik_max_word

  • ik_smart

3. Examples

  • Request (POST)

    Same endpoint as for the built-in analyzers:

    192.168.56.101:9200/_analyze
    
  • JSON body

    Using the ik_max_word analyzer:

    {
        "analyzer": "ik_max_word",
        "text": "上下班车流量很大。"
    }
    
  • Result

    {
        "tokens": [
            {
                "token": "上下班",
                "start_offset": 0,
                "end_offset": 3,
                "type": "CN_WORD",
                "position": 0
            },
            {
                "token": "上下",
                "start_offset": 0,
                "end_offset": 2,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "下班",
                "start_offset": 1,
                "end_offset": 3,
                "type": "CN_WORD",
                "position": 2
            },
            {
                "token": "班车",
                "start_offset": 2,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 3
            },
            {
                "token": "车流量",
                "start_offset": 3,
                "end_offset": 6,
                "type": "CN_WORD",
                "position": 4
            },
            {
                "token": "车流",
                "start_offset": 3,
                "end_offset": 5,
                "type": "CN_WORD",
                "position": 5
            },
            {
                "token": "流量",
                "start_offset": 4,
                "end_offset": 6,
                "type": "CN_WORD",
                "position": 6
            },
            {
                "token": "很大",
                "start_offset": 6,
                "end_offset": 8,
                "type": "CN_WORD",
                "position": 7
            }
        ]
    }
    
  • JSON body

    Using the ik_smart analyzer:

    {
        "analyzer": "ik_smart",
        "text": "上下班车流量很大。"
    }
    
  • Result

    {
        "tokens": [
            {
                "token": "上下班",
                "start_offset": 0,
                "end_offset": 3,
                "type": "CN_WORD",
                "position": 0
            },
            {
                "token": "车流量",
                "start_offset": 3,
                "end_offset": 6,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "很大",
                "start_offset": 6,
                "end_offset": 8,
                "type": "CN_WORD",
                "position": 2
            }
        ]
    }
    

4. Differences between ik_max_word and ik_smart

  • ik_max_word: performs the finest-grained split. For example, it breaks “中华人民共和国国歌” into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”, exhausting every plausible combination; suitable for term queries.

  • ik_smart: performs the coarsest-grained split. For example, it breaks “中华人民共和国国歌” into “中华人民共和国, 国歌”; suitable for phrase queries. A mapping that applies this division of roles is sketched after this list.
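
  A common consequence of this difference is to analyze with ik_max_word at index time (finest split, best recall) and with ik_smart at search time (coarsest split, cleaner matches). A minimal sketch, assuming a hypothetical index named news with a single text field:

    PUT 192.168.56.101:9200/news
    {
        "mappings": {
            "properties": {
                "content": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                }
            }
        }
    }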

5. Custom IK dictionaries

  • Configuration file location:

    /usr/local/es/elasticsearch-8.4.3/plugins/ik/config/IKAnalyzer.cfg.xml

    Adjust the path to match your own installation directory.

  • Edit the configuration:

    
    
    
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer 扩展配置</comment>
        <!-- local extension dictionaries, separated by semicolons -->
        <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
        <!-- local extension stop-word dictionaries -->
        <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
        <!-- remote extension dictionary -->
        <entry key="remote_ext_dict">location</entry>
        <!-- remote extension stop-word dictionary -->
        <entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
    </properties>
    
    
  • Create the custom dictionary (.dic) files

    /usr/local/es/elasticsearch-8.4.3/plugins/ik/config/custom/

    • Create the mydict.dic and single_word_low_freq.dic files, one custom word per line:

      小小小
      小小少年
      测测
      子天

      Restart ES after editing local dictionary files so the new entries are loaded.
  • Test

    Test sentence: 小小小少年测测想成为天子的儿子天下无敌。

    • JSON body

      {
          "analyzer": "ik_max_word",
          "text": "小小小少年测测想成为天子的儿子天下无敌。"
      }
  • Result

    Note that the custom words 小小小, 小小少年, 测测 and 子天 now come back as CN_WORD tokens:
    {
        "tokens": [
            {
                "token": "小小小",
                "start_offset": 0,
                "end_offset": 3,
                "type": "CN_WORD",
                "position": 0
            },
            {
                "token": "小小",
                "start_offset": 0,
                "end_offset": 2,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "小小少年",
                "start_offset": 1,
                "end_offset": 5,
                "type": "CN_WORD",
                "position": 2
            },
            {
                "token": "小小",
                "start_offset": 1,
                "end_offset": 3,
                "type": "CN_WORD",
                "position": 3
            },
            {
                "token": "少年",
                "start_offset": 3,
                "end_offset": 5,
                "type": "CN_WORD",
                "position": 4
            },
            {
                "token": "测测",
                "start_offset": 5,
                "end_offset": 7,
                "type": "CN_WORD",
                "position": 5
            },
            {
                "token": "想成",
                "start_offset": 7,
                "end_offset": 9,
                "type": "CN_WORD",
                "position": 6
            },
            {
                "token": "成为",
                "start_offset": 8,
                "end_offset": 10,
                "type": "CN_WORD",
                "position": 7
            },
            {
                "token": "天子",
                "start_offset": 10,
                "end_offset": 12,
                "type": "CN_WORD",
                "position": 8
            },
            {
                "token": "的",
                "start_offset": 12,
                "end_offset": 13,
                "type": "CN_CHAR",
                "position": 9
            },
            {
                "token": "儿子",
                "start_offset": 13,
                "end_offset": 15,
                "type": "CN_WORD",
                "position": 10
            },
            {
                "token": "子天",
                "start_offset": 14,
                "end_offset": 16,
                "type": "CN_WORD",
                "position": 11
            },
            {
                "token": "天下无敌",
                "start_offset": 15,
                "end_offset": 19,
                "type": "CN_WORD",
                "position": 12
            },
            {
                "token": "天下",
                "start_offset": 15,
                "end_offset": 17,
                "type": "CN_WORD",
                "position": 13
            },
            {
                "token": "无敌",
                "start_offset": 17,
                "end_offset": 19,
                "type": "CN_WORD",
                "position": 14
            }
        ]
    }
