es tokenization issues with mixed Chinese, English, and numeric text

ik

For Chinese word segmentation in es, the mainstream recommendation is ik: it is simple to use, its author keeps it actively maintained, and it is arguably the best Chinese analyzer in the Lucene ecosystem. But the text you index is rarely pure Chinese; it also contains English, digits, and assorted symbols. ik segments Chinese very well, yet it falls short on letter-digit combinations, and in real-world use strings like model numbers (letters plus digits) could not be more common.

For example:

curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
 "tokenizer" : "ik_max_word",
 "text" : "m123-test detailed output 一丝不挂 青丝变白发"
}
'

The result:

{
  "tokens" : [
    {
      "token" : "m123-test",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "m",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "123",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "ARABIC",
      "position" : 2
    },
    {
      "token" : "test",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "ENGLISH",
      "position" : 3
    },
    {
      "token" : "detailed",
      "start_offset" : 10,
      "end_offset" : 18,
      "type" : "ENGLISH",
      "position" : 4
    },
    {
      "token" : "output",
      "start_offset" : 19,
      "end_offset" : 25,
      "type" : "ENGLISH",
      "position" : 5
    },
    {
      "token" : "一丝不挂",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "一丝",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "一",
      "start_offset" : 26,
      "end_offset" : 27,
      "type" : "TYPE_CNUM",
      "position" : 8
    },
    {
      "token" : "丝",
      "start_offset" : 27,
      "end_offset" : 28,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "不挂",
      "start_offset" : 28,
      "end_offset" : 30,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "挂",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "青丝",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "丝",
      "start_offset" : 32,
      "end_offset" : 33,
      "type" : "CN_WORD",
      "position" : 13
    },
    {
      "token" : "变白",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "白发",
      "start_offset" : 34,
      "end_offset" : 36,
      "type" : "CN_WORD",
      "position" : 15
    },
    {
      "token" : "发",
      "start_offset" : 35,
      "end_offset" : 36,
      "type" : "CN_WORD",
      "position" : 16
    }
  ]
}

Here the letter-digit combination m123 is split into m and 123, with no standalone m123 token produced, so when you search for m123 you are actually searching for m and 123 separately.

The built-in es tokenizers handle the letters-plus-digits case; take standard as an example:
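The request is the same as before, with the tokenizer swapped to standard:

curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
 "tokenizer" : "standard",
 "text" : "m123-test detailed output 一丝不挂 青丝变白发"
}
'

which returns: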

{
  "tokens" : [
    {
      "token" : "m123",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "test",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "detailed",
      "start_offset" : 10,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "output",
      "start_offset" : 19,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "一",
      "start_offset" : 26,
      "end_offset" : 27,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "丝",
      "start_offset" : 27,
      "end_offset" : 28,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "不",
      "start_offset" : 28,
      "end_offset" : 29,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "挂",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "青",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "丝",
      "start_offset" : 32,
      "end_offset" : 33,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "变",
      "start_offset" : 33,
      "end_offset" : 34,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "白",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "发",
      "start_offset" : 35,
      "end_offset" : 36,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    }
  ]
}

Now m123 can be found, but this brings a new problem: the Chinese text is split into single characters, so even a search for 一挂, two characters that never appear together as a word here, returns results.
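You can see why by running that query string through the same tokenizer; standard splits 一挂 into the single characters 一 and 挂, both of which exist in the index:

curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
 "tokenizer" : "standard",
 "text" : "一挂"
}
'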

You can't have it both ways, alas. To get both, you would either modify how ik tokenizes, or add an extra field analyzed separately for the Chinese (or the non-Chinese) content. I don't know how to do the former, so the latter it is.
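A minimal sketch of the extra-field approach (the index name products and field name title are my own placeholders, and the mapping syntax assumes es 7.x): keep ik_max_word on the main field and add a standard-analyzed sub-field:

curl -XPUT 'localhost:9200/products' -H 'Content-Type: application/json' -d'
{
  "mappings" : {
    "properties" : {
      "title" : {
        "type" : "text",
        "analyzer" : "ik_max_word",
        "fields" : {
          "std" : {
            "type" : "text",
            "analyzer" : "standard"
          }
        }
      }
    }
  }
}
'

At query time, search both fields, so Chinese queries hit the ik-analyzed title with proper word segmentation while model numbers like m123 match through title.std:

curl -XGET 'localhost:9200/products/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query" : {
    "multi_match" : {
      "query" : "m123",
      "fields" : [ "title", "title.std" ]
    }
  }
}
'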
