ik
For Chinese word segmentation in es, the mainstream recommendation is ik: it is easy to use, the author keeps it actively maintained, and it is arguably the best Chinese analyzer in the Lucene ecosystem. The text being indexed, however, is usually mixed: not only Chinese, but also English, digits, and assorted symbols. ik handles Chinese very well, but it falls short on letter-and-digit combinations, and terms like model numbers (letters plus digits) are about as common as search input gets.
Here is an example:
curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
"tokenizer" : "ik_max_word",
"text" : "m123-test detailed output 一丝不挂 青丝变白发"
}
'
The result:
{
"tokens" : [
{
"token" : "m123-test",
"start_offset" : 0,
"end_offset" : 9,
"type" : "LETTER",
"position" : 0
},
{
"token" : "m",
"start_offset" : 0,
"end_offset" : 1,
"type" : "ENGLISH",
"position" : 1
},
{
"token" : "123",
"start_offset" : 1,
"end_offset" : 4,
"type" : "ARABIC",
"position" : 2
},
{
"token" : "test",
"start_offset" : 5,
"end_offset" : 9,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "detailed",
"start_offset" : 10,
"end_offset" : 18,
"type" : "ENGLISH",
"position" : 4
},
{
"token" : "output",
"start_offset" : 19,
"end_offset" : 25,
"type" : "ENGLISH",
"position" : 5
},
{
"token" : "一丝不挂",
"start_offset" : 26,
"end_offset" : 30,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "一丝",
"start_offset" : 26,
"end_offset" : 28,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "一",
"start_offset" : 26,
"end_offset" : 27,
"type" : "TYPE_CNUM",
"position" : 8
},
{
"token" : "丝",
"start_offset" : 27,
"end_offset" : 28,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "不挂",
"start_offset" : 28,
"end_offset" : 30,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "挂",
"start_offset" : 29,
"end_offset" : 30,
"type" : "CN_WORD",
"position" : 11
},
{
"token" : "青丝",
"start_offset" : 31,
"end_offset" : 33,
"type" : "CN_WORD",
"position" : 12
},
{
"token" : "丝",
"start_offset" : 32,
"end_offset" : 33,
"type" : "CN_WORD",
"position" : 13
},
{
"token" : "变白",
"start_offset" : 33,
"end_offset" : 35,
"type" : "CN_WORD",
"position" : 14
},
{
"token" : "白发",
"start_offset" : 34,
"end_offset" : 36,
"type" : "CN_WORD",
"position" : 15
},
{
"token" : "发",
"start_offset" : 35,
"end_offset" : 36,
"type" : "CN_WORD",
"position" : 16
}
]
}
Here the letter-and-digit combination m123 gets split into m and 123, so when you search for m123 what actually gets matched are the tokens m and 123; m123 itself is never indexed as a term (only the whole token m123-test is).
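To make the problem concrete, here is a minimal sketch of such a search, assuming a hypothetical index test whose content field is analyzed with ik_max_word; a document that contains only 123 would also match this query:
curl -XGET 'localhost:9200/test/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query" : {
"match" : { "content" : "m123" }
}
}
'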
The tokenizers built into es can handle the letter + digit case; take standard as an example.
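The output below comes from the same _analyze request, with the tokenizer swapped to standard:
curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
"tokenizer" : "standard",
"text" : "m123-test detailed output 一丝不挂 青丝变白发"
}
'
The result: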
{
"tokens" : [
{
"token" : "m123",
"start_offset" : 0,
"end_offset" : 4,
"type" : "",
"position" : 0
},
{
"token" : "test",
"start_offset" : 5,
"end_offset" : 9,
"type" : "",
"position" : 1
},
{
"token" : "detailed",
"start_offset" : 10,
"end_offset" : 18,
"type" : "",
"position" : 2
},
{
"token" : "output",
"start_offset" : 19,
"end_offset" : 25,
"type" : "",
"position" : 3
},
{
"token" : "一",
"start_offset" : 26,
"end_offset" : 27,
"type" : "",
"position" : 4
},
{
"token" : "丝",
"start_offset" : 27,
"end_offset" : 28,
"type" : "",
"position" : 5
},
{
"token" : "不",
"start_offset" : 28,
"end_offset" : 29,
"type" : "",
"position" : 6
},
{
"token" : "挂",
"start_offset" : 29,
"end_offset" : 30,
"type" : "",
"position" : 7
},
{
"token" : "青",
"start_offset" : 31,
"end_offset" : 32,
"type" : "",
"position" : 8
},
{
"token" : "丝",
"start_offset" : 32,
"end_offset" : 33,
"type" : "",
"position" : 9
},
{
"token" : "变",
"start_offset" : 33,
"end_offset" : 34,
"type" : "",
"position" : 10
},
{
"token" : "白",
"start_offset" : 34,
"end_offset" : 35,
"type" : "",
"position" : 11
},
{
"token" : "发",
"start_offset" : 35,
"end_offset" : 36,
"type" : "",
"position" : 12
}
]
}
Now m123 can be found, but this brings a new problem: the Chinese is broken into individual characters, so searching for 一 or even 一挂 will also return results.
You can't have it both ways, it seems. To get both you would either have to change how ik segments text, or add a separate field just for the Chinese (or the non-Chinese) content. I don't know how to do the former, so the latter it is.
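A minimal sketch of the second approach, assuming es 7+ (no mapping types) and a hypothetical index test: the content field keeps ik_max_word for Chinese, while a sub-field content.std is analyzed with standard for the letter-and-digit terms.
curl -XPUT 'localhost:9200/test?pretty' -H 'Content-Type: application/json' -d'
{
  "mappings" : {
    "properties" : {
      "content" : {
        "type" : "text",
        "analyzer" : "ik_max_word",
        "fields" : {
          "std" : {
            "type" : "text",
            "analyzer" : "standard"
          }
        }
      }
    }
  }
}
'
Queries for Chinese phrases can then go to content, while model-number style terms like m123 can be searched against content.std (or both at once with a multi_match).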